S2 beta (#1165)

Update for the new S2-Pro model, and more.
---------

Co-authored-by: PoTaTo-Mika <1228427403@qq.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Whale and Dolphin 1 month ago
parent
commit
b72bcb3163

+ 66 - 18
README.md

@@ -34,7 +34,7 @@
     <a target="_blank" href="https://huggingface.co/spaces/TTS-AGI/TTS-Arena-V2">
       <img alt="TTS-Arena2 Score" src="https://img.shields.io/badge/TTS_Arena2-Rank_%231-gold?style=flat-square&logo=trophy&logoColor=white">
     </a>
-    <a target="_blank" href="https://huggingface.co/fishaudio/s2-pro">
+    <a target="_blank" href="https://huggingface.co/fishaudio/s2">
         <img alt="HuggingFace Model" src="https://img.shields.io/badge/🤗%20-models-orange"/>
     </a>
 </div>
@@ -47,32 +47,57 @@
 > **Legal Disclaimer**  
 > We do not hold any responsibility for any illegal usage of the codebase. Please refer to your local laws about DMCA and other related laws.
 
-## Start Here
+## Quick Start
 
-Here are the official documents for Fish Speech, follow the instructions to get started easily.
+### For Humans
+
+Here are the official documents for Fish Audio S2; follow the instructions to get started easily.
 
 - [Installation](https://speech.fish.audio/install/)
-- [Inference](https://speech.fish.audio/inference/)
+- [Command Line Inference](https://speech.fish.audio/inference/#command-line-inference)
+- [WebUI Inference](https://speech.fish.audio/inference/#webui-inference)
+- [Server Inference](https://speech.fish.audio/server/)
+- [Docker Setup](https://speech.fish.audio/install/#docker-setup)
 
-## Fish Audio S2  
-**Best Text-to-speech system among both open source and closed source**
+For the SGLang server, please read the [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md).
+
+### For LLM Agents
+
+```
+Install and configure Fish Audio S2 by following the instructions here: https://speech.fish.audio/install/
+```
 
-Fish Audio S2 is the latest model developed by [Fish Audio](https://fish.audio/), designed to generate speech that sounds natural, realistic, and emotionally rich — not robotic, not flat, and not constrained to studio-style narration.
+## Fish Audio S2  
+**The best text-to-speech system among both open-source and closed-source models**
 
-Fish Audio S2 focuses on daily conversation and dialogue, which enables native multi-speaker and multi-turn generation. Also supports instruction control.
+Fish Audio S2 is the latest model developed by [Fish Audio](https://fish.audio/). Trained on over 10 million hours of audio across approximately 50 languages, S2 combines reinforcement learning alignment with a Dual-Autoregressive architecture to generate speech that sounds natural, realistic, and emotionally rich.
 
-The S2 series contains several models, the open-sourced model is S2-Pro, which is best model in the collection. 
+S2 supports fine-grained inline control of prosody and emotion using natural-language tags like `[laugh]`, `[whispers]`, and `[super happy]`, as well as native multi-speaker and multi-turn generation.
 
-Visit the [Fish Audio website](https://fish.audio/) for live playground.
+Visit the [Fish Audio website](https://fish.audio/) for a live playground. Read the [blog post](https://fish.audio/blog/fish-audio-open-sources-s2/) for more details.
 
 ### Model Variants
 
 | Model | Size | Availability | Description |
 |------|------|-------------|-------------|
-| S2-Pro | 4B parameters | [huggingface](https://huggingface.co/fishaudio/s2-pro) | Full-featured flagship model with maximum quality and stability |
-| S2-Flash | - - - - | [fish.audio](https://fish.audio/) | Our closed source model with faster speed and lower latency |
+| S2-Pro | 4B parameters | [HuggingFace](https://huggingface.co/fishaudio/s2-pro) | Full-featured flagship model with maximum quality and stability | 
+
+More details about the model can be found in the [technical report](https://arxiv.org/abs/2411.01156).
+
+## Benchmark Results
+
+| Benchmark | Fish Audio S2 |
+|------|------|
+| Seed-TTS Eval — WER (Chinese) | **0.54%** (best overall) |
+| Seed-TTS Eval — WER (English) | **0.99%** (best overall) |
+| Audio Turing Test (with instruction) | **0.515** posterior mean |
+| EmergentTTS-Eval — Win Rate | **81.88%** (highest overall) |
+| Fish Instruction Benchmark — TAR | **93.3%** |
+| Fish Instruction Benchmark — Quality | **4.51 / 5.0** |
+| Multilingual (MiniMax Testset) — Best WER | **11 of 24** languages |
+| Multilingual (MiniMax Testset) — Best SIM | **17 of 24** languages |
 
-More details of the model can be found in the technical report.
+On Seed-TTS Eval, S2 achieves the lowest WER among all evaluated models including closed-source systems: Qwen3-TTS (0.77/1.24), MiniMax Speech-02 (0.99/1.90), Seed-TTS (1.12/2.25). On the Audio Turing Test, 0.515 surpasses Seed-TTS (0.417) by 24% and MiniMax-Speech (0.387) by 33%. On EmergentTTS-Eval, S2 achieves particularly strong results in paralinguistics (91.61% win rate), questions (84.41%), and syntactic complexity (83.39%).
 
 ## Highlights
 
@@ -80,11 +105,34 @@ More details of the model can be found in the technical report.
 
 ### Fine-Grained Inline Control via Natural Language
 
-Fish Audio S2 enables localized control over speech generation by embedding natural-language instructions directly at specific word or phrase positions within the text. Rather than relying on a fixed set of predefined tags, S2 accepts free-form textual descriptions — such as [whisper in small voice], [professional broadcast tone], or [pitch up] — allowing open-ended expression control at the word level.
+S2 enables localized control over speech generation by embedding natural-language instructions directly at specific word or phrase positions within the text. Rather than relying on a fixed set of predefined tags, S2 accepts free-form textual descriptions — such as `[whisper in small voice]`, `[professional broadcast tone]`, or `[pitch up]` — allowing open-ended expression control at the word level.
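+
+For example, a single input can mix plain text with inline instructions at exact positions (the sentence below is a hypothetical illustration using the tags described above):
+
+```
+[professional broadcast tone] Welcome back to the evening news. [whisper in small voice] Between us, [laugh] nobody saw this coming.
+```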
+
+### Dual-Autoregressive Architecture
+
+S2 builds on a decoder-only transformer combined with an RVQ-based audio codec (10 codebooks, ~21 Hz frame rate). The Dual-AR architecture splits generation into two stages:
+
+- **Slow AR** operates along the time axis and predicts the primary semantic codebook.
+- **Fast AR** generates the remaining 9 residual codebooks at each time step, reconstructing fine-grained acoustic detail.
+
+This asymmetric design — 4B parameters along the time axis, 400M parameters along the depth axis — keeps inference efficient while preserving audio fidelity.
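+
+As a rough sketch of the decode loop this implies (function names here are illustrative, not the repository's API):
+
+```python
+import torch
+
+def decode_one_frame(slow_ar_step, fast_ar_step, hidden, num_residual_codebooks=9):
+    # Slow AR: one step along the time axis yields the primary semantic token
+    semantic_token, frame_state = slow_ar_step(hidden)
+    codes = [semantic_token]
+    # Fast AR: walk the depth axis, emitting the residual codebooks for this frame
+    for depth in range(num_residual_codebooks):
+        codes.append(fast_ar_step(frame_state, depth))
+    # One ~21 Hz frame = 1 semantic code + 9 residual codes (10 codebooks total)
+    return torch.stack(codes)
+```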
+
+### Reinforcement Learning Alignment
+
+S2 uses Group Relative Policy Optimization (GRPO) for post-training alignment. The same models used to filter and annotate training data are directly reused as reward models during RL — eliminating distribution mismatch between pre-training data and post-training objectives. The reward signal combines semantic accuracy, instruction adherence, acoustic preference scoring, and timbre similarity.
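+
+A minimal sketch of the group-relative advantage at the core of GRPO (simplified; the single score below stands in for the combined semantic, instruction, acoustic, and timbre terms above):
+
+```python
+import torch
+
+def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
+    # rewards: (group_size,) combined reward scores for a group of candidate
+    # generations sampled from the same prompt
+    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)
+```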
+
+### Production Streaming via SGLang
+
+Because the Dual-AR architecture is structurally isomorphic to standard autoregressive LLMs, S2 directly inherits all LLM-native serving optimizations from SGLang — including continuous batching, paged KV cache, CUDA graph replay, and RadixAttention-based prefix caching.
+
+On a single NVIDIA H200 GPU:
+
+- **Real-Time Factor (RTF):** 0.195
+- **Time-to-first-audio:** ~100 ms
+- **Throughput:** 3,000+ acoustic tokens/s while maintaining RTF below 0.5
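+
+Since RTF is compute time divided by audio duration, an RTF of 0.195 means one second of audio takes roughly 195 ms to generate, about 5x faster than real time.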
 
 ### Multilingual Support
 
-Fish Audio S2 supports high-quality multilingual text-to-speech without requiring phonemes or language-specific preprocessing. Including:
+S2 supports high-quality multilingual text-to-speech without requiring phonemes or language-specific preprocessing, including:
 
 **English, Chinese, Japanese, Korean, Arabic, German, French...**
 
@@ -92,19 +140,19 @@ Fish Audio S2 supports high-quality multilingual text-to-speech without requirin
 
 The list is constantly expanding, check [Fish Audio](https://fish.audio/) for the latest releases.
 
-### Native multi-speaker generation
+### Native Multi-Speaker Generation
 
 <img src="./docs/assets/chattemplate.png" width=200%>
 
 Fish Audio S2 allows users to upload reference audio containing multiple speakers; the model handles each speaker's features via the `<|speaker:i|>` token. You can then steer the output with the speaker ID token, allowing a single generation to include multiple speakers. You no longer need to upload reference audio separately for each speaker.
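+
+A hypothetical prompt using the speaker tokens described above (the exact chat-template formatting is illustrative; see the image above):
+
+```
+<|speaker:0|> Hi, did you catch the launch yesterday?
+<|speaker:1|> I did! [super happy] It was amazing.
+<|speaker:0|> [laugh] I knew you would say that.
+```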
 
-### Multi-turn generation
+### Multi-Turn Generation
 
 Thanks to the expanded model context, the model can now use information from earlier turns to improve the expressiveness and naturalness of subsequent generations.
 
 ### Rapid Voice Cloning
 
-Fish Audio S2 supports accurate voice cloning using a short reference sample (typically 10–30 seconds). The model captures timbre, speaking style, and emotional tendencies, producing realistic and consistent cloned voices without additional fine-tuning.
+Fish Audio S2 supports accurate voice cloning using a short reference sample (typically 10–30 seconds). The model captures timbre, speaking style, and emotional tendencies, producing realistic and consistent cloned voices without additional fine-tuning. To use the SGLang server, please refer to https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md.
 
 ---
 

+ 61 - 12
docs/README.ar.md

@@ -34,7 +34,7 @@
     <a target="_blank" href="https://huggingface.co/spaces/TTS-AGI/TTS-Arena-V2">
       <img alt="TTS-Arena2 Score" src="https://img.shields.io/badge/TTS_Arena2-Rank_%231-gold?style=flat-square&logo=trophy&logoColor=white">
     </a>
-    <a target="_blank" href="https://huggingface.co/fishaudio/s2-pro">
+    <a target="_blank" href="https://huggingface.co/fishaudio/s2">
         <img alt="HuggingFace Model" src="https://img.shields.io/badge/🤗%20-models-orange"/>
     </a>
 </div>
@@ -47,32 +47,57 @@
 > **إخلاء المسؤولية القانونية**
 > نحن لا نتحمل أي مسؤولية عن أي استخدام غير قانوني لهذا المشروع. يرجى الرجوع إلى القوانين المحلية المتعلقة بحقوق الطبع والنشر الرقمية (DMCA) والقوانين الأخرى ذات الصلة.
 
-## ابدأ من هنا
+## البدء السريع
 
-هذه الوثائق الرسمية لـ Fish Speech، اتبع التعليمات للبدء بسهولة.
+### ابدأ من الوثائق
+
+هذه هي الوثائق الرسمية لـ Fish Audio S2، ويمكنك البدء مباشرة عبر الروابط التالية:
 
 - [التثبيت](https://speech.fish.audio/ar/install/)
-- [الاستنتاج](https://speech.fish.audio/ar/inference/)
+- [الاستدلال عبر سطر الأوامر](https://speech.fish.audio/ar/inference/)
+- [استدلال WebUI](https://speech.fish.audio/ar/inference/)
+- [الاستدلال عبر الخادم](https://speech.fish.audio/ar/server/)
+- [إعداد Docker](https://speech.fish.audio/ar/install/)
+
+بالنسبة لخادم SGLang، راجع [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md).
+
+### دليل وكلاء LLM
+
+```
+قم بتثبيت وإعداد Fish Audio S2 باتباع التعليمات في https://speech.fish.audio/ar/install/ .
+```
 
 ## Fish Audio S2
 **أفضل نظام لتحويل النص إلى كلام بين الأنظمة مفتوحة المصدر ومغلقة المصدر**
 
-Fish Audio S2 هو أحدث نموذج طورته [Fish Audio](https://fish.audio/)، صُمم لإنتاج كلام يبدو طبيعياً وواقعياً وغنياً بالعواطف — ليس آلياً، ولا مسطحاً، ولا يقتصر على أسلوب السرد في الاستوديوهات.
-
-يركز Fish Audio S2 على المحادثات والحوارات اليومية، مما يتيح توليد أصوات لمتحدثين متعددين وجلسات حوارية متعددة الأدوار بشكل أصلي. كما يدعم التحكم عبر التعليمات.
+Fish Audio S2 هو أحدث نموذج من [Fish Audio](https://fish.audio/). تم تدريبه على أكثر من 10 ملايين ساعة صوتية عبر نحو 50 لغة، ويجمع بين المواءمة بالتعلم المعزز وبنية Dual-Autoregressive لإنتاج كلام طبيعي وواقعي وغني بالتعبير العاطفي.
 
-تحتوي سلسلة S2 على نماذج متعددة، النموذج مفتوح المصدر هو S2-Pro، وهو الأفضل في المجموعة.
+يدعم S2 التحكم الدقيق في النبرة والعاطفة داخل النص نفسه باستخدام وسوم باللغة الطبيعية مثل `[laugh]` و`[whispers]` و`[super happy]`، كما يدعم بشكل أصيل توليد متحدثين متعددين وحوارات متعددة الأدوار.
 
-تفضل بزيارة [موقع Fish Audio](https://fish.audio/) لتجربة مباشرة.
+يمكنك تجربة النموذج مباشرة عبر [موقع Fish Audio](https://fish.audio/)، وقراءة المزيد في [منشور المدونة](https://fish.audio/blog/fish-audio-open-sources-s2/).
 
 ### إصدارات النموذج
 
 | النموذج | الحجم | التوفر | الوصف |
 |------|------|-------------|-------------|
-| S2-Pro | 4B معايير | [huggingface](https://huggingface.co/fishaudio/s2-pro) | نموذج رائد كامل الميزات بأعلى جودة واستقرار |
-| S2-Flash | - - - - | [fish.audio](https://fish.audio/) | نموذجنا مغلق المصدر بسرعة أكبر وتأخير أقل |
+| S2-Pro | 4B معلمة | [HuggingFace](https://huggingface.co/fishaudio/s2-pro) | نموذج رائد كامل الميزات بأعلى مستوى من الجودة والاستقرار |
+
+يمكن العثور على مزيد من التفاصيل في [التقرير التقني](https://arxiv.org/abs/2411.01156).
+
+## نتائج القياس المعياري
+
+| المعيار | Fish Audio S2 |
+|------|------|
+| Seed-TTS Eval — WER (الصينية) | **0.54%** (الأفضل إجمالاً) |
+| Seed-TTS Eval — WER (الإنجليزية) | **0.99%** (الأفضل إجمالاً) |
+| Audio Turing Test (مع التعليمات) | **0.515** المتوسط البعدي |
+| EmergentTTS-Eval — معدل الفوز | **81.88%** (الأعلى إجمالاً) |
+| Fish Instruction Benchmark — TAR | **93.3%** |
+| Fish Instruction Benchmark — الجودة | **4.51 / 5.0** |
+| متعدد اللغات (MiniMax Testset) — أفضل WER | **11 من 24** لغة |
+| متعدد اللغات (MiniMax Testset) — أفضل SIM | **17 من 24** لغة |
 
-يمكن العثور على مزيد من التفاصيل حول النموذج في التقرير التقني.
+في Seed-TTS Eval، حقق S2 أقل WER بين جميع النماذج التي تم تقييمها، بما في ذلك الأنظمة المغلقة: Qwen3-TTS ‏(0.77/1.24)، وMiniMax Speech-02 ‏(0.99/1.90)، وSeed-TTS ‏(1.12/2.25). وفي Audio Turing Test، تفوقت قيمة 0.515 على Seed-TTS ‏(0.417) بنسبة 24% وعلى MiniMax-Speech ‏(0.387) بنسبة 33%. وفي EmergentTTS-Eval، حقق S2 نتائج قوية بشكل خاص في الخصائص شبه اللغوية (91.61%)، والأسئلة (84.41%)، والتعقيد النحوي (83.39%).
 
 ## أبرز المميزات
 
@@ -82,6 +107,29 @@ Fish Audio S2 هو أحدث نموذج طورته [Fish Audio](https://fish.audi
 
 يتيح Fish Audio S2 تحكمًا موضعيًا في توليد الكلام من خلال تضمين تعليمات باللغة الطبيعية مباشرة عند مواقع كلمات أو عبارات محددة داخل النص. وبدلًا من الاعتماد على مجموعة ثابتة من الوسوم المُعرّفة مسبقًا، يقبل S2 أوصافًا نصية حرة مثل [whisper in small voice] أو [professional broadcast tone] أو [pitch up]، مما يتيح تحكمًا مفتوحًا في التعبير على مستوى الكلمة.
 
+### بنية Dual-Autoregressive
+
+يعتمد S2 على Transformer أحادي الاتجاه (Decoder-only) مع مُرمّز صوتي قائم على RVQ (عدد 10 codebooks وبمعدل إطارات يقارب 21 هرتز). وتُقسّم بنية Dual-AR عملية التوليد إلى مرحلتين:
+
+- **Slow AR** يعمل على المحور الزمني ويتنبأ بالـ semantic codebook الأساسي.
+- **Fast AR** يولّد الـ 9 residual codebooks المتبقية في كل خطوة زمنية لإعادة بناء التفاصيل الصوتية الدقيقة.
+
+هذا التصميم غير المتماثل (4B معلمة على المحور الزمني و400M على محور العمق) يرفع كفاءة الاستدلال مع الحفاظ على جودة الصوت.
+
+### المواءمة بالتعلم المعزز
+
+يستخدم S2 خوارزمية Group Relative Policy Optimization (GRPO) للمواءمة بعد التدريب. ويتم إعادة استخدام نفس النماذج التي استُخدمت لتصفية بيانات التدريب وتعليقها كنماذج مكافأة في التعلم المعزز مباشرة، مما يلغي عدم تطابق التوزيع بين بيانات ما قبل التدريب وأهداف ما بعد التدريب. وتجمع إشارة المكافأة بين الدقة الدلالية، والالتزام بالتعليمات، وتقييم التفضيل الصوتي، وتشابه النبرة.
+
+### البث الإنتاجي عبر SGLang
+
+لأن بنية Dual-AR متماثلة بنيويًا مع نماذج LLM autoregressive القياسية، فإن S2 يرث مباشرة تحسينات الخدمة الأصلية في SGLang، بما في ذلك: continuous batching، وpaged KV cache، وCUDA graph replay، وprefix caching المعتمد على RadixAttention.
+
+على بطاقة NVIDIA H200 واحدة:
+
+- **عامل الزمن الحقيقي (RTF):** 0.195
+- **الزمن حتى أول مقطع صوتي:** حوالي 100 مللي ثانية
+- **معدل المعالجة:** أكثر من 3,000 acoustic tokens/s مع الحفاظ على RTF أقل من 0.5
+
 ### دعم لغات متعددة
 
 يدعم Fish Audio S2 تحويل النص إلى كلام بجودة عالية ولغات متعددة دون الحاجة إلى رموز صوتية أو معالجة مسبقة خاصة بكل لغة. بما في ذلك:
@@ -105,6 +153,7 @@ Fish Audio S2 هو أحدث نموذج طورته [Fish Audio](https://fish.audi
 ### استنساخ صوت سريع
 
 يدعم Fish Audio S2 استنساخ الصوت بدقة باستخدام عينة مرجعية قصيرة (عادةً 10-30 ثانية). يلتقط النموذج نبرة الصوت، وأسلوب التحدث، والميول العاطفية، مما ينتج أصواتاً مستنسخة واقعية ومتسقة دون الحاجة إلى ضبط دقيق إضافي.
+لاستخدام خادم SGLang، راجع https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md .
 
 ---
 

+ 61 - 12
docs/README.ja.md

@@ -34,7 +34,7 @@
     <a target="_blank" href="https://huggingface.co/spaces/TTS-AGI/TTS-Arena-V2">
       <img alt="TTS-Arena2 Score" src="https://img.shields.io/badge/TTS_Arena2-Rank_%231-gold?style=flat-square&logo=trophy&logoColor=white">
     </a>
-    <a target="_blank" href="https://huggingface.co/fishaudio/s2-pro">
+    <a target="_blank" href="https://huggingface.co/fishaudio/s2">
         <img alt="HuggingFace Model" src="https://img.shields.io/badge/🤗%20-models-orange"/>
     </a>
 </div>
@@ -47,32 +47,57 @@
 > **法的免責事項**
 > 私たちはコードベースの不法な使用について一切の責任を負いません。DMCA 及びその他の関連法律について、現地の法律をご参照ください。
 
-## ここから始める
+## クイックスタート
 
-こちらは Fish Speech の公式ドキュメントです。手順に従って簡単に始めることができます。
+### まずはドキュメントから
+
+Fish Audio S2 の公式ドキュメントです。以下からすぐに始められます。
 
 - [インストール](https://speech.fish.audio/ja/install/)
-- [推論](https://speech.fish.audio/ja/inference/)
+- [コマンドライン推論](https://speech.fish.audio/ja/inference/)
+- [WebUI 推論](https://speech.fish.audio/ja/inference/)
+- [サーバー推論](https://speech.fish.audio/ja/server/)
+- [Docker セットアップ](https://speech.fish.audio/ja/install/)
+
+SGLang サーバーについては [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md) を参照してください。
+
+### LLM Agent 向け
+
+```
+https://speech.fish.audio/ja/install/ の手順に従って、Fish Audio S2 をインストール・設定してください。
+```
 
 ## Fish Audio S2
 **オープンソースおよびクローズドソースの中で最も優れたテキスト読み上げシステム**
 
-Fish Audio S2 は、[Fish Audio](https://fish.audio/) によって開発された最新のモデルで、自然でリアル、かつ感情豊かな音声を生成するように設計されています。ロボット的ではなく、平坦でもなく、スタジオスタイルのナレーションに限定されません。
-
-Fish Audio S2 は日常の会話に焦点を当てており、ネイティブなマルチスピーカーおよびマルチターンの生成をサポートしています。また、命令制御もサポートしています。
+Fish Audio S2 は [Fish Audio](https://fish.audio/) が開発した最新モデルです。約 50 言語・1,000 万時間超の音声データで学習され、強化学習アラインメントと Dual-Autoregressive アーキテクチャを組み合わせることで、自然でリアルかつ感情表現豊かな音声を生成します。
 
-S2 シリーズには複数のモデルが含まれており、オープンソースモデルは S2-Pro で、このシリーズの中で最もパフォーマンスの高いモデルです。
+S2 は `[laugh]`、`[whispers]`、`[super happy]` といった自然言語タグで、韻律や感情を文中の任意位置で細かく制御できます。さらに、マルチスピーカー生成とマルチターン生成にもネイティブ対応しています。
 
-リアルタイムのエクスペリエンスについては、[Fish Audio Webサイト](https://fish.audio/) にアクセスしてください。
+ライブデモは [Fish Audio ウェブサイト](https://fish.audio/) から、詳細は [ブログ記事](https://fish.audio/blog/fish-audio-open-sources-s2/) をご覧ください。
 
 ### モデルバリアント
 
 | モデル | サイズ | 利用可能性 | 説明 |
 |------|------|-------------|-------------|
-| S2-Pro | 4B パラメータ | [huggingface](https://huggingface.co/fishaudio/s2-pro) | 最高の品質と安定性を備えた、フル機能のフラッグシップモデル |
-| S2-Flash | - - - - | [fish.audio](https://fish.audio/) | より高速で低遅延なクローズドソースモデル |
+| S2-Pro | 4B パラメータ | [HuggingFace](https://huggingface.co/fishaudio/s2-pro) | 品質と安定性を最大化したフル機能のフラッグシップモデル |
+
+モデルの詳細は[技術レポート](https://arxiv.org/abs/2411.01156)をご参照ください。
+
+## ベンチマーク結果
+
+| ベンチマーク | Fish Audio S2 |
+|------|------|
+| Seed-TTS Eval — WER(中国語) | **0.54%**(全体最良) |
+| Seed-TTS Eval — WER(英語) | **0.99%**(全体最良) |
+| Audio Turing Test(指示あり) | **0.515** 事後平均値 |
+| EmergentTTS-Eval — 勝率 | **81.88%**(全体最高) |
+| Fish Instruction Benchmark — TAR | **93.3%** |
+| Fish Instruction Benchmark — 品質 | **4.51 / 5.0** |
+| 多言語(MiniMax Testset)— 最良 WER | **24 言語中 11 言語** |
+| 多言語(MiniMax Testset)— 最良 SIM | **24 言語中 17 言語** |
 
-モデルの詳細については、技術レポートを参照してください。
+Seed-TTS Eval では、S2 はクローズドソースを含む全評価モデルの中で最小 WER を達成しました:Qwen3-TTS(0.77/1.24)、MiniMax Speech-02(0.99/1.90)、Seed-TTS(1.12/2.25)。Audio Turing Test では 0.515 を記録し、Seed-TTS(0.417)比で 24%、MiniMax-Speech(0.387)比で 33% 上回りました。EmergentTTS-Eval では、副言語情報(91.61%)、疑問文(84.41%)、統語的複雑性(83.39%)で特に高い成績を示しています。
 
 ## ハイライト
 
@@ -82,6 +107,29 @@ S2 シリーズには複数のモデルが含まれており、オープンソ
 
 Fish Audio S2 では、テキスト内の特定の単語やフレーズ位置に自然言語の指示を直接埋め込むことで、音声生成を局所的に制御できます。固定の事前定義タグに依存するのではなく、S2 は [whisper in small voice]、[professional broadcast tone]、[pitch up] のような自由形式のテキスト記述を受け付け、単語レベルで表現をオープンエンドに制御できます。
 
+### 二重自己回帰(Dual-Autoregressive)アーキテクチャ
+
+S2 はデコーダー専用 Transformer と RVQ ベースの音声コーデック(10 codebooks、約 21 Hz)を組み合わせています。Dual-AR は生成を 2 段階に分割します。
+
+- **Slow AR** は時間軸方向に動作し、主となる semantic codebook を予測。
+- **Fast AR** は各時刻で残り 9 個の residual codebook を生成し、細かな音響ディテールを復元。
+
+この非対称設計(時間軸 4B パラメータ、深さ軸 400M パラメータ)により、音質を保ちながら推論効率を高めています。
+
+### 強化学習アラインメント
+
+S2 は後学習アラインメントに Group Relative Policy Optimization(GRPO)を採用しています。学習データのフィルタリングとアノテーションに使った同一モデル群を、そのまま RL の報酬モデルとして再利用することで、事前学習データ分布と事後学習目的のミスマッチを抑制しています。報酬信号には、意味的正確性、指示追従性、音響的選好スコア、音色類似度が含まれます。
+
+### SGLang による本番向けストリーミング
+
+Dual-AR は構造的に標準的な自己回帰 LLM と同型のため、S2 は SGLang の LLM 向け最適化をそのまま活用できます。たとえば continuous batching、paged KV cache、CUDA graph replay、RadixAttention ベースの prefix caching です。
+
+単一の NVIDIA H200 GPU での実測:
+
+- **RTF(Real-Time Factor):** 0.195
+- **初回音声出力までの時間:** 約 100 ms
+- **スループット:** RTF 0.5 未満を維持しつつ 3,000+ acoustic tokens/s
+
 ### 多言語サポート
 
 Fish Audio S2 は、音素や言語固有の前処理を必要とせずに、高品質な多言語テキスト読み上げをサポートします。以下を含みます:
@@ -105,6 +153,7 @@ Fish Audio S2 では、ユーザーが複数のスピーカーを含む参照オ
 ### 高速音声クローニング
 
 Fish Audio S2 は、短い参照サンプル(通常10〜30秒)を使用した正確な音声クローニングをサポートしています。モデルは音色、話し方、感情的な傾向を捉え、追加の微調整なしでリアルで一貫したクローン音声を生成します。
+SGLang サーバーの利用については https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md を参照してください。
 
 ---
 

+ 61 - 12
docs/README.ko.md

@@ -34,7 +34,7 @@
     <a target="_blank" href="https://huggingface.co/spaces/TTS-AGI/TTS-Arena-V2">
       <img alt="TTS-Arena2 Score" src="https://img.shields.io/badge/TTS_Arena2-Rank_%231-gold?style=flat-square&logo=trophy&logoColor=white">
     </a>
-    <a target="_blank" href="https://huggingface.co/fishaudio/s2-pro">
+    <a target="_blank" href="https://huggingface.co/fishaudio/s2">
         <img alt="HuggingFace Model" src="https://img.shields.io/badge/🤗%20-models-orange"/>
     </a>
 </div>
@@ -47,32 +47,57 @@
 > **법적 면책조항**
 > 저희는 코드베이스의 불법적인 사용에 대해 어떠한 책임도 지지 않습니다. DMCA 및 기타 관련 법률에 대한 현지 법률을 참조하세요.
 
-## 여기서 시작하세요
+## 빠른 시작
 
-여기는 Fish Speech의 공식 문서입니다. 지침을 따라 쉽게 시작하세요.
+### 문서로 바로 시작하기
+
+Fish Audio S2 공식 문서입니다. 아래 링크에서 바로 시작할 수 있습니다.
 
 - [설치](https://speech.fish.audio/ko/install/)
-- [추론](https://speech.fish.audio/ko/inference/)
+- [커맨드라인 추론](https://speech.fish.audio/ko/inference/)
+- [WebUI 추론](https://speech.fish.audio/ko/inference/)
+- [서버 추론](https://speech.fish.audio/ko/server/)
+- [Docker 설정](https://speech.fish.audio/ko/install/)
+
+SGLang 서버는 [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md)를 참고하세요.
+
+### LLM Agent 가이드
+
+```
+https://speech.fish.audio/ko/install/ 문서를 따라 Fish Audio S2를 설치하고 구성하세요.
+```
 
 ## Fish Audio S2
 **오픈 소스와 클로즈드 소스 모두에서 가장 뛰어난 텍스트 음성 변환 시스템**
 
-Fish Audio S2는 [Fish Audio](https://fish.audio/)가 개발한 최신 모델로, 자연스럽고 사실적이며 감정적으로 풍부한 음성을 생성하도록 설계되었습니다. 로봇 같지 않고, 평평하지 않으며, 스튜디오 스타일의 내레이션에 제한되지 않습니다.
-
-Fish Audio S2는 일상적인 대화와 대화에 집중하여 네이티브 멀티 화자 및 멀티 턴 생성을 가능하게 합니다. 또한 명령 제어도 지원합니다.
+Fish Audio S2는 [Fish Audio](https://fish.audio/)가 개발한 최신 모델입니다. 약 50개 언어, 1,000만 시간 이상의 오디오 데이터로 학습되었고, 강화학습 정렬과 Dual-Autoregressive 아키텍처를 결합해 자연스럽고 사실적이며 감정 표현이 풍부한 음성을 생성합니다.
 
-S2 시리즈에는 여러 모델이 포함되어 있으며, 오픈 소스 모델은 S2-Pro로 컬렉션 중 최고의 모델입니다.
+S2는 `[laugh]`, `[whispers]`, `[super happy]` 같은 자연어 태그를 사용해 운율과 감정을 문장 내부에서 세밀하게 제어할 수 있으며, 멀티 화자/멀티 턴 생성도 네이티브로 지원합니다.
 
-실시간 체험을 위해 [Fish Audio 웹사이트](https://fish.audio/)를 방문하세요.
+실시간 데모는 [Fish Audio 웹사이트](https://fish.audio/)에서, 자세한 내용은 [블로그 글](https://fish.audio/blog/fish-audio-open-sources-s2/)에서 확인할 수 있습니다.
 
 ### 모델 변형
 
 | 모델 | 크기 | 가용성 | 설명 |
 |------|------|-------------|-------------|
-| S2-Pro | 4B 매개변수 | [huggingface](https://huggingface.co/fishaudio/s2-pro) | 최고의 품질과 안정성을 갖춘 전체 기능 플래그십 모델 |
-| S2-Flash | - - - - | [fish.audio](https://fish.audio/) | 더 빠른 속도와 더 낮은 지연 시간을 가진 클로즈드 소스 모델 |
+| S2-Pro | 4B 매개변수 | [HuggingFace](https://huggingface.co/fishaudio/s2-pro) | 최고 수준의 품질과 안정성을 제공하는 풀기능 플래그십 모델 |
+
+모델 상세는 [기술 보고서](https://arxiv.org/abs/2411.01156)를 참고하세요.
+
+## 벤치마크 결과
+
+| 벤치마크 | Fish Audio S2 |
+|------|------|
+| Seed-TTS Eval — WER (중국어) | **0.54%** (전체 최고) |
+| Seed-TTS Eval — WER (영어) | **0.99%** (전체 최고) |
+| Audio Turing Test (지시 포함) | **0.515** 사후 평균 |
+| EmergentTTS-Eval — 승률 | **81.88%** (전체 최고) |
+| Fish Instruction Benchmark — TAR | **93.3%** |
+| Fish Instruction Benchmark — 품질 | **4.51 / 5.0** |
+| 다국어 (MiniMax Testset) — 최고 WER | **24개 언어 중 11개** |
+| 다국어 (MiniMax Testset) — 최고 SIM | **24개 언어 중 17개** |
 
-모델에 대한 자세한 내용은 기술 보고서를 참조하십시오.
+Seed-TTS Eval에서 S2는 클로즈드 소스 시스템을 포함한 전체 비교 모델 중 가장 낮은 WER를 기록했습니다: Qwen3-TTS (0.77/1.24), MiniMax Speech-02 (0.99/1.90), Seed-TTS (1.12/2.25). Audio Turing Test에서는 0.515를 기록해 Seed-TTS (0.417) 대비 24%, MiniMax-Speech (0.387) 대비 33% 높았습니다. EmergentTTS-Eval에서는 파라언어 표현(91.61%), 의문문(84.41%), 구문 복잡도(83.39%)에서 특히 강한 성능을 보였습니다.
 
 ## 주요 특징
 
@@ -82,6 +107,29 @@ S2 시리즈에는 여러 모델이 포함되어 있으며, 오픈 소스 모델
 
 Fish Audio S2는 텍스트의 특정 단어 또는 구문 위치에 자연어 지시를 직접 삽입해 음성 생성을 국소적으로 제어할 수 있습니다. 고정된 사전 정의 태그에 의존하는 대신, S2는 [whisper in small voice], [professional broadcast tone], [pitch up] 같은 자유 형식 텍스트 설명을 받아 단어 수준의 개방형 표현 제어를 지원합니다.
 
+### Dual-Autoregressive 아키텍처
+
+S2는 decoder-only Transformer와 RVQ 기반 오디오 코덱(10 codebooks, 약 21 Hz 프레임레이트)을 결합합니다. Dual-AR은 생성 과정을 두 단계로 나눕니다.
+
+- **Slow AR**: 시간축을 따라 동작하며 주 semantic codebook을 예측
+- **Fast AR**: 각 시점에서 나머지 9개 residual codebook을 생성해 세밀한 음향 디테일을 복원
+
+이 비대칭 설계(시간축 4B 파라미터, 깊이축 400M 파라미터)는 음질을 유지하면서 추론 효율을 높입니다.
+
+### 강화학습 정렬
+
+S2는 후학습 정렬을 위해 Group Relative Policy Optimization(GRPO)을 사용합니다. 학습 데이터 필터링/라벨링에 쓰인 동일한 모델을 RL 보상 모델로 재사용해, 사전학습 데이터 분포와 후학습 목표 간의 분포 불일치를 줄였습니다. 보상 신호는 의미 정확도, 지시 준수도, 음향 선호 점수, 음색 유사도를 함께 반영합니다.
+
+### SGLang 기반 프로덕션 스트리밍
+
+Dual-AR 구조는 표준 자기회귀 LLM과 구조적으로 동형이기 때문에, S2는 SGLang의 LLM 서빙 최적화를 그대로 활용합니다. 예: continuous batching, paged KV cache, CUDA graph replay, RadixAttention 기반 prefix caching.
+
+NVIDIA H200 단일 GPU 기준:
+
+- **실시간 계수(RTF):** 0.195
+- **첫 오디오 출력까지 시간:** 약 100 ms
+- **처리량:** RTF 0.5 미만 유지 시 3,000+ acoustic tokens/s
+
 ### 다국어 지원
 
 Fish Audio S2는 음소나 언어별 전처리 없이 고품질 다국어 텍스트 음성 변환을 지원합니다. 포함 사항:
@@ -105,6 +153,7 @@ Fish Audio S2는 사용자가 여러 화자가 포함된 참조 오디오를 업
 ### 빠른 음성 복제
 
 Fish Audio S2는 짧은 참조 샘플(일반적으로 10-30초)을 사용하여 정확한 음성 복제를 지원합니다. 모델은 음색, 말하기 스타일 및 감정적 경향을 캡처하여 추가 미세 조정 없이 사실적이고 일관된 복제 음성을 생성합니다.
+SGLang 서버 사용은 https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md 를 참고하세요.
 
 ---
 

+ 62 - 13
docs/README.pt-BR.md

@@ -34,7 +34,7 @@
     <a target="_blank" href="https://huggingface.co/spaces/TTS-AGI/TTS-Arena-V2">
       <img alt="TTS-Arena2 Score" src="https://img.shields.io/badge/TTS_Arena2-Rank_%231-gold?style=flat-square&logo=trophy&logoColor=white">
     </a>
-    <a target="_blank" href="https://huggingface.co/fishaudio/s2-pro">
+    <a target="_blank" href="https://huggingface.co/fishaudio/s2">
         <img alt="HuggingFace Model" src="https://img.shields.io/badge/🤗%20-models-orange"/>
     </a>
 </div>
@@ -47,32 +47,57 @@
 > **Isenção de Responsabilidade Legal**
 > Não nos responsabilizamos por qualquer uso ilegal do repositório. Consulte as leis locais sobre DMCA e outras leis relacionadas.
 
-## Comece Aqui
+## Início Rápido
 
-Aqui estão os documentos oficiais do Fish Speech, siga as instruções para começar facilmente.
+### Comece pela documentação
+
+Esta é a documentação oficial do Fish Audio S2. Você pode começar por aqui:
 
 - [Instalação](https://speech.fish.audio/pt/install/)
-- [Inferência](https://speech.fish.audio/pt/inference/)
+- [Inferência por Linha de Comando](https://speech.fish.audio/pt/inference/)
+- [Inferência WebUI](https://speech.fish.audio/pt/inference/)
+- [Inferência via Servidor](https://speech.fish.audio/pt/server/)
+- [Configuração Docker](https://speech.fish.audio/pt/install/)
+
+Para servidor com SGLang, consulte o [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md).
+
+### Guia para agentes LLM
+
+```
+Instale e configure o Fish Audio S2 seguindo as instruções em https://speech.fish.audio/pt/install/ .
+```
 
 ## Fish Audio S2
 **O melhor sistema de conversão de texto em fala entre código aberto e código fechado**
 
-O Fish Audio S2 é o modelo mais recente desenvolvido pela [Fish Audio](https://fish.audio/), projetado para gerar falas que soam naturais, realistas e emocionalmente ricas — não robóticas, não monótonas e não limitadas à narração em estilo de estúdio.
-
-O Fish Audio S2 foca em conversas diárias e diálogos, o que permite a geração nativa de múltiplos falantes e turnos. Também suporta controle por instrução.
+O Fish Audio S2 é o modelo mais recente da [Fish Audio](https://fish.audio/). Treinado com mais de 10 milhões de horas de áudio em cerca de 50 idiomas, o S2 combina alinhamento por reforço com uma arquitetura Dual-Autoregressive para gerar fala natural, realista e emocionalmente expressiva.
 
-A série S2 contém vários modelos, o modelo de código aberto é o S2-Pro, que é o melhor modelo da coleção.
+O S2 permite controle fino de prosódia e emoção dentro da própria frase com tags em linguagem natural, como `[laugh]`, `[whispers]` e `[super happy]`, além de oferecer suporte nativo a múltiplos falantes e múltiplos turnos.
 
-Visite o [site da Fish Audio](https://fish.audio/) para um playground ao vivo.
+Acesse o [site da Fish Audio](https://fish.audio/) para testar ao vivo e leia o [post no blog](https://fish.audio/blog/fish-audio-open-sources-s2/) para mais detalhes.
 
 ### Variantes do Modelo
 
 | Modelo | Tamanho | Disponibilidade | Descrição |
 |------|------|-------------|-------------|
-| S2-Pro | 4B parâmetros | [huggingface](https://huggingface.co/fishaudio/s2-pro) | Modelo carro-chefe completo com máxima qualidade e estabilidade |
-| S2-Flash | - - - - | [fish.audio](https://fish.audio/) | Nosso modelo de código fechado com maior velocidade e menor latência |
+| S2-Pro | 4B parâmetros | [HuggingFace](https://huggingface.co/fishaudio/s2-pro) | Modelo carro-chefe completo com máxima qualidade e estabilidade |
+
+Mais detalhes podem ser encontrados no [relatório técnico](https://arxiv.org/abs/2411.01156).
+
+## Resultados de Benchmark
+
+| Benchmark | Fish Audio S2 |
+|------|------|
+| Seed-TTS Eval — WER (Chinês) | **0.54%** (melhor geral) |
+| Seed-TTS Eval — WER (Inglês) | **0.99%** (melhor geral) |
+| Audio Turing Test (com instrução) | **0.515** média a posteriori |
+| EmergentTTS-Eval — Taxa de vitória | **81.88%** (maior geral) |
+| Fish Instruction Benchmark — TAR | **93.3%** |
+| Fish Instruction Benchmark — Qualidade | **4.51 / 5.0** |
+| Multilíngue (MiniMax Testset) — Melhor WER | **11 de 24** idiomas |
+| Multilíngue (MiniMax Testset) — Melhor SIM | **17 de 24** idiomas |
 
-Mais detalhes do modelo podem ser encontrados no relatório técnico.
+No Seed-TTS Eval, o S2 obteve o menor WER entre todos os modelos avaliados, incluindo sistemas fechados: Qwen3-TTS (0.77/1.24), MiniMax Speech-02 (0.99/1.90) e Seed-TTS (1.12/2.25). No Audio Turing Test, o valor 0.515 supera o Seed-TTS (0.417) em 24% e o MiniMax-Speech (0.387) em 33%. No EmergentTTS-Eval, o S2 se destacou especialmente em paralinguística (91.61%), perguntas (84.41%) e complexidade sintática (83.39%).
 
 ## Destaques
 
@@ -82,6 +107,29 @@ Mais detalhes do modelo podem ser encontrados no relatório técnico.
 
 O Fish Audio S2 permite controle localizado da geração de fala ao incorporar instruções em linguagem natural diretamente em posições específicas de palavras ou frases no texto. Em vez de depender de um conjunto fixo de tags predefinidas, o S2 aceita descrições textuais livres, como [whisper in small voice], [professional broadcast tone] ou [pitch up], permitindo controle de expressão aberto no nível da palavra.
 
+### Arquitetura Dual-Autoregressive
+
+O S2 é baseado em um transformer apenas decodificador, combinado com um codec de áudio RVQ (10 codebooks, ~21 Hz de taxa de quadros). A arquitetura Dual-AR divide a geração em duas etapas:
+
+- **Slow AR** opera no eixo temporal e prevê o codebook semântico principal.
+- **Fast AR** gera os 9 codebooks residuais restantes em cada passo de tempo, reconstruindo detalhes acústicos finos.
+
+Esse desenho assimétrico (4B parâmetros no eixo temporal e 400M no eixo de profundidade) mantém a inferência eficiente sem sacrificar fidelidade de áudio.
+
+### Alinhamento por Reforço
+
+O S2 usa Group Relative Policy Optimization (GRPO) no pós-treinamento. Os mesmos modelos usados para filtrar e anotar dados de treino são reutilizados diretamente como modelos de recompensa no RL, eliminando o desalinhamento de distribuição entre os dados de pré-treinamento e os objetivos de pós-treinamento. O sinal de recompensa combina precisão semântica, aderência à instrução, preferência acústica e similaridade de timbre.
+
+### Streaming em Produção com SGLang
+
+Como a arquitetura Dual-AR é estruturalmente isomórfica a LLMs autoregressivos padrão, o S2 herda diretamente as otimizações nativas de serving do SGLang, incluindo continuous batching, paged KV cache, CUDA graph replay e prefix caching com RadixAttention.
+
+Em uma única NVIDIA H200:
+
+- **RTF (Real-Time Factor):** 0.195
+- **Tempo até o primeiro áudio:** ~100 ms
+- **Throughput:** mais de 3.000 acoustic tokens/s mantendo RTF abaixo de 0.5
+
 ### Suporte Multilíngue
 
 O Fish Audio S2 oferece suporte a conversão de texto em fala multilíngue de alta qualidade sem a necessidade de fonemas ou processamento específico de idioma. Incluindo:
@@ -96,7 +144,7 @@ A lista está em constante expansão, verifique o [Fish Audio](https://fish.audi
 
 <img src="./assets/chattemplate.png" width=200%>
 
-O Fish Audio S2 permite que os usuários carreguem áudio de referência com vários falantes; o modelo lidará com as características de cada falante por meio do token `<|speaker:i|>`. Então, você pode controlar o desempenho do modelo com the token de ID do falante, permitindo que uma única geração inclua vários falantes. Você não precisa mais carregar áudios de referência separadamente para cada falante.
+O Fish Audio S2 permite enviar um áudio de referência com vários falantes; o modelo processa as características de cada voz por meio do token `<|speaker:i|>`. Depois, você controla o comportamento do modelo com o token de ID do falante, permitindo incluir várias vozes em uma única geração. Assim, não é mais necessário subir um áudio de referência separado para cada falante.
 
 ### Geração de Múltiplos Turnos
 
@@ -105,6 +153,7 @@ Graças à extensão do contexto do modelo, nosso modelo agora pode usar informa
 ### Clonagem de Voz Rápida
 
 O Fish Audio S2 suporta clonagem de voz precisa usando uma pequena amostra de referência (tipicamente de 10 a 30 segundos). O modelo captura o timbre, o estilo de fala e as tendências emocionais, produzindo vozes clonadas realistas e consistentes sem ajuste fino adicional.
+Para usar o servidor SGLang, consulte https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md .
 
 ---
 

+ 62 - 13
docs/README.zh.md

@@ -34,7 +34,7 @@
     <a target="_blank" href="https://huggingface.co/spaces/TTS-AGI/TTS-Arena-V2">
       <img alt="TTS-Arena2 Score" src="https://img.shields.io/badge/TTS_Arena2-Rank_%231-gold?style=flat-square&logo=trophy&logoColor=white">
     </a>
-    <a target="_blank" href="https://huggingface.co/fishaudio/s2-pro">
+    <a target="_blank" href="https://huggingface.co/fishaudio/s2">
         <img alt="HuggingFace Model" src="https://img.shields.io/badge/🤗%20-models-orange"/>
     </a>
 </div>
@@ -48,32 +48,57 @@
 > **法律免责声明**
 > 我们不对代码库的任何非法使用承担责任。请参考您当地关于 DMCA 和其他相关法律的法规。
 
-## 从这里开始
+## 快速开始
 
-这里是 Fish Speech 的官方文档,请按照说明轻松入门。
+### 文档入口
+
+这里是 Fish Audio S2 的官方文档,请按照说明轻松入门。
 
 - [安装](https://speech.fish.audio/zh/install/)
-- [推理](https://speech.fish.audio/zh/inference/)
+- [命令行推理](https://speech.fish.audio/zh/inference/)
+- [WebUI 推理](https://speech.fish.audio/zh/inference/)
+- [服务端推理](https://speech.fish.audio/zh/server/)
+- [Docker 部署](https://speech.fish.audio/zh/install/)
 
-## Fish Audio S2
-**开源和闭源中最出色的文本转语音系统**
+如需 SGLang Server,请参考 [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md)。
+
+### LLM Agent 指南
+
+```
+请先阅读 https://speech.fish.audio/zh/install/ ,并按文档安装和配置 Fish Audio S2。
+```
 
-Fish Audio S2 是由 [Fish Audio](https://fish.audio/) 开发的最新模型,旨在生成听起来自然、真实且情感丰富的语音——不机械、不平淡,也不局限于录音室风格的朗读。
+## Fish Audio S2
+**在开源与闭源方案中都处于领先水平的文本转语音系统**
 
-Fish Audio S2 专注于日常对话,支持原生多说话人和多轮生成。同时支持指令控制。
+Fish Audio S2 是由 [Fish Audio](https://fish.audio/) 开发的最新模型。S2 在约 50 种语言、超过 1000 万小时音频数据上完成训练,并结合强化学习对齐与双自回归架构,能够生成自然、真实且情感丰富的语音。
 
-S2 系列包含多个模型,开源模型为 S2-Pro,是该系列中性能最强的模型
+S2 支持通过自然语言标签(如 `[laugh]`、`[whispers]`、`[super happy]`)对韵律和情绪进行细粒度行内控制,同时原生支持多说话人和多轮生成。
 
-请访问 [Fish Audio 网站](https://fish.audio/) 以获取实时体验。
+请访问 [Fish Audio 网站](https://fish.audio/) 体验在线演示,并阅读[博客文章](https://fish.audio/blog/fish-audio-open-sources-s2/)了解更多细节。
 
 ### 模型变体
 
 | 模型 | 大小 | 可用性 | 描述 |
 |------|------|-------------|-------------|
-| S2-Pro | 4B 参数 | [huggingface](https://huggingface.co/fishaudio/s2-pro) | 功能齐全的旗舰模型,具有最高质量和稳定性 |
-| S2-Flash | - - - - | [fish.audio](https://fish.audio/) | 我们的闭源模型,具有更快的速度和更低的延迟 |
+| S2-Pro | 4B 参数 | [HuggingFace](https://huggingface.co/fishaudio/s2-pro) | 功能齐全的旗舰模型,具有最高质量和稳定性 |
+
+有关模型的更多详情,请参见[技术报告](https://arxiv.org/abs/2411.01156)。
+
+## 基准测试结果
+
+| 基准 | Fish Audio S2 |
+|------|------|
+| Seed-TTS Eval — WER(中文) | **0.54%**(总体最佳) |
+| Seed-TTS Eval — WER(英文) | **0.99%**(总体最佳) |
+| Audio Turing Test(含指令) | **0.515** 后验均值 |
+| EmergentTTS-Eval — 胜率 | **81.88%**(总体最高) |
+| Fish Instruction Benchmark — TAR | **93.3%** |
+| Fish Instruction Benchmark — 质量 | **4.51 / 5.0** |
+| 多语言(MiniMax Testset)— 最佳 WER | **24** 种语言中的 **11** 种 |
+| 多语言(MiniMax Testset)— 最佳 SIM | **24** 种语言中的 **17** 种 |
 
-有关模型的更多详情,请参见技术报告。
+在 Seed-TTS Eval 上,S2 在所有已评估模型(包括闭源系统)中实现了最低 WER:Qwen3-TTS(0.77/1.24)、MiniMax Speech-02(0.99/1.90)、Seed-TTS(1.12/2.25)。在 Audio Turing Test 上,S2 的 0.515 相比 Seed-TTS(0.417)提升 24%,相比 MiniMax-Speech(0.387)提升 33%。在 EmergentTTS-Eval 中,S2 在副语言学(91.61% 胜率)、疑问句(84.41%)和句法复杂度(83.39%)等维度表现尤为突出。
 
 ## 亮点
 
@@ -83,6 +108,29 @@ S2 系列包含多个模型,开源模型为 S2-Pro,是该系列中性能最
 
 Fish Audio S2 支持在文本中的特定词或短语位置直接嵌入自然语言指令,从而对语音生成进行局部控制。与依赖固定预设标签不同,S2 接受自由形式的文本描述,例如 [whisper in small voice]、[professional broadcast tone] 或 [pitch up],实现词级别的开放式表达控制。
 
+### 双自回归架构(Dual-Autoregressive)
+
+S2 基于仅解码器 Transformer,并结合 RVQ 音频编解码器(10 个码本,约 21 Hz 帧率)。Dual-AR 架构将生成拆分为两个阶段:
+
+- **Slow AR** 沿时间轴运行,预测主语义码本。
+- **Fast AR** 在每个时间步生成剩余 9 个残差码本,用于重建细粒度声学细节。
+
+这种非对称设计(时间轴 4B 参数、深度轴 400M 参数)在保持音频保真度的同时,提高了推理效率。
+
+### 强化学习对齐
+
+S2 使用 Group Relative Policy Optimization(GRPO)进行后训练对齐。用于过滤和标注训练数据的同一批模型被直接复用为 RL 的奖励模型,从而避免了预训练数据分布与后训练目标之间的不匹配。奖励信号综合了语义准确性、指令遵循、声学偏好评分与音色相似度。
+
+### 基于 SGLang 的生产级流式推理
+
+由于 Dual-AR 架构在结构上与标准自回归 LLM 同构,S2 可以直接继承 SGLang 提供的 LLM 原生服务优化能力,包括连续批处理、分页 KV Cache、CUDA Graph Replay 与基于 RadixAttention 的前缀缓存。
+
+在单张 NVIDIA H200 GPU 上:
+
+- **实时因子(RTF):** 0.195
+- **首音频延迟:** 约 100 ms
+- **吞吐:** 在 RTF 低于 0.5 的情况下达到 3,000+ acoustic tokens/s
+
 ### 多语言支持
 
 Fish Audio S2 支持高质量的多语言文本转语音,无需音素或特定语言的预处理。包括:
@@ -106,6 +154,7 @@ Fish Audio S2 允许用户上传包含多个说话人的参考音频,模型将
 ### 快速语音克隆
 
 Fish Audio S2 支持使用短参考样本(通常为 10-30 秒)进行准确的语音克隆。模型可以捕捉音色、说话风格和情感倾向,无需额外微调即可生成逼真且一致的克隆语音。
+如需使用 SGLang Server,请参考 https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md 。
 
 ---
 

+ 4 - 1
docs/en/index.md

@@ -52,7 +52,10 @@
 This is the official documentation for Fish Speech. Please follow the instructions to get started easily.
 
 - [Installation](install.md)
-- [Inference](inference.md)
+- [Command Line Inference](inference.md#command-line-inference)
+- [WebUI Inference](inference.md#webui-inference)
+- [Server Inference](server.md)
+- [Docker Setup](install.md#docker-setup)
 
 ## Fish Audio S2
 **The best text-to-speech system in both open-source and closed-source**

+ 59 - 0
docs/en/server.md

@@ -0,0 +1,59 @@
+# Server
+
+This page covers server-side inference for Fish Audio S2, plus quick links for WebUI inference and Docker deployment.
+
+## API Server Inference
+
+Fish Speech provides an HTTP API server entrypoint at `tools/api_server.py`.
+
+### Start the server locally
+
+```bash
+python tools/api_server.py \
+  --llama-checkpoint-path checkpoints/s2-pro \
+  --decoder-checkpoint-path checkpoints/s2-pro/codec.pth \
+  --listen 0.0.0.0:8080
+```
+
+Common options:
+
+- `--compile`: enable `torch.compile` optimization
+- `--half`: use fp16 mode
+- `--api-key`: require bearer token authentication
+- `--workers`: set worker process count
+
+### Health check
+
+```bash
+curl -X GET http://127.0.0.1:8080/v1/health
+```
+
+Expected response:
+
+```json
+{"status":"ok"}
+```
+
+### Main API endpoints
+
+- `POST /v1/tts` for text-to-speech generation
+- `POST /v1/vqgan/encode` for VQ encode
+- `POST /v1/vqgan/decode` for VQ decode
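+
+The request schema is defined by the server code; as a minimal sketch, assuming a JSON body with `text` and `format` fields (verify the exact field names in your checkout), a TTS call might look like:
+
+```bash
+# Hypothetical request body; check the TTS request schema in tools/api_server.py
+curl -X POST http://127.0.0.1:8080/v1/tts \
+  -H "Content-Type: application/json" \
+  -d '{"text": "Hello from Fish Audio S2!", "format": "wav"}' \
+  --output output.wav
+```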
+
+## WebUI Inference
+
+For WebUI usage, see:
+
+- [WebUI Inference](https://speech.fish.audio/inference/#webui-inference)
+
+## Docker
+
+For Docker-based server or WebUI deployment, see:
+
+- [Docker Setup](https://speech.fish.audio/install/#docker-setup)
+
+You can also start the server profile directly with Docker Compose:
+
+```bash
+docker compose --profile server up
+```

+ 22 - 18
fish_speech/content_sequence.py

@@ -182,14 +182,16 @@ class ContentSequence:
         # Optimization: Batch conversion for ignore tokens
         ignore_loss_token_ids = []
         if ignore_loss_tokens:
-             # Use the wrapper method which uses convert_tokens_to_ids
-            ignore_loss_token_ids = [tokenizer.get_token_id(i) for i in ignore_loss_tokens]
+            # Use the wrapper method which uses convert_tokens_to_ids
+            ignore_loss_token_ids = [
+                tokenizer.get_token_id(i) for i in ignore_loss_tokens
+            ]
 
         for part in self.parts:
             if isinstance(part, TextPart):
                 if part.tokens is None:
                     assert part.text is not None
-                    # Optimization: Explicitly disable special tokens (BOS/EOS) 
+                    # Optimization: Explicitly disable special tokens (BOS/EOS)
                     # because we are constructing the sequence manually
                     tokens = tokenizer.encode(part.text, add_special_tokens=False)
                 else:
@@ -202,10 +204,10 @@ class ContentSequence:
                 # We use arithmetic offset: code + semantic_begin_id
                 # This assumes semantic tokens are contiguous in the vocab (DualAR requirement)
                 curr_codes = part.codes.clone().to(torch.int)
-                
+
                 # Use int64 (long) for token IDs to avoid overflow or type mismatch in embedding
                 tokens = (curr_codes[0] + tokenizer.semantic_begin_id).to(torch.long)
-                
+
                 vq_parts.append(curr_codes)
                 vq_require_losses.append(part.cal_loss)
             else:
@@ -235,17 +237,17 @@ class ContentSequence:
 
         # Concatenate all tensors
         if not all_tokens:
-             # Handle empty case safely
-             tokens = torch.empty(0, dtype=torch.long)
-             labels = torch.empty(0, dtype=torch.long)
-             vq_masks = torch.empty(0, dtype=torch.bool)
-             audio_masks = torch.empty(0, dtype=torch.bool)
+            # Handle empty case safely
+            tokens = torch.empty(0, dtype=torch.long)
+            labels = torch.empty(0, dtype=torch.long)
+            vq_masks = torch.empty(0, dtype=torch.bool)
+            audio_masks = torch.empty(0, dtype=torch.bool)
         else:
             tokens = torch.cat(all_tokens, dim=0)
             labels = torch.cat(all_labels, dim=0)
             vq_masks = torch.cat(vq_masks, dim=0)
             audio_masks = torch.cat(audio_masks, dim=0)
-        
+
         vq_require_losses = torch.tensor(vq_require_losses, dtype=torch.bool)
 
         # Apply shift if needed for next-token prediction
@@ -294,9 +296,9 @@ class ContentSequence:
         ):
             return values, None, None
 
-        audio_parts = None 
+        audio_parts = None
         audio_masks = None
-        
+
         if encoded.vq_parts is not None and len(encoded.vq_parts) > 0:
             vq_parts = encoded.vq_parts
             # List[Tensor(1, T)] -> Tensor(1, Total_T) -> Tensor(1, Total_T)
@@ -305,10 +307,12 @@ class ContentSequence:
                 # We need to be careful here: vq_parts is a list of tensors from different VQPart segments
                 # They correspond to encoded.vq_mask_tokens
                 # Since we just want to fill the 'values' tensor at the right positions:
-                all_vq_codes = torch.cat(vq_parts, dim=1) # Shape: (C, Total_Semantic_Tokens)
+                all_vq_codes = torch.cat(
+                    vq_parts, dim=1
+                )  # Shape: (C, Total_Semantic_Tokens)
             else:
                 all_vq_codes = vq_parts[0]
-                
+
             # Values[0] is already the Main Token ID (Semantic Begin + Code)
             # Values[1:] should be the codes themselves
             values[1:, encoded.vq_mask_tokens] = all_vq_codes.to(dtype=torch.long)
@@ -383,10 +387,10 @@ class ContentSequence:
 
             # Use HF decode
             val = tokenizer.decode([token_id])
-            
+
             # Simple fallback for visualization if decode returns empty or weird stuff for special tokens
             if not val:
-                 val = f"<{token_id}>"
+                val = f"<{token_id}>"
 
             if lab == -100:
                 print_in_green(val)
@@ -396,4 +400,4 @@ class ContentSequence:
         if merge_semantic_tokens and count_semantic_tokens > 0:
             print_semantic_token(semantic_label, count_semantic_tokens)
 
-        print()
+        print()

+ 1 - 1
fish_speech/i18n/locale/en_US.json

@@ -120,4 +120,4 @@
   "Normalization Result Preview (Currently Only Chinese)": "Normalization Result Preview (Currently Only Chinese)",
   "Text Normalization": "Text Normalization",
   "Select Example Audio": "Select Example Audio"
-}
+}

+ 1 - 1
fish_speech/i18n/locale/es_ES.json

@@ -120,4 +120,4 @@
   "Normalization Result Preview (Currently Only Chinese)": "Vista Previa del Resultado de Normalización (Actualmente Solo Chino)",
   "Text Normalization": "Normalización de Texto",
   "Select Example Audio": "Selecionar áudio de exemplo"
-}
+}

+ 1 - 1
fish_speech/i18n/locale/ja_JP.json

@@ -120,4 +120,4 @@
   "Normalization Result Preview (Currently Only Chinese)": "正規化結果プレビュー(現在は中国語のみ)",
   "Text Normalization": "テキスト正規化",
   "Select Example Audio": "サンプル音声を選択"
-}
+}

+ 1 - 1
fish_speech/i18n/locale/ko_KR.json

@@ -120,4 +120,4 @@
   "Normalization Result Preview (Currently Only Chinese)": "정규화 결과 미리보기(현재 중국어만 지원)",
   "Text Normalization": "텍스트 정규화",
   "Select Example Audio": "예시 오디오 선택"
-}
+}

+ 1 - 1
fish_speech/i18n/locale/pt_BR.json

@@ -129,4 +129,4 @@
   "No": "Não",
   "version:": "versão:",
   "author:": "autor:"
-}
+}

+ 1 - 1
fish_speech/i18n/locale/zh_CN.json

@@ -120,4 +120,4 @@
   "Normalization Result Preview (Currently Only Chinese)": "规范化结果预览",
   "Text Normalization": "文本规范化",
   "Select Example Audio": "选择参考音频"
-}
+}

+ 4 - 4
fish_speech/models/dac/modded_dac.py

@@ -994,9 +994,9 @@ class DAC(BaseModel, CodecMixin):
 
 if __name__ == "__main__":
     import hydra
-    import torch
     import numpy as np
     import soundfile as sf
+    import torch
     from omegaconf import OmegaConf
 
     # 配置路径
@@ -1004,7 +1004,7 @@ if __name__ == "__main__":
     checkpoint_path = "checkpoints/s2-pro/codec.pth"
     codes_path = "./output/codes_0.npy"  # 你的 codes 文件路径
     output_path = "reconstructed_from_codes.wav"
-    sample_rate = 44100 # 请确保采样率与模型训练时一致
+    sample_rate = 44100  # 请确保采样率与模型训练时一致
 
     with torch.inference_mode():
         # 1. 初始化模型
@@ -1028,11 +1028,11 @@ if __name__ == "__main__":
         # 3. 直接从 codes 重建音频 (Decoding)
         # 注意:fish_speech 的 model.from_indices 通常接受的输入是 LongTensor
         fake_audio = model.from_indices(codes_tensor)
-        
+
         # 4. 后处理与保存
         # fake_audio 形状通常为 [B, C, T]
         audio_np = fake_audio.squeeze().cpu().numpy()
-        
+
         # 如果是多声道,转置为 soundfile 要求的 (samples, channels)
         if len(audio_np.shape) == 2:
             audio_np = audio_np.T

+ 37 - 49
fish_speech/models/text2semantic/inference.py

@@ -47,7 +47,7 @@ def multinomial_sample_one_no_sync(
     return torch.argmax(probs_sort / q, dim=-1, keepdim=True).to(dtype=torch.int)
 
 
-RAS_WIN_SIZE = 10   # window for Repetition Aware Sampling
+RAS_WIN_SIZE = 10  # window for Repetition Aware Sampling
 RAS_HIGH_TEMP = 1.0
 RAS_HIGH_TOP_P = 0.9
 
@@ -116,23 +116,30 @@ def decode_one_token_ar(
     biased_logits = logits + semantic_logit_bias
 
     # Normal sample
-    main_token_normal = sample(biased_logits, temperature=temperature, top_p=top_p, top_k=top_k)[0]
+    main_token_normal = sample(
+        biased_logits, temperature=temperature, top_p=top_p, top_k=top_k
+    )[0]
 
     # RAS: also sample with high temp to use as fallback if token repeats
-    high_temp = torch.tensor(RAS_HIGH_TEMP, device=temperature.device, dtype=temperature.dtype)
+    high_temp = torch.tensor(
+        RAS_HIGH_TEMP, device=temperature.device, dtype=temperature.dtype
+    )
     high_top_p = torch.tensor(RAS_HIGH_TOP_P, device=top_p.device, dtype=top_p.dtype)
-    main_token_high = sample(biased_logits, temperature=high_temp, top_p=high_top_p, top_k=top_k)[0]
+    main_token_high = sample(
+        biased_logits, temperature=high_temp, top_p=high_top_p, top_k=top_k
+    )[0]
 
     # Use high-temp sample if: token is semantic AND token is in previous window
     if previous_tokens is not None:
         in_window = (previous_tokens[0] == main_token_normal).any()
         # Use tensor ops (&, torch.where) instead of Python (and, if) — torch.compile requires no data-dependent branching
-        is_semantic = (
-            (main_token_normal >= model.config.semantic_begin_id)
-            & (main_token_normal <= model.config.semantic_end_id)
+        is_semantic = (main_token_normal >= model.config.semantic_begin_id) & (
+            main_token_normal <= model.config.semantic_end_id
         )
         should_use_high = in_window & is_semantic
-        main_token_normal = torch.where(should_use_high, main_token_high, main_token_normal)
+        main_token_normal = torch.where(
+            should_use_high, main_token_high, main_token_normal
+        )
 
     codebooks = [main_token_normal]
 
@@ -144,7 +151,7 @@ def decode_one_token_ar(
 
     input_pos = torch.tensor([0], device=hidden_states.device, dtype=torch.long)
     model.forward_generate_fast(hidden_states, input_pos)
-    
+
     # [MODIFIED] Access config instead of tokenizer
     a = codebooks[0] - model.config.semantic_begin_id
     a[a < 0] = 0
@@ -158,7 +165,7 @@ def decode_one_token_ar(
         )
         logits = model.forward_generate_fast(hidden_states, input_pos)
 
-        short_logits = logits # DualAR predicts config.codebook_size number of tokens
+        short_logits = logits  # DualAR predicts config.codebook_size number of tokens
 
         # Convert logits to probs (no constrain for fast codebooks)
         a = sample(
@@ -200,7 +207,7 @@ def decode_n_tokens(
     )
     # Accumulate all generated tokens (the actual output)
     new_tokens = []
-    
+
     # [MODIFIED] Pre-fetch ID for efficiency loop
     im_end_id = model.tokenizer.get_token_id(IM_END_TOKEN)
 
@@ -223,7 +230,9 @@ def decode_n_tokens(
         cur_token = next_token.view(1, model.config.num_codebooks + 1, -1)
         # Roll RAS window left and insert new token at end
         previous_tokens = previous_tokens.roll(-1, dims=1)
-        previous_tokens[:, -1] = next_token.view(model.config.num_codebooks + 1, -1)[:, 0]
+        previous_tokens[:, -1] = next_token.view(model.config.num_codebooks + 1, -1)[
+            :, 0
+        ]
         new_tokens.append(next_token)
 
         if cur_token[0, 0, -1] == im_end_id:
@@ -270,7 +279,9 @@ def generate(
         max_new_tokens = T_new - T
 
     device = prompt.device
-    dtype = next(model.parameters()).dtype  # model weight dtype (bfloat16), NOT prompt dtype (int32)
+    dtype = next(
+        model.parameters()
+    ).dtype  # model weight dtype (bfloat16), NOT prompt dtype (int32)
 
     # Critical fix: Only set up cache on first run or when necessary
     if not hasattr(model, "_cache_setup_done") or not model._cache_setup_done:
@@ -304,12 +315,12 @@ def generate(
     semantic_logit_bias = torch.full(
         (1, 1, vocab_size), float("-inf"), device=device, dtype=dtype
     )
-    
+
     # [MODIFIED] Use config for semantic range
     semantic_logit_bias[
         0, 0, model.config.semantic_begin_id : model.config.semantic_end_id + 1
     ] = 0.0
-    
+
     # [MODIFIED] Use tokenizer.get_token_id (Wrapper method)
     semantic_logit_bias[0, 0, model.tokenizer.get_token_id(IM_END_TOKEN)] = 0.0
 
@@ -419,9 +430,7 @@ def encode_audio(audio_path, codec, device):
     wav, sr = torchaudio.load(str(audio_path))
     if wav.shape[0] > 1:
         wav = wav.mean(dim=0, keepdim=True)
-    wav = torchaudio.functional.resample(
-        wav.to(device), sr, codec.sample_rate
-    )[0]
+    wav = torchaudio.functional.resample(wav.to(device), sr, codec.sample_rate)[0]
 
     # Match codec model dtype (e.g. bfloat16)
     model_dtype = next(codec.parameters()).dtype
@@ -557,7 +566,6 @@ def generate_long(
     # Build base conversation with system message
     base_conversation = Conversation()
 
-
     if use_prompt:
         # Auto-add speaker tags to prompt texts that don't have them
         tagged_prompt_text = []
@@ -603,9 +611,7 @@ def generate_long(
     else:
         batches = [text]
 
-    logger.info(
-        f"Split into {len(turns)} turns, grouped into {len(batches)} batches"
-    )
+    logger.info(f"Split into {len(turns)} turns, grouped into {len(batches)} batches")
 
     for sample_idx in range(num_samples):
         if torch.cuda.is_available():
@@ -654,10 +660,8 @@ def generate_long(
                 merge_semantic_tokens=True,
             )
 
-            encoded, audio_masks, audio_parts = (
-                conversation_gen.encode_for_inference(
-                    tokenizer, num_codebooks=model.config.num_codebooks
-                )
+            encoded, audio_masks, audio_parts = conversation_gen.encode_for_inference(
+                tokenizer, num_codebooks=model.config.num_codebooks
             )
 
             logger.info(f"Encoded prompt shape: {encoded.shape}")
@@ -689,9 +693,7 @@ def generate_long(
             )
 
             if sample_idx == 0 and batch_idx == 0 and compile:
-                logger.info(
-                    f"Compilation time: {time.perf_counter() - t0:.2f} seconds"
-                )
+                logger.info(f"Compilation time: {time.perf_counter() - t0:.2f} seconds")
 
             if torch.cuda.is_available():
                 torch.cuda.synchronize()
@@ -723,9 +725,7 @@ def generate_long(
                 )
             )
 
-            yield GenerateResponse(
-                action="sample", codes=codes, text=batch_text
-            )
+            yield GenerateResponse(action="sample", codes=codes, text=batch_text)
 
             # Cleanup
             del y, encoded
@@ -868,19 +868,11 @@ def main(
         raise ValueError(
             "--prompt-text requires either --prompt-audio or --prompt-tokens"
         )
-    if (
-        prompt_text
-        and prompt_tokens
-        and len(prompt_text) != len(prompt_tokens)
-    ):
+    if prompt_text and prompt_tokens and len(prompt_text) != len(prompt_tokens):
         raise ValueError(
             f"Number of prompt text ({len(prompt_text)}) and prompt tokens ({len(prompt_tokens)}) should be the same"
         )
-    if (
-        prompt_text
-        and prompt_audio
-        and len(prompt_text) != len(prompt_audio)
-    ):
+    if prompt_text and prompt_audio and len(prompt_text) != len(prompt_audio):
         raise ValueError(
             f"Number of prompt text ({len(prompt_text)}) and prompt audio ({len(prompt_audio)}) should be the same"
         )
@@ -912,9 +904,7 @@ def main(
         prompt_tokens_list = [
             encode_audio(p, codec, device).cpu() for p in prompt_audio
         ]
-        logger.info(
-            f"Encoded {len(prompt_audio)} audio file(s) to VQ codes"
-        )
+        logger.info(f"Encoded {len(prompt_audio)} audio file(s) to VQ codes")
     elif prompt_tokens is not None:
         prompt_tokens_list = [torch.from_numpy(np.load(p)) for p in prompt_tokens]
 
@@ -958,9 +948,7 @@ def main(
                 if output:
                     if codec is None:
                         logger.info("Loading codec model for audio decoding...")
-                        codec = load_codec_model(
-                            codec_checkpoint, device, precision
-                        )
+                        codec = load_codec_model(codec_checkpoint, device, precision)
                     audio = decode_to_audio(merged_codes.to(device), codec)
                     import soundfile as sf
 
@@ -980,4 +968,4 @@ def main(
 
 
 if __name__ == "__main__":
-    main()
\ No newline at end of file
+    main()
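
The reshaped `encode_audio` above boils down to a load → downmix → resample → dtype-match pipeline before VQ encoding. A minimal standalone sketch of that pattern, assuming only a `codec` module that exposes `sample_rate` and ordinary `torch.nn.Module` parameters (the actual VQ encode call is elided):

```python
import torch
import torchaudio


def prepare_audio(audio_path: str, codec: torch.nn.Module, device: str) -> torch.Tensor:
    # Load audio and downmix multi-channel input to mono
    wav, sr = torchaudio.load(str(audio_path))
    if wav.shape[0] > 1:
        wav = wav.mean(dim=0, keepdim=True)
    # Resample to the codec's native rate on the target device
    wav = torchaudio.functional.resample(wav.to(device), sr, codec.sample_rate)[0]
    # Match the codec model's dtype (e.g. bfloat16) before encoding
    return wav.to(next(codec.parameters()).dtype)
```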

+ 30 - 21
fish_speech/models/text2semantic/llama.py

@@ -47,7 +47,7 @@ class BaseModelArgs:
     # Codebook configs
     codebook_size: int = 160
     num_codebooks: int = 4
-    
+
     semantic_begin_id: int = 0
     semantic_end_id: int = 0
 
@@ -232,10 +232,14 @@ def _remap_fish_qwen3_omni_keys(weights: OrderedDict) -> OrderedDict:
     new_weights = OrderedDict()
     for k, v in weights.items():
         if k.startswith("text_model.model."):
-            new_key = k[len("text_model.model."):]
+            new_key = k[len("text_model.model.") :]
         elif k.startswith("audio_decoder."):
-            suffix = k[len("audio_decoder."):]
-            new_key = suffix if suffix.startswith("codebook_embeddings.") else "fast_" + suffix
+            suffix = k[len("audio_decoder.") :]
+            new_key = (
+                suffix
+                if suffix.startswith("codebook_embeddings.")
+                else "fast_" + suffix
+            )
         else:
             new_key = k
         new_weights[new_key] = v
@@ -329,12 +333,13 @@ class BaseTransformer(nn.Module):
             embeds.append(emb)
 
         vq_embeds_sum = torch.stack(embeds, dim=1).sum(dim=1)
-        
-        is_semantic = (inp[:, 0] >= self.config.semantic_begin_id) & \
-                      (inp[:, 0] <= self.config.semantic_end_id)
-        
+
+        is_semantic = (inp[:, 0] >= self.config.semantic_begin_id) & (
+            inp[:, 0] <= self.config.semantic_end_id
+        )
+
         vq_embeds_sum[~is_semantic] = 0
-        
+
         x = self.embeddings(inp[:, 0]) + vq_embeds_sum
 
         return x
@@ -374,9 +379,7 @@ class BaseTransformer(nn.Module):
             token_logits = self.output(slow_out)
 
         hidden_out = (
-            slow_out
-            if getattr(self.config, "norm_fastlayer_input", False)
-            else x
+            slow_out if getattr(self.config, "norm_fastlayer_input", False) else x
         )
 
         return BaseTransformerForwardResult(
@@ -392,7 +395,7 @@ class BaseTransformer(nn.Module):
         audio_parts: Optional[Tensor] = None,
         return_all: bool = False,
     ) -> BaseTransformerForwardResult:
-        
+
         # Embedding logic replicated from embed() for compilation compatibility
         embeds = []
         for i in range(self.config.num_codebooks):
@@ -454,9 +457,7 @@ class BaseTransformer(nn.Module):
             token_logits = self.output(slow_out)
 
         hidden_out = (
-            slow_out
-            if getattr(self.config, "norm_fastlayer_input", False)
-            else x
+            slow_out if getattr(self.config, "norm_fastlayer_input", False) else x
         )
 
         return BaseTransformerForwardResult(
@@ -499,9 +500,13 @@ class BaseTransformer(nn.Module):
             tokenizer = FishTokenizer.from_pretrained(path)
             config.semantic_begin_id = tokenizer.semantic_begin_id
             config.semantic_end_id = tokenizer.semantic_end_id
-            logger.info(f"Injected Semantic IDs into Config: {config.semantic_begin_id}-{config.semantic_end_id}")
+            logger.info(
+                f"Injected Semantic IDs into Config: {config.semantic_begin_id}-{config.semantic_end_id}"
+            )
         except Exception as e:
-            logger.warning(f"Failed to load tokenizer for config injection: {e}. Semantic IDs might be 0.")
+            logger.warning(
+                f"Failed to load tokenizer for config injection: {e}. Semantic IDs might be 0."
+            )
 
         match config.model_type:
             case "naive":
@@ -523,6 +528,7 @@ class BaseTransformer(nn.Module):
             if "int8" in str(Path(path)):
                 logger.info("Using int8 weight-only quantization!")
                 from tools.llama.quantize import WeightOnlyInt8QuantHandler
+
                 simple_quantizer = WeightOnlyInt8QuantHandler(model)
                 model = simple_quantizer.convert_for_runtime()
 
@@ -532,6 +538,7 @@ class BaseTransformer(nn.Module):
                 assert path_comps[-2].startswith("g")
                 groupsize = int(path_comps[-2][1:])
                 from tools.llama.quantize import WeightOnlyInt4QuantHandler
+
                 simple_quantizer = WeightOnlyInt4QuantHandler(model, groupsize)
                 model = simple_quantizer.convert_for_runtime()
 
@@ -543,6 +550,7 @@ class BaseTransformer(nn.Module):
             if index_json.exists():
                 logger.info("Loading sharded safetensors weights")
                 from safetensors.torch import load_file as st_load_file
+
                 with open(index_json) as f:
                     st_index = json.load(f)
                 shard_files = sorted(set(st_index["weight_map"].values()))
@@ -553,6 +561,7 @@ class BaseTransformer(nn.Module):
             elif single_st.exists():
                 logger.info("Loading single safetensors weights")
                 from safetensors.torch import load_file as st_load_file
+
                 weights = OrderedDict(st_load_file(str(single_st), device="cpu"))
                 weights = _remap_fish_qwen3_omni_keys(weights)
             elif pth_file.exists():
@@ -738,12 +747,12 @@ class DualARTransformer(BaseTransformer):
 
         # Extract corresponding parts with labels
         token_labels = labels[:, 0]
-        
+
         # [MODIFIED] Use config instead of tokenizer
         codebook_mask = (token_labels >= self.config.semantic_begin_id) & (
             token_labels <= self.config.semantic_end_id
         )
-        
+
         # This gives where input token is <|semantic|>
         x = x[codebook_mask]
 
@@ -1023,4 +1032,4 @@ def apply_rotary_emb(x: Tensor, freqs_cis: Tensor) -> Tensor:
     )
 
     x_out2 = x_out2.flatten(3)
-    return x_out2.type_as(x)
\ No newline at end of file
+    return x_out2.type_as(x)
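
The `semantic_begin_id`/`semantic_end_id` comparisons threaded through this file implement one idea: the summed VQ codebook embeddings are only added to the input embedding at positions where the slow-transformer token is a `<|semantic:i|>` token. A minimal sketch of that masking, with hypothetical id ranges and shapes (the real ids are injected into the config from `FishTokenizer` at load time, as the `from_pretrained` hunk above shows):

```python
import torch

# Hypothetical contiguous id range for <|semantic:i|> tokens
semantic_begin_id, semantic_end_id = 100_000, 104_095

tokens = torch.tensor([42, 100_003, 104_000, 7])  # slow-transformer token ids
vq_embeds_sum = torch.randn(4, 8)                 # summed codebook embeddings

is_semantic = (tokens >= semantic_begin_id) & (tokens <= semantic_end_id)
vq_embeds_sum[~is_semantic] = 0                   # text positions get no VQ signal
# x = embeddings(tokens) + vq_embeds_sum          # as in BaseTransformer.embed()
```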

+ 35 - 16
fish_speech/tokenizer.py

@@ -33,32 +33,44 @@ MODALITY_TOKENS = {
 }
 
 SEMANTIC_TOKEN_TEMPLATE = "<|semantic:{i}|>"
-SEMANTIC_TOKENS =[SEMANTIC_TOKEN_TEMPLATE.format(i=i) for i in range(4096)]
-
-ALL_SPECIAL_TOKENS =[
-    EOS_TOKEN, PAD_TOKEN, IM_START_TOKEN, IM_END_TOKEN,
-    PHONEME_START_TOKEN, PHONEME_END_TOKEN, MODALITY_TEXT_TOKEN,
-    MODALITY_VOICE_TOKEN, MODALITY_INTERLEAVE_TOKEN, AUDIO_START_TOKEN,
-    AUDIO_END_TOKEN, AUDIO_EMBED_TOKEN, *SEMANTIC_TOKENS,
+SEMANTIC_TOKENS = [SEMANTIC_TOKEN_TEMPLATE.format(i=i) for i in range(4096)]
+
+ALL_SPECIAL_TOKENS = [
+    EOS_TOKEN,
+    PAD_TOKEN,
+    IM_START_TOKEN,
+    IM_END_TOKEN,
+    PHONEME_START_TOKEN,
+    PHONEME_END_TOKEN,
+    MODALITY_TEXT_TOKEN,
+    MODALITY_VOICE_TOKEN,
+    MODALITY_INTERLEAVE_TOKEN,
+    AUDIO_START_TOKEN,
+    AUDIO_END_TOKEN,
+    AUDIO_EMBED_TOKEN,
+    *SEMANTIC_TOKENS,
 ]
 
+
 class FishTokenizer:
     def __init__(self, model_path: str):
         self._tokenizer = AutoTokenizer.from_pretrained(model_path)
         self.semantic_id_to_token_id = {}
-        
+
         vocab = self._tokenizer.get_vocab()
-        valid_ids =[]
-        
+        valid_ids = []
+
         for code_idx in range(4096):
             token = SEMANTIC_TOKEN_TEMPLATE.format(i=code_idx)
             if token in vocab:
                 token_id = vocab[token]
                 self.semantic_id_to_token_id[code_idx] = token_id
                 valid_ids.append(token_id)
-        
+
         if not valid_ids:
-            logger.error("CRITICAL ERROR: No semantic tokens found in vocab! Audio cannot be synthesized.")
+            logger.error(
+                "CRITICAL ERROR: No semantic tokens found in vocab! Audio cannot be synthesized."
+            )
             self.semantic_begin_id = 0
             self.semantic_end_id = 0
             # Dummy tensor to prevent crash, though generation will fail
@@ -71,7 +83,9 @@ class FishTokenizer:
             for k, v in self.semantic_id_to_token_id.items():
                 self.semantic_map_tensor[k] = v
 
-        logger.info(f"Loaded Tokenizer. Semantic Range: {self.semantic_begin_id} -> {self.semantic_end_id}")
+        logger.info(
+            f"Loaded Tokenizer. Semantic Range: {self.semantic_begin_id} -> {self.semantic_end_id}"
+        )
 
     @property
     def vocab_size(self):
@@ -88,13 +102,18 @@ class FishTokenizer:
     def get_token_id(self, token: str) -> int:
         return self._tokenizer.convert_tokens_to_ids(token)
 
-    def encode(self, text: str, add_special_tokens: bool = False, **kwargs) -> List[int]:
+    def encode(
+        self, text: str, add_special_tokens: bool = False, **kwargs
+    ) -> List[int]:
         # [FIX] Force Qwen/Tiktoken backends to parse special tokens inline
         import inspect
+
         sig = inspect.signature(self._tokenizer.encode)
         if "allowed_special" in sig.parameters and "allowed_special" not in kwargs:
             kwargs["allowed_special"] = "all"
-        return self._tokenizer.encode(text, add_special_tokens=add_special_tokens, **kwargs)
+        return self._tokenizer.encode(
+            text, add_special_tokens=add_special_tokens, **kwargs
+        )
 
     def decode(self, tokens: Union[List[int], int], **kwargs) -> str:
         return self._tokenizer.decode(tokens, **kwargs)
@@ -107,4 +126,4 @@ class FishTokenizer:
         return cls(path)
 
     def __getattr__(self, name):
-        return getattr(self._tokenizer, name)
\ No newline at end of file
+        return getattr(self._tokenizer, name)
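
The `encode` fix above is worth calling out: tiktoken-style backends refuse to parse special tokens such as `<|semantic:0|>` unless `allowed_special` is passed, while other backends do not accept that keyword at all, so the signature inspection keeps the call portable. A standalone sketch of the same pattern, where `tokenizer` is any object with an `encode` method and `add_special_tokens` is assumed to follow the Hugging Face convention:

```python
import inspect


def encode_inline_special(tokenizer, text: str) -> list[int]:
    kwargs = {}
    # Only pass allowed_special when the backend's encode() accepts it,
    # so special tokens embedded in the text are parsed rather than split
    if "allowed_special" in inspect.signature(tokenizer.encode).parameters:
        kwargs["allowed_special"] = "all"
    return tokenizer.encode(text, add_special_tokens=False, **kwargs)
```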

+ 1 - 0
mkdocs.yml

@@ -59,6 +59,7 @@ nav:
   - Installation: en/install.md
   - Finetune: en/finetune.md
   - Inference: en/inference.md
+  - Server: en/server.md
   - Samples: en/samples.md
 
 # Plugins

+ 2 - 4
pyproject.toml

@@ -1,8 +1,8 @@
 [project]
 name = "fish-speech"
-version = "0.1.0"
+version = "2.0.0"
 authors = [
-    {name = "Lengyue", email = "lengyue@lengyue.me"},
+    {name = "Fish Audio", email = "oss@fish.audio"},
 ]
 description = "Fish Speech"
 readme = "README.md"
@@ -116,8 +116,6 @@ name = "pytorch-cu129"
 url = "https://download.pytorch.org/whl/cu129"
 explicit = true
 
-
-
 [build-system]
 requires = ["setuptools", "setuptools-scm"]
 build-backend = "setuptools.build_meta"

The diff of this file was suppressed because it is too large
+ 7 - 2
uv.lock


Some files were not shown because too many files changed in this diff