
S2 beta (#1167)

* Update to support S2 model.

* fix gradio webui bug.

* update docs and license for S2 Model.

* Fix torch.compile and DAC bugs.

* Fix LICENSE.

* fix pyproject.toml bug.

* [fix]:fix hf style ckpt load problem

* [fix]:fix docker and docs

* [docs]:Add docs

* [docs]:update readme and docs

* fix typo

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [fix]fix readme

* [docs]:fix typo

* [docs]: bold sglang server

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update FishAudioS2 Technical Report

---------

Co-authored-by: PoTaTo-Mika <1228427403@qq.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Whale and Dolphin 1 month ago
parent commit
3578e4e709
19 changed files with 683 additions and 271 deletions
  1. 1 0
      .gitignore
  2. BIN
      FishAudioS2TecReport.pdf
  3. 12 10
      README.md
  4. 6 4
      docs/README.ar.md
  5. 6 4
      docs/README.ja.md
  6. 6 4
      docs/README.ko.md
  7. 6 4
      docs/README.pt-BR.md
  8. 6 4
      docs/README.zh.md
  9. 77 40
      docs/ar/index.md
  10. 29 0
      docs/assets/logo.svg
  11. 2 1
      docs/en/finetune.md
  12. 79 43
      docs/en/index.md
  13. 3 3
      docs/en/install.md
  14. 74 37
      docs/ja/index.md
  15. 76 39
      docs/ko/index.md
  16. 74 37
      docs/pt/index.md
  17. 65 27
      docs/zh/index.md
  18. 158 11
      docs/zh/install.md
  19. 3 3
      mkdocs.yml

+ 1 - 0
.gitignore

@@ -96,6 +96,7 @@ filelists/
 .pgx.*
 *log
 *.log
+site/
 
 # External Tools
 # --------------

BIN
FishAudioS2TecReport.pdf


+ 12 - 10
README.md

@@ -4,7 +4,6 @@
 **English** | [简体中文](docs/README.zh.md) | [Portuguese](docs/README.pt-BR.md) | [日本語](docs/README.ja.md) | [한국어](docs/README.ko.md) | [العربية](docs/README.ar.md) <br>
 
 <a href="https://www.producthunt.com/products/fish-speech?embed=true&utm_source=badge-top-post-badge&utm_medium=badge&utm_source=badge-fish&#0045;audio&#0045;s1" target="_blank"><img src="https://api.producthunt.com/widgets/embed-image/v1/top-post-badge.svg?post_id=1023740&theme=light&period=daily&t=1761164814710" alt="Fish&#0032;Audio&#0032;S1 - Expressive&#0032;Voice&#0032;Cloning&#0032;and&#0032;Text&#0045;to&#0045;Speech | Product Hunt" style="width: 250px; height: 54px;" width="250" height="54" /></a>
-</a>
 <a href="https://trendshift.io/repositories/7014" target="_blank">
     <img src="https://trendshift.io/api/badge/repositories/7014" alt="fishaudio%2Ffish-speech | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/>
 </a>
@@ -31,17 +30,20 @@
 </div>
 
 <div align="center">
-    <a target="_blank" href="https://huggingface.co/spaces/TTS-AGI/TTS-Arena-V2">
-      <img alt="TTS-Arena2 Score" src="https://img.shields.io/badge/TTS_Arena2-Rank_%231-gold?style=flat-square&logo=trophy&logoColor=white">
-    </a>
     <a target="_blank" href="https://huggingface.co/fishaudio/s2">
         <img alt="HuggingFace Model" src="https://img.shields.io/badge/🤗%20-models-orange"/>
     </a>
+    <a target="_blank" href="https://fish.audio/blog/fish-audio-open-sources-s2/">
+        <img alt="Fish Audio Blog" src="https://img.shields.io/badge/Blog-Fish_Audio_S2-1f7a8c?style=flat-square&logo=readme&logoColor=white"/>
+    </a>
+    <a target="_blank" href="https://github.com/fishaudio/fish-speech/blob/main/FishAudioS2TecReport.pdf">
+        <img alt="Paper | Technical Report" src="https://img.shields.io/badge/Paper-Technical_Report-b31b1b?style=flat-square"/>
+    </a>
 </div>
 
 > [!IMPORTANT]
 > **License Notice**  
-> This codebase and its associated model weights are released under **[FISH AUDIO RESEARCH LICENSE](LICENSE)**. Please refer to [LICENSE](LICENSE) for more details.
+> This codebase and its associated model weights are released under **[FISH AUDIO RESEARCH LICENSE](LICENSE)**. Please refer to [LICENSE](LICENSE) for more details. We will take action against any violation of the license.
 
 > [!WARNING]
 > **Legal Disclaimer**  
@@ -65,7 +67,7 @@ Here are the official documents for Fish Audio S2, follow the instructions to ge
 ### For LLM Agent
 
 ```
-Install and configure Fish-Audio S2 by following the instructions here:https://speech.fish.audio/install/
+Install and configure Fish-Audio S2 by following the instructions here: https://speech.fish.audio/install/
 ```
 
 ## Fish Audio S2  
@@ -75,13 +77,13 @@ Fish Audio S2 is the latest model developed by [Fish Audio](https://fish.audio/)
 
 S2 supports fine-grained inline control of prosody and emotion using natural-language tags like `[laugh]`, `[whispers]`, and `[super happy]`, as well as native multi-speaker and multi-turn generation.
 
-Visit the [Fish Audio website](https://fish.audio/) for live playground. Read the [blog post](https://fish.audio/blog/fish-audio-open-sources-s2/) for more details.
+Visit the [Fish Audio website](https://fish.audio/) for a live playground. Read the [blog post](https://fish.audio/blog/fish-audio-open-sources-s2/) and the [technical report](https://github.com/fishaudio/fish-speech/blob/main/FishAudioS2TecReport.pdf) for more details.
 
 ### Model Variants
 
 | Model | Size | Availability | Description |
 |------|------|-------------|-------------|
-| S2-Pro | 4B parameters | [HuggingFace](https://huggingface.co/fishaudio/s2-pro) | Full-featured flagship model with maximum quality and stability | 
+| S2-Pro | 4B parameters | [HuggingFace](https://huggingface.co/fishaudio/s2-pro) | Full-featured flagship model with maximum quality and stability |
 
 More details of the model can be found in the [technical report](https://arxiv.org/abs/2411.01156).
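The inline control tags (`[laugh]`, `[whispers]`, `[super happy]`) and the `<|speaker:i|>` speaker token mentioned in these docs compose into plain-text prompts. A minimal sketch of assembling such a prompt, assuming the tag and token syntax shown in the docs; the `build_dialogue_prompt` helper is hypothetical, not part of the fish-speech API:

```python
# Sketch of assembling a multi-speaker S2 prompt. The inline tag syntax
# ([laugh], [whispers], [super happy]) and the <|speaker:i|> token are taken
# from the docs; build_dialogue_prompt itself is an illustrative helper,
# not a real fish-speech function.

def build_dialogue_prompt(turns):
    """turns: list of (speaker_index, text with optional inline tags)."""
    return "\n".join(f"<|speaker:{i}|> {text}" for i, text in turns)

prompt = build_dialogue_prompt([
    (0, "[super happy] We just open-sourced S2!"),
    (1, "[whispers] Seriously? [laugh] That is huge."),
])
print(prompt)
```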
 
@@ -153,8 +155,8 @@ Thanks to the expansion of the model context, our model can now use previous inf
 
 ### Rapid Voice Cloning
 
-Fish Audio S2 supports accurate voice cloning using a short reference sample (typically 10–30 seconds). The model captures timbre, speaking style, and emotional tendencies, producing realistic and consistent cloned voices without additional fine-tuning. Please refer to https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md to use the sglang server.
-
+Fish Audio S2 supports accurate voice cloning using a short reference sample (typically 10–30 seconds). The model captures timbre, speaking style, and emotional tendencies, producing realistic and consistent cloned voices without additional fine-tuning.
+Please refer to https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md to use the SGLang server.
 ---
 
 ## Credits

+ 6 - 4
docs/README.ar.md

@@ -4,7 +4,6 @@
 [English](../README.md) | [简体中文](README.zh.md) | [Portuguese](README.pt-BR.md) | [日本語](README.ja.md) | [한국어](README.ko.md) | **العربية** <br>
 
 <a href="https://www.producthunt.com/products/fish-speech?embed=true&utm_source=badge-top-post-badge&utm_medium=badge&utm_source=badge-fish&#0045;audio&#0045;s1" target="_blank"><img src="https://api.producthunt.com/widgets/embed-image/v1/top-post-badge.svg?post_id=1023740&theme=light&period=daily&t=1761164814710" alt="Fish&#0032;Audio&#0032;S1 - Expressive&#0032;Voice&#0032;Cloning&#0032;and&#0032;Text&#0045;to&#0045;Speech | Product Hunt" style="width: 250px; height: 54px;" width="250" height="54" /></a>
-</a>
 <a href="https://trendshift.io/repositories/7014" target="_blank">
     <img src="https://trendshift.io/api/badge/repositories/7014" alt="fishaudio%2Ffish-speech | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/>
 </a>
@@ -31,12 +30,15 @@
 </div>
 
 <div align="center">
-    <a target="_blank" href="https://huggingface.co/spaces/TTS-AGI/TTS-Arena-V2">
-      <img alt="TTS-Arena2 Score" src="https://img.shields.io/badge/TTS_Arena2-Rank_%231-gold?style=flat-square&logo=trophy&logoColor=white">
-    </a>
     <a target="_blank" href="https://huggingface.co/fishaudio/s2">
         <img alt="HuggingFace Model" src="https://img.shields.io/badge/🤗%20-models-orange"/>
     </a>
+    <a target="_blank" href="https://fish.audio/blog/fish-audio-open-sources-s2/">
+        <img alt="Fish Audio Blog" src="https://img.shields.io/badge/Blog-Fish_Audio_S2-1f7a8c?style=flat-square&logo=readme&logoColor=white"/>
+    </a>
+    <a target="_blank" href="https://github.com/fishaudio/fish-speech/blob/main/FishAudioS2TecReport.pdf">
+        <img alt="Paper | Technical Report" src="https://img.shields.io/badge/Paper-Technical_Report-b31b1b?style=flat-square"/>
+    </a>
 </div>
 
 > [!IMPORTANT]

+ 6 - 4
docs/README.ja.md

@@ -4,7 +4,6 @@
 [English](../README.md) | [简体中文](README.zh.md) | [Portuguese](README.pt-BR.md) | **日本語** | [한국어](README.ko.md) | [العربية](README.ar.md) <br>
 
 <a href="https://www.producthunt.com/products/fish-speech?embed=true&utm_source=badge-top-post-badge&utm_medium=badge&utm_source=badge-fish&#0045;audio&#0045;s1" target="_blank"><img src="https://api.producthunt.com/widgets/embed-image/v1/top-post-badge.svg?post_id=1023740&theme=light&period=daily&t=1761164814710" alt="Fish&#0032;Audio&#0032;S1 - Expressive&#0032;Voice&#0032;Cloning&#0032;and&#0032;Text&#0045;to&#0045;Speech | Product Hunt" style="width: 250px; height: 54px;" width="250" height="54" /></a>
-</a>
 <a href="https://trendshift.io/repositories/7014" target="_blank">
     <img src="https://trendshift.io/api/badge/repositories/7014" alt="fishaudio%2Ffish-speech | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/>
 </a>
@@ -31,12 +30,15 @@
 </div>
 
 <div align="center">
-    <a target="_blank" href="https://huggingface.co/spaces/TTS-AGI/TTS-Arena-V2">
-      <img alt="TTS-Arena2 Score" src="https://img.shields.io/badge/TTS_Arena2-Rank_%231-gold?style=flat-square&logo=trophy&logoColor=white">
-    </a>
     <a target="_blank" href="https://huggingface.co/fishaudio/s2">
         <img alt="HuggingFace Model" src="https://img.shields.io/badge/🤗%20-models-orange"/>
     </a>
+    <a target="_blank" href="https://fish.audio/blog/fish-audio-open-sources-s2/">
+        <img alt="Fish Audio Blog" src="https://img.shields.io/badge/Blog-Fish_Audio_S2-1f7a8c?style=flat-square&logo=readme&logoColor=white"/>
+    </a>
+    <a target="_blank" href="https://github.com/fishaudio/fish-speech/blob/main/FishAudioS2TecReport.pdf">
+        <img alt="Paper | Technical Report" src="https://img.shields.io/badge/Paper-Technical_Report-b31b1b?style=flat-square"/>
+    </a>
 </div>
 
 > [!IMPORTANT]

+ 6 - 4
docs/README.ko.md

@@ -4,7 +4,6 @@
 [English](../README.md) | [简体中文](README.zh.md) | [Portuguese](README.pt-BR.md) | [日本語](README.ja.md) | **한국어** | [العربية](README.ar.md) <br>
 
 <a href="https://www.producthunt.com/products/fish-speech?embed=true&utm_source=badge-top-post-badge&utm_medium=badge&utm_source=badge-fish&#0045;audio&#0045;s1" target="_blank"><img src="https://api.producthunt.com/widgets/embed-image/v1/top-post-badge.svg?post_id=1023740&theme=light&period=daily&t=1761164814710" alt="Fish&#0032;Audio&#0032;S1 - Expressive&#0032;Voice&#0032;Cloning&#0032;and&#0032;Text&#0045;to&#0045;Speech | Product Hunt" style="width: 250px; height: 54px;" width="250" height="54" /></a>
-</a>
 <a href="https://trendshift.io/repositories/7014" target="_blank">
     <img src="https://trendshift.io/api/badge/repositories/7014" alt="fishaudio%2Ffish-speech | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/>
 </a>
@@ -31,12 +30,15 @@
 </div>
 
 <div align="center">
-    <a target="_blank" href="https://huggingface.co/spaces/TTS-AGI/TTS-Arena-V2">
-      <img alt="TTS-Arena2 Score" src="https://img.shields.io/badge/TTS_Arena2-Rank_%231-gold?style=flat-square&logo=trophy&logoColor=white">
-    </a>
     <a target="_blank" href="https://huggingface.co/fishaudio/s2">
         <img alt="HuggingFace Model" src="https://img.shields.io/badge/🤗%20-models-orange"/>
     </a>
+    <a target="_blank" href="https://fish.audio/blog/fish-audio-open-sources-s2/">
+        <img alt="Fish Audio Blog" src="https://img.shields.io/badge/Blog-Fish_Audio_S2-1f7a8c?style=flat-square&logo=readme&logoColor=white"/>
+    </a>
+    <a target="_blank" href="https://github.com/fishaudio/fish-speech/blob/main/FishAudioS2TecReport.pdf">
+        <img alt="Paper | Technical Report" src="https://img.shields.io/badge/Paper-Technical_Report-b31b1b?style=flat-square"/>
+    </a>
 </div>
 
 > [!IMPORTANT]

+ 6 - 4
docs/README.pt-BR.md

@@ -4,7 +4,6 @@
 [English](../README.md) | [简体中文](README.zh.md) | **Portuguese** | [日本語](README.ja.md) | [한국어](README.ko.md) | [العربية](README.ar.md) <br>
 
 <a href="https://www.producthunt.com/products/fish-speech?embed=true&utm_source=badge-top-post-badge&utm_medium=badge&utm_source=badge-fish&#0045;audio&#0045;s1" target="_blank"><img src="https://api.producthunt.com/widgets/embed-image/v1/top-post-badge.svg?post_id=1023740&theme=light&period=daily&t=1761164814710" alt="Fish&#0032;Audio&#0032;S1 - Expressive&#0032;Voice&#0032;Cloning&#0032;and&#0032;Text&#0045;to&#0045;Speech | Product Hunt" style="width: 250px; height: 54px;" width="250" height="54" /></a>
-</a>
 <a href="https://trendshift.io/repositories/7014" target="_blank">
     <img src="https://trendshift.io/api/badge/repositories/7014" alt="fishaudio%2Ffish-speech | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/>
 </a>
@@ -31,12 +30,15 @@
 </div>
 
 <div align="center">
-    <a target="_blank" href="https://huggingface.co/spaces/TTS-AGI/TTS-Arena-V2">
-      <img alt="TTS-Arena2 Score" src="https://img.shields.io/badge/TTS_Arena2-Rank_%231-gold?style=flat-square&logo=trophy&logoColor=white">
-    </a>
     <a target="_blank" href="https://huggingface.co/fishaudio/s2">
         <img alt="HuggingFace Model" src="https://img.shields.io/badge/🤗%20-models-orange"/>
     </a>
+    <a target="_blank" href="https://fish.audio/blog/fish-audio-open-sources-s2/">
+        <img alt="Fish Audio Blog" src="https://img.shields.io/badge/Blog-Fish_Audio_S2-1f7a8c?style=flat-square&logo=readme&logoColor=white"/>
+    </a>
+    <a target="_blank" href="https://github.com/fishaudio/fish-speech/blob/main/FishAudioS2TecReport.pdf">
+        <img alt="Paper | Technical Report" src="https://img.shields.io/badge/Paper-Technical_Report-b31b1b?style=flat-square"/>
+    </a>
 </div>
 
 > [!IMPORTANT]

+ 6 - 4
docs/README.zh.md

@@ -4,7 +4,6 @@
 [English](../README.md) | **简体中文** | [Portuguese](README.pt-BR.md) | [日本語](README.ja.md) | [한국어](README.ko.md) | [العربية](README.ar.md) <br>
 
 <a href="https://www.producthunt.com/products/fish-speech?embed=true&utm_source=badge-top-post-badge&utm_medium=badge&utm_source=badge-fish&#0045;audio&#0045;s1" target="_blank"><img src="https://api.producthunt.com/widgets/embed-image/v1/top-post-badge.svg?post_id=1023740&theme=light&period=daily&t=1761164814710" alt="Fish&#0032;Audio&#0032;S1 - Expressive&#0032;Voice&#0032;Cloning&#0032;and&#0032;Text&#0045;to&#0045;Speech | Product Hunt" style="width: 250px; height: 54px;" width="250" height="54" /></a>
-</a>
 <a href="https://trendshift.io/repositories/7014" target="_blank">
     <img src="https://trendshift.io/api/badge/repositories/7014" alt="fishaudio%2Ffish-speech | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/>
 </a>
@@ -31,12 +30,15 @@
 </div>
 
 <div align="center">
-    <a target="_blank" href="https://huggingface.co/spaces/TTS-AGI/TTS-Arena-V2">
-      <img alt="TTS-Arena2 Score" src="https://img.shields.io/badge/TTS_Arena2-Rank_%231-gold?style=flat-square&logo=trophy&logoColor=white">
-    </a>
     <a target="_blank" href="https://huggingface.co/fishaudio/s2">
         <img alt="HuggingFace Model" src="https://img.shields.io/badge/🤗%20-models-orange"/>
     </a>
+    <a target="_blank" href="https://fish.audio/blog/fish-audio-open-sources-s2/">
+        <img alt="Fish Audio Blog" src="https://img.shields.io/badge/Blog-Fish_Audio_S2-1f7a8c?style=flat-square&logo=readme&logoColor=white"/>
+    </a>
+    <a target="_blank" href="https://github.com/fishaudio/fish-speech/blob/main/FishAudioS2TecReport.pdf">
+        <img alt="Paper | Technical Report" src="https://img.shields.io/badge/Paper-Technical_Report-b31b1b?style=flat-square"/>
+    </a>
 </div>
 
 > [!IMPORTANT]

+ 77 - 40
docs/ar/index.md

@@ -1,7 +1,7 @@
 <div align="center">
 <h1>Fish Speech</h1>
 
-[English](../en/) | [简体中文](../zh/) | [Portuguese](../pt/) | [日本語](../ja/) | [한국어](../ko/) | **العربية** <br>
+<p><a href="../en/">English</a> | <a href="../zh/">简体中文</a> | <a href="../pt/">Portuguese</a> | <a href="../ja/">日本語</a> | <a href="../ko/">한국어</a> | <strong>العربية</strong></p>
 
 <a href="https://www.producthunt.com/products/fish-speech?embed=true&utm_source=badge-top-post-badge&utm_medium=badge&utm_source=badge-fish&#0045;audio&#0045;s1" target="_blank"><img src="https://api.producthunt.com/widgets/embed-image/v1/top-post-badge.svg?post_id=1023740&theme=light&period=daily&t=1761164814710" alt="Fish&#0032;Audio&#0032;S1 - Expressive&#0032;Voice&#0032;Cloning&#0032;and&#0032;Text&#0045;to&#0045;Speech | Product Hunt" style="width: 250px; height: 54px;" width="250" height="54" /></a>
 <a href="https://trendshift.io/repositories/7014" target="_blank">
@@ -30,14 +30,14 @@
 </div>
 
 <div align="center">
-    <a target="_blank" href="https://huggingface.co/spaces/TTS-AGI/TTS-Arena-V2">
-      <img alt="TTS-Arena2 Score" src="https://img.shields.io/badge/TTS_Arena2-Rank_%231-gold?style=flat-square&logo=trophy&logoColor=white">
+    <a target="_blank" href="https://huggingface.co/fishaudio/s2">
+        <img alt="HuggingFace Model" src="https://img.shields.io/badge/🤗%20-models-orange"/>
     </a>
-    <a target="_blank" href="https://huggingface.co/spaces/fishaudio/fish-speech-1">
-        <img alt="Huggingface" src="https://img.shields.io/badge/🤗%20-space%20demo-yellow"/>
+    <a target="_blank" href="https://fish.audio/blog/fish-audio-open-sources-s2/">
+        <img alt="Fish Audio Blog" src="https://img.shields.io/badge/Blog-Fish_Audio_S2-1f7a8c?style=flat-square&logo=readme&logoColor=white"/>
     </a>
-    <a target="_blank" href="https://huggingface.co/fishaudio/s2-pro">
-        <img alt="HuggingFace Model" src="https://img.shields.io/badge/🤗%20-models-orange"/>
+    <a target="_blank" href="https://github.com/fishaudio/fish-speech/blob/main/FishAudioS2TecReport.pdf">
+        <img alt="Paper | Technical Report" src="https://img.shields.io/badge/Paper-Technical_Report-b31b1b?style=flat-square"/>
     </a>
 </div>
 
@@ -47,75 +47,113 @@
 !!! warning "إخلاء المسؤولية القانونية"
     نحن لا نتحمل أي مسؤولية عن أي استخدام غير قانوني لقاعدة الأكواد. يرجى مراجعة القوانين المحلية المتعلقة بـ DMCA والقوانين الأخرى ذات الصلة.
 
-## ابدأ من هنا
+## البدء السريع
+
+### ابدأ من الوثائق
 
-هذا هو الوثائق الرسمية لـ Fish Speech. يرجى اتباع التعليمات للبدء بسهولة.
+هذه هي الوثائق الرسمية لـ Fish Audio S2، ويمكنك البدء مباشرة عبر الروابط التالية:
 
-- [التثبيت](install.md)
-- [الاستدلال عبر سطر الأوامر](inference.md)
-- [استدلال WebUI](inference.md)
-- [الاستدلال عبر الخادم](server.md)
-- [إعداد Docker](install.md)
+- [التثبيت](https://speech.fish.audio/ar/install/)
+- [الاستدلال عبر سطر الأوامر](https://speech.fish.audio/ar/inference/)
+- [استدلال WebUI](https://speech.fish.audio/ar/inference/)
+- [الاستدلال عبر الخادم](https://speech.fish.audio/ar/server/)
+- [إعداد Docker](https://speech.fish.audio/ar/install/)
 
-!!! note
-    بالنسبة لخادم SGLang، راجع [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md).
+> [!IMPORTANT]
+> **بالنسبة لخادم SGLang، راجع [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md).**
 
 ### دليل وكلاء LLM
 
-```text
+```
 قم بتثبيت وإعداد Fish Audio S2 باتباع التعليمات في https://speech.fish.audio/ar/install/ .
 ```
 
 ## Fish Audio S2
-**أفضل نظام لتحويل النص إلى كلام في كل من المصادر المفتوحة والمغلقة**
-
-Fish Audio S2 هو أحدث نموذج تم تطويره بواسطة [Fish Audio](https://fish.audio/)، وهو مصمم لتوليد كلام يبدو طبيعيًا وأصليًا وغنيًا بالعاطفة — غير ميكانيكي أو مسطح أو مقتصر على القراءة بأسلوب الاستوديو.
+**أفضل نظام لتحويل النص إلى كلام بين الأنظمة مفتوحة المصدر ومغلقة المصدر**
 
-يركز Fish Audio S2 على المحادثات اليومية، ويدعم توليد المتحدثين المتعددين الأصليين وتوليد الحوارات متعددة الأدوار. كما يدعم التحكم التعليمي.
+Fish Audio S2 هو أحدث نموذج من [Fish Audio](https://fish.audio/). تم تدريبه على أكثر من 10 ملايين ساعة صوتية عبر نحو 50 لغة، ويجمع بين المواءمة بالتعلم المعزز وبنية Dual-Autoregressive لإنتاج كلام طبيعي وواقعي وغني بالتعبير العاطفي.
 
-تتضمن سلسلة S2 نماذج متعددة. النموذج المفتوح المصدر هو S2-Pro، وهو أقوى نموذج في السلسلة.
+يدعم S2 التحكم الدقيق في النبرة والعاطفة داخل النص نفسه باستخدام وسوم باللغة الطبيعية مثل `[laugh]` و`[whispers]` و`[super happy]`، كما يدعم بشكل أصيل توليد متحدثين متعددين وحوارات متعددة الأدوار.
 
-يرجى زيارة [موقع Fish Audio](https://fish.audio/) لتجربة فورية.
+يمكنك تجربة النموذج مباشرة عبر [موقع Fish Audio](https://fish.audio/)، وقراءة المزيد في [منشور المدونة](https://fish.audio/blog/fish-audio-open-sources-s2/).
 
-### متغيرات النموذج
+### إصدارات النموذج
 
 | النموذج | الحجم | التوفر | الوصف |
 |------|------|-------------|-------------|
-| S2-Pro | 4B معاملات | [huggingface](https://huggingface.co/fishaudio/s2-pro) | نموذج رائد بكامل الميزات مع أعلى جودة واستقرار |
+| S2-Pro | 4B معلمة | [HuggingFace](https://huggingface.co/fishaudio/s2-pro) | نموذج رائد كامل الميزات بأعلى مستوى من الجودة والاستقرار |
 
-لمزيد من التفاصيل حول النماذج ، يرجى مراجعة التقرير الفني.
+يمكن العثور على مزيد من التفاصيل في [التقرير التقني](https://arxiv.org/abs/2411.01156).
+
+## نتائج القياس المعياري
+
+| المعيار | Fish Audio S2 |
+|------|------|
+| Seed-TTS Eval — WER (الصينية) | **0.54%** (الأفضل إجمالاً) |
+| Seed-TTS Eval — WER (الإنجليزية) | **0.99%** (الأفضل إجمالاً) |
+| Audio Turing Test (مع التعليمات) | **0.515** المتوسط البعدي |
+| EmergentTTS-Eval — معدل الفوز | **81.88%** (الأعلى إجمالاً) |
+| Fish Instruction Benchmark — TAR | **93.3%** |
+| Fish Instruction Benchmark — الجودة | **4.51 / 5.0** |
+| متعدد اللغات (MiniMax Testset) — أفضل WER | **11 من 24** لغة |
+| متعدد اللغات (MiniMax Testset) — أفضل SIM | **17 من 24** لغة |
+
+في Seed-TTS Eval، حقق S2 أقل WER بين جميع النماذج التي تم تقييمها، بما في ذلك الأنظمة المغلقة: Qwen3-TTS ‏(0.77/1.24)، وMiniMax Speech-02 ‏(0.99/1.90)، وSeed-TTS ‏(1.12/2.25). وفي Audio Turing Test، تفوقت قيمة 0.515 على Seed-TTS ‏(0.417) بنسبة 24% وعلى MiniMax-Speech ‏(0.387) بنسبة 33%. وفي EmergentTTS-Eval، حقق S2 نتائج قوية بشكل خاص في الخصائص شبه اللغوية (91.61%)، والأسئلة (84.41%)، والتعقيد النحوي (83.39%).
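The relative improvements quoted above for the Audio Turing Test follow directly from the posterior means in the benchmark table, as a quick arithmetic check shows:

```python
# Sanity-checking the quoted relative gains on the Audio Turing Test:
# S2 scores 0.515 vs Seed-TTS 0.417 and MiniMax-Speech 0.387.
s2, seed_tts, minimax = 0.515, 0.417, 0.387
print(round((s2 / seed_tts - 1) * 100))  # 24 (%)
print(round((s2 / minimax - 1) * 100))   # 33 (%)
```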
 
 ## أبرز المميزات
 
 <img src="../assets/totalability.png" width=200%>
 
-### التحكم باللغة الطبيعية
+### تحكم مضمّن دقيق عبر اللغة الطبيعية
+
+يتيح Fish Audio S2 تحكمًا موضعيًا في توليد الكلام من خلال تضمين تعليمات باللغة الطبيعية مباشرة عند مواقع كلمات أو عبارات محددة داخل النص. وبدلًا من الاعتماد على مجموعة ثابتة من الوسوم المُعرّفة مسبقًا، يقبل S2 أوصافًا نصية حرة مثل [whisper in small voice] أو [professional broadcast tone] أو [pitch up]، مما يتيح تحكمًا مفتوحًا في التعبير على مستوى الكلمة.
+
+### بنية Dual-Autoregressive
 
-يسمح Fish Audio S2 للمستخدمين باستخدام اللغة الطبيعية للتحكم في أداء كل جملة ، والمعلومات غير اللفظية ، والعواطف ، والمزيد من خصائص الصوت ، بدلاً من مجرد استخدام علامات قصيرة للتحكم بشكل غامض في أداء النموذج. يؤدي ذلك إلى تحسين الجودة الإجمالية للمحتوى المولّد بشكل كبير.
+يعتمد S2 على Transformer أحادي الاتجاه (Decoder-only) مع مُرمّز صوتي قائم على RVQ (عدد 10 codebooks وبمعدل إطارات يقارب 21 هرتز). وتُقسّم بنية Dual-AR عملية التوليد إلى مرحلتين:
+
+- **Slow AR** يعمل على المحور الزمني ويتنبأ بالـ semantic codebook الأساسي.
+- **Fast AR** يولّد الـ 9 residual codebooks المتبقية في كل خطوة زمنية لإعادة بناء التفاصيل الصوتية الدقيقة.
+
+هذا التصميم غير المتماثل (4B معلمة على المحور الزمني و400M على محور العمق) يرفع كفاءة الاستدلال مع الحفاظ على جودة الصوت.
+
+### المواءمة بالتعلم المعزز
+
+يستخدم S2 خوارزمية Group Relative Policy Optimization (GRPO) للمواءمة بعد التدريب. ويتم إعادة استخدام نفس النماذج التي استُخدمت لتصفية بيانات التدريب وتعليقها كنماذج مكافأة في التعلم المعزز مباشرة، مما يلغي عدم تطابق التوزيع بين بيانات ما قبل التدريب وأهداف ما بعد التدريب. وتجمع إشارة المكافأة بين الدقة الدلالية، والالتزام بالتعليمات، وتقييم التفضيل الصوتي، وتشابه النبرة.
+
+### البث الإنتاجي عبر SGLang
+
+لأن بنية Dual-AR متماثلة بنيويًا مع نماذج LLM autoregressive القياسية، فإن S2 يرث مباشرة تحسينات الخدمة الأصلية في SGLang، بما في ذلك: continuous batching، وpaged KV cache، وCUDA graph replay، وprefix caching المعتمد على RadixAttention.
+
+على بطاقة NVIDIA H200 واحدة:
+
+- **عامل الزمن الحقيقي (RTF):** 0.195
+- **الزمن حتى أول مقطع صوتي:** حوالي 100 مللي ثانية
+- **معدل المعالجة:** أكثر من 3,000 acoustic tokens/s مع الحفاظ على RTF أقل من 0.5
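The serving figures above can be sanity-checked with back-of-envelope arithmetic. The RTF definition and the 10-codebook, ~21 Hz numbers come from this section; reading the 3,000 tokens/s figure as aggregate batched throughput is an assumption:

```python
# RTF = wall-clock synthesis time / duration of audio produced,
# so RTF < 1 means faster than real time.
rtf = 0.195                   # reported single-H200 real-time factor
audio_seconds = 60.0          # one minute of speech
synthesis_seconds = rtf * audio_seconds
print(f"{audio_seconds:.0f}s of audio in ~{synthesis_seconds:.1f}s")  # ~11.7s

# With 10 RVQ codebooks at ~21 Hz, one second of audio is ~210 acoustic
# tokens; 3,000 tokens/s would then be ~14 audio-seconds generated per
# wall-clock second in aggregate (assumed: summed across batched streams).
tokens_per_audio_second = 10 * 21
aggregate_speedup = 3000 / tokens_per_audio_second
print(round(aggregate_speedup, 1))
```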
 
 ### دعم لغات متعددة
 
-يدعم Fish Audio S2 تحويل النص إلى كلام متعدد اللغات بجودة عالية دون الحاجة إلى وحدات صوتية أو معالجة مسبقة خاصة باللغة. يشمل ذلك:
+يدعم Fish Audio S2 تحويل النص إلى كلام بجودة عالية ولغات متعددة دون الحاجة إلى رموز صوتية أو معالجة مسبقة خاصة بكل لغة. بما في ذلك:
 
-**الإنجليزية ، الصينية ، اليابانية ، الكورية ، العربية ، الألمانية ، الفرنسية ...**
+**الإنجليزية، الصينية، اليابانية، الكورية، العربية، الألمانية، الفرنسية...**
 
-**والمزيد في المستقبل!**
+**وأكثر من ذلك بكثير!**
 
-القائمة تتوسع باستمرار ، يرجى التحقق من [Fish Audio](https://fish.audio/) للحصول على أحدث الإصدارات.
+القائمة في توسع مستمر، تحقق من [Fish Audio](https://fish.audio/) لمعرفة أحدث الإصدارات.
 
-### توليد المتحدثين المتعددين الأصليين
+### توليد أصلي لمتحدثين متعددين
 
 <img src="../assets/chattemplate.png" width=200%>
 
-يسمح Fish Audio S2 للمستخدمين بتحميل عينات صوتية مرجعية تحتوي على متحدثين متعددين ، وسيقوم النموذج بمعالجة خصائص كل متحدث من خلال رمز `<|speaker:i|>`. بعد ذلك ، يمكنك التحكم في أداء النموذج عبر رموز معرف المتحدث ، مما يحقق تعدد المتحدثين في عملية توليد واحدة. لا داعي بعد الآن لتحميل أصوات مرجعية وتوليد كلام لكل متحدث على حدة.
+يسمح Fish Audio S2 للمستخدمين برفع صوت مرجعي يحتوي على متحدثين متعددين، وسيتعامل النموذج مع ميزات كل متحدث عبر رمز `<|speaker:i|>`. يمكنك بعد ذلك التحكم في أداء النموذج باستخدام رمز معرف المتحدث، مما يسمح بتوليد واحد يتضمن متحدثين متعددين. لم تعد بحاجة لرفع ملفات مرجعية منفصلة لكل متحدث.
 
-### توليد الحوارات متعددة الأدوار
+### توليد حوارات متعددة الأدوار
 
-بفضل توسيع سياق النموذج ، يمكن لنموذجنا الآن استخدام معلومات السياق السابق لتحسين التعبير عن المحتوى المولّد لاحقًا ، وبالتالي زيادة طبيعية المحتوى.
+بفضل توسيع سياق النموذج، يمكن لنموذجنا الآن استخدام المعلومات السابقة لتحسين التعبير في المحتوى المولد لاحقاً، مما يزيد من طبيعية المحتوى.
 
-### استنساخ الصوت السريع
+### استنساخ صوت سريع
 
-يدعم Fish Audio S2 استنساخ الصوت الدقيق باستخدام عينات مرجعية قصيرة (عادة 10-30 ثانية). يمكن للنموذج التقاط نبرة الصوت وأسلوب التحدث والميل العاطفي ، وتوليد أصوات مستنسخة واقعية ومتسقة دون ضبط دقيق إضافي.
+يدعم Fish Audio S2 استنساخ الصوت بدقة باستخدام عينة مرجعية قصيرة (عادةً 10-30 ثانية). يلتقط النموذج نبرة الصوت، وأسلوب التحدث، والميول العاطفية، مما ينتج أصواتاً مستنسخة واقعية ومتسقة دون الحاجة إلى ضبط دقيق إضافي.
 لاستخدام خادم SGLang، راجع https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md .
 
 ---
@@ -130,8 +168,7 @@ Fish Audio S2 هو أحدث نموذج تم تطويره بواسطة [Fish Audi
 - [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS)
 - [Qwen3](https://github.com/QwenLM/Qwen3)
 
-## التقرير الفني
-
+## التقرير التقني
 ```bibtex
 @misc{fish-speech-v1.4,
       title={Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis},

+ 29 - 0
docs/assets/logo.svg

@@ -0,0 +1,29 @@
+<svg xmlns="http://www.w3.org/2000/svg" version="1.2" viewBox="0 0 512 512" width="512" height="512">
+    <style>
+        .bar { fill: #000; }
+        .lbar { fill: #b1b3b4; }
+    </style>
+
+    <rect class="lbar" x="456.7" y="249.9" width="16" height="27.1" rx="8"/>
+    <rect class="lbar" x="424.7" y="256.6" width="16" height="47.4" rx="8"/>
+    <rect class="lbar" x="392.9" y="264.3" width="16" height="52.1" rx="8"/>
+    <rect class="lbar" x="360.9" y="270.7" width="16" height="51.6" rx="8"/>
+    <rect class="lbar" x="328.9" y="276.4" width="16" height="46" rx="8"/>
+    <rect class="lbar" x="297.1" y="299.3" width="16" height="21.4" rx="8"/>
+
+    <rect class="bar" x="38.1" y="200" width="16" height="19.4" rx="8"/>
+    <rect class="bar" x="71" y="202.7" width="16" height="30.7" rx="8"/>
+    <rect class="bar" x="103.9" y="198.4" width="16" height="77.4" rx="8"/>
+    <rect class="bar" x="136.9" y="192" width="16" height="20" rx="8"/>
+    <rect class="bar" x="136.9" y="245.4" width="16" height="58.3" rx="8"/>
+    <rect class="bar" x="424.7" y="185.2" width="16" height="60.2" rx="8"/>
+    <rect class="bar" x="392.9" y="178.1" width="16" height="75.4" rx="8"/>
+    <rect class="bar" x="360.9" y="175" width="16" height="86.6" rx="8"/>
+    <rect class="bar" x="328.9" y="177" width="16" height="87.8" rx="8"/>
+    <rect class="bar" x="297.1" y="181.9" width="16" height="107.1" rx="8"/>
+    <rect class="bar" x="264.5" y="190.2" width="16" height="120.1" rx="8"/>
+    <rect class="bar" x="232.6" y="204.1" width="16" height="115.8" rx="8"/>
+    <rect class="bar" x="200.6" y="222.4" width="16" height="102.2" rx="8"/>
+    <rect class="bar" x="456.7" y="204.1" width="16" height="38" rx="8"/>
+    <rect class="bar" x="168.6" y="235.1" width="16" height="100.8" rx="8"/>
+</svg>

+ 2 - 1
docs/en/finetune.md

@@ -1,6 +1,7 @@
 # Fine-tuning
 
-Obviously, when you opened this page, you were not satisfied with the performance of the zero-shot pre-trained model. You want to fine-tune a model to improve its performance on your dataset.
+!!! warning
+    We strongly do not recommend fine-tuning an RL-trained model. Fine-tuning after RL can shift the model distribution, which may lead to degraded performance.
 
 In the current version, you only need to finetune the 'LLAMA' part.
 

+ 79 - 43
docs/en/index.md

@@ -1,7 +1,7 @@
 <div align="center">
 <h1>Fish Speech</h1>
 
-**English** | [简体中文](../zh/) | [Portuguese](../pt/) | [日本語](../ja/) | [한국어](../ko/) | [العربية](../ar/) <br>
+<p><strong>English</strong> | <a href="../zh/">简体中文</a> | <a href="../pt/">Portuguese</a> | <a href="../ja/">日本語</a> | <a href="../ko/">한국어</a> | <a href="../ar/">العربية</a></p>
 
 <a href="https://www.producthunt.com/products/fish-speech?embed=true&utm_source=badge-top-post-badge&utm_medium=badge&utm_source=badge-fish&#0045;audio&#0045;s1" target="_blank"><img src="https://api.producthunt.com/widgets/embed-image/v1/top-post-badge.svg?post_id=1023740&theme=light&period=daily&t=1761164814710" alt="Fish&#0032;Audio&#0032;S1 - Expressive&#0032;Voice&#0032;Cloning&#0032;and&#0032;Text&#0045;to&#0045;Speech | Product Hunt" style="width: 250px; height: 54px;" width="250" height="54" /></a>
 <a href="https://trendshift.io/repositories/7014" target="_blank">
@@ -30,97 +30,134 @@
 </div>
 
 <div align="center">
-    <a target="_blank" href="https://huggingface.co/spaces/TTS-AGI/TTS-Arena-V2">
-      <img alt="TTS-Arena2 Score" src="https://img.shields.io/badge/TTS_Arena2-Rank_%231-gold?style=flat-square&logo=trophy&logoColor=white">
+    <a target="_blank" href="https://huggingface.co/fishaudio/s2">
+        <img alt="HuggingFace Model" src="https://img.shields.io/badge/🤗%20-models-orange"/>
     </a>
-    <a target="_blank" href="https://huggingface.co/spaces/fishaudio/fish-speech-1">
-        <img alt="Huggingface" src="https://img.shields.io/badge/🤗%20-space%20demo-yellow"/>
+    <a target="_blank" href="https://fish.audio/blog/fish-audio-open-sources-s2/">
+        <img alt="Fish Audio Blog" src="https://img.shields.io/badge/Blog-Fish_Audio_S2-1f7a8c?style=flat-square&logo=readme&logoColor=white"/>
     </a>
-    <a target="_blank" href="https://huggingface.co/fishaudio/s2-pro">
-        <img alt="HuggingFace Model" src="https://img.shields.io/badge/🤗%20-models-orange"/>
+    <a target="_blank" href="https://github.com/fishaudio/fish-speech/blob/main/FishAudioS2TecReport.pdf">
+        <img alt="Paper | Technical Report" src="https://img.shields.io/badge/Paper-Technical_Report-b31b1b?style=flat-square"/>
     </a>
 </div>
 
 !!! info "License Notice"
-    This codebase and its associated model weights are released under **FISH AUDIO RESEARCH LICENSE**. Please refer to [LICENSE](https://github.com/fishaudio/fish-speech/blob/main/LICENSE) for more details.
+    This codebase and its associated model weights are released under **FISH AUDIO RESEARCH LICENSE**. Please refer to [LICENSE](https://github.com/fishaudio/fish-speech/blob/main/LICENSE) for more details. We will take action against any violation of the license.
 
 !!! warning "Legal Disclaimer"
     We do not hold any responsibility for any illegal usage of the codebase. Please refer to your local laws about DMCA and other related laws.
 
-## Get Started
+## Quick Start
+
+### For Humans
 
-This is the official documentation for Fish Speech. Please follow the instructions to get started easily.
+Here is the official documentation for Fish Audio S2; follow the instructions to get started easily.
 
-- [Installation](install.md)
-- [Command Line Inference](inference.md#command-line-inference)
-- [WebUI Inference](inference.md#webui-inference)
-- [Server Inference](server.md)
-- [Docker Setup](install.md#docker-setup)
+- [Installation](https://speech.fish.audio/install/)
+- [Command Line Inference](https://speech.fish.audio/inference/#command-line-inference)
+- [WebUI Inference](https://speech.fish.audio/inference/#webui-inference)
+- [Server Inference](https://speech.fish.audio/server/)
+- [Docker Setup](https://speech.fish.audio/install/#docker-setup)
 
-!!! note
-    For SGLang server, please read [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md).
+!!! warning
+    **For SGLang server, please read [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md).**
 
 ### For LLM Agent
 
-```text
+```
 Install and configure Fish-Audio S2 by following the instructions here: https://speech.fish.audio/install/
 ```
 
-## Fish Audio S2
-**The best text-to-speech system in both open-source and closed-source**
-
-Fish Audio S2 is the latest model developed by [Fish Audio](https://fish.audio/), designed to generate speech that sounds natural, authentic, and emotionally rich—not mechanical, flat, or confined to studio-style reading.
+## Fish Audio S2
+**The best text-to-speech system among both open-source and closed-source systems**
 
-Fish Audio S2 focuses on everyday conversations, supporting native multi-speaker and multi-round generation. It also supports instruction control.
+Fish Audio S2 is the latest model developed by [Fish Audio](https://fish.audio/). Trained on over 10 million hours of audio across approximately 50 languages, S2 combines reinforcement learning alignment with a Dual-Autoregressive architecture to generate speech that sounds natural, realistic, and emotionally rich.
 
-The S2 series includes multiple models. The open-source model is S2-Pro, which is the most powerful model in the series.
+S2 supports fine-grained inline control of prosody and emotion using natural-language tags like `[laugh]`, `[whispers]`, and `[super happy]`, as well as native multi-speaker and multi-turn generation.
 
-Please visit the [Fish Audio website](https://fish.audio/) for a real-time experience.
+Visit the [Fish Audio website](https://fish.audio/) for a live playground. Read the [blog post](https://fish.audio/blog/fish-audio-open-sources-s2/) for more details.
 
 ### Model Variants
 
 | Model | Size | Availability | Description |
 |------|------|-------------|-------------|
-| S2-Pro | 4B Parameters | [huggingface](https://huggingface.co/fishaudio/s2-pro) | Full-featured flagship model with the highest quality and stability |
+| S2-Pro | 4B parameters | [HuggingFace](https://huggingface.co/fishaudio/s2-pro) | Full-featured flagship model with maximum quality and stability |
 
-For more details on the models, please see the technical report.
+More details of the model can be found in the [technical report](https://arxiv.org/abs/2411.01156).
+
+## Benchmark Results
+
+| Benchmark | Fish Audio S2 |
+|------|------|
+| Seed-TTS Eval — WER (Chinese) | **0.54%** (best overall) |
+| Seed-TTS Eval — WER (English) | **0.99%** (best overall) |
+| Audio Turing Test (with instruction) | **0.515** posterior mean |
+| EmergentTTS-Eval — Win Rate | **81.88%** (highest overall) |
+| Fish Instruction Benchmark — TAR | **93.3%** |
+| Fish Instruction Benchmark — Quality | **4.51 / 5.0** |
+| Multilingual (MiniMax Testset) — Best WER | **11 of 24** languages |
+| Multilingual (MiniMax Testset) — Best SIM | **17 of 24** languages |
+
+On Seed-TTS Eval, S2 achieves the lowest WER among all evaluated models including closed-source systems: Qwen3-TTS (0.77/1.24), MiniMax Speech-02 (0.99/1.90), Seed-TTS (1.12/2.25). On the Audio Turing Test, 0.515 surpasses Seed-TTS (0.417) by 24% and MiniMax-Speech (0.387) by 33%. On EmergentTTS-Eval, S2 achieves particularly strong results in paralinguistics (91.61% win rate), questions (84.41%), and syntactic complexity (83.39%).
 
 ## Highlights
 
 <img src="../assets/totalability.png" width=200%>
 
-### Natural Language Control
+### Fine-Grained Inline Control via Natural Language
+
+S2 enables localized control over speech generation by embedding natural-language instructions directly at specific word or phrase positions within the text. Rather than relying on a fixed set of predefined tags, S2 accepts free-form textual descriptions — such as `[whisper in small voice]`, `[professional broadcast tone]`, or `[pitch up]` — allowing open-ended expression control at the word level.
+
+### Dual-Autoregressive Architecture
 
-Fish Audio S2 allows users to use natural language to control the performance, paralinguistic information, emotions, and more voice characteristics of each sentence, instead of just using short tags to vaguely control the model's performance. This greatly improves the overall quality of the generated content.
+S2 builds on a decoder-only transformer combined with an RVQ-based audio codec (10 codebooks, ~21 Hz frame rate). The Dual-AR architecture splits generation into two stages:
+
+- **Slow AR** operates along the time axis and predicts the primary semantic codebook.
+- **Fast AR** generates the remaining 9 residual codebooks at each time step, reconstructing fine-grained acoustic detail.
+
+This asymmetric design — 4B parameters along the time axis, 400M parameters along the depth axis — keeps inference efficient while preserving audio fidelity.
+
+### Reinforcement Learning Alignment
+
+S2 uses Group Relative Policy Optimization (GRPO) for post-training alignment. The same models used to filter and annotate training data are directly reused as reward models during RL — eliminating distribution mismatch between pre-training data and post-training objectives. The reward signal combines semantic accuracy, instruction adherence, acoustic preference scoring, and timbre similarity.
+
+### Production Streaming via SGLang
+
+Because the Dual-AR architecture is structurally isomorphic to standard autoregressive LLMs, S2 directly inherits all LLM-native serving optimizations from SGLang — including continuous batching, paged KV cache, CUDA graph replay, and RadixAttention-based prefix caching.
+
+On a single NVIDIA H200 GPU:
+
+- **Real-Time Factor (RTF):** 0.195
+- **Time-to-first-audio:** ~100 ms
+- **Throughput:** 3,000+ acoustic tokens/s while maintaining RTF below 0.5
 
 ### Multilingual Support
 
-Fish Audio S2 supports high-quality multilingual text-to-speech without the need for phonemes or language-specific preprocessing. Including:
+S2 supports high-quality multilingual text-to-speech without requiring phonemes or language-specific preprocessing. Supported languages include:
 
-**English, Chinese, Japanese, Korean, Arabic, German, French...**
+**English, Chinese, Japanese, Korean, Arabic, German, French...**
 
-**And more!**
+**And more!**
 
-The list is constantly expanding, please check [Fish Audio](https://fish.audio/) for the latest releases.
+The list is constantly expanding; check [Fish Audio](https://fish.audio/) for the latest releases.
 
-### Native Multi-speaker Generation
+### Native Multi-Speaker Generation
 
 <img src="../assets/chattemplate.png" width=200%>
 
-Fish Audio S2 allows users to upload reference audio containing multiple speakers, and the model will process each speaker's characteristics through the `<|speaker:i|>` token. You can then control the model's performance via speaker ID tokens, achieving multiple speakers in a single generation. No more need to upload reference audio and generate speech for each speaker individually.
+Fish Audio S2 allows users to upload reference audio containing multiple speakers; the model processes each speaker's characteristics via the `<|speaker:i|>` token. You can then steer the output with speaker ID tokens, so a single generation can include multiple speakers. You no longer need to upload reference audio separately for each speaker.
 
-### Multi-round Dialogue Generation
+### Multi-Turn Generation
 
-Thanks to the expansion of the model's context, our model can now use the information from the previous context to improve the expressiveness of the subsequent generated content, thereby enhancing the naturalness of the content.
+Thanks to the model's expanded context window, it can now use information from earlier turns to improve the expressiveness and naturalness of subsequently generated content.
 
-### Fast Voice Cloning
+### Rapid Voice Cloning
 
-Fish Audio S2 supports accurate voice cloning using short reference samples (typically 10-30 seconds). The model can capture timbre, speaking style, and emotional tendency, generating realistic and consistent cloned voices without additional fine-tuning.
+Fish Audio S2 supports accurate voice cloning using a short reference sample (typically 10–30 seconds). The model captures timbre, speaking style, and emotional tendencies, producing realistic and consistent cloned voices without additional fine-tuning.
 Please refer to https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md to use the SGLang server.
 
 ---
 
-## Acknowledgements
+## Credits
 
 - [VITS2 (daniilrobnikov)](https://github.com/daniilrobnikov/vits2)
 - [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2)
@@ -130,8 +167,7 @@ Please refer to https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni
 - [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS)
 - [Qwen3](https://github.com/QwenLM/Qwen3)
 
-## Technical Report
-
+## Technical Report
 ```bibtex
 @misc{fish-speech-v1.4,
       title={Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis},

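The fine-grained inline control described in `docs/en/index.md` above embeds free-form bracketed instructions directly at word positions. A minimal sketch of composing such input text: the bracketed tag syntax (`[laugh]`, `[whisper in small voice]`, ...) comes from the docs, while the `build_prompt` helper itself is a hypothetical illustration, not part of the fish-speech API.

```python
# Sketch: composing S2 input text with inline natural-language control tags,
# as described in the "Fine-Grained Inline Control" section above.
# The bracketed tag syntax is from the docs; this helper is hypothetical.

def build_prompt(segments):
    """Join (tag, text) pairs into a single input string.

    A tag of None means no local control for that span.
    """
    parts = []
    for tag, text in segments:
        # Prepend the free-form instruction at the position it should act on.
        parts.append(f"[{tag}] {text}" if tag else text)
    return " ".join(parts)

prompt = build_prompt([
    (None, "I can't believe you did that!"),
    ("laugh", "Okay, okay,"),
    ("whisper in small voice", "but don't tell anyone."),
])
print(prompt)
# I can't believe you did that! [laugh] Okay, okay, [whisper in small voice] but don't tell anyone.
```

The same pattern works for any free-form description (e.g. "professional broadcast tone", "pitch up"), since S2 is documented as accepting open-ended text rather than a fixed tag vocabulary.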
+ 3 - 3
docs/en/install.md

@@ -62,14 +62,14 @@ pip install -e .
 ```
 
 !!! warning
-    The `compile` option is not supported on windows and macOS, if you want to run with compile, you need to install trition by yourself.
+    The `compile` option is not supported on Windows and macOS. If you want to run with compile, you need to install Triton manually.
 
 
 ## Docker Setup
 
 Fish Audio S2 series model provides multiple Docker deployment options to suit different needs. You can use pre-built images from Docker Hub, build locally with Docker Compose, or manually build custom images.
 
-We provided Docker images for both WebUI and API server on both GPU(CUDA126 for default) and CPU. You can use the pre-built images from Docker Hub, or build locally with Docker Compose, or manually build custom images. If you want to build locally, follow the instructions below. If you just want to use the pre-built images, follow [inference guide](en/inference.md) to use directly.
+We provide Docker images for both WebUI and API server on both GPU (CUDA126 by default) and CPU. You can use the pre-built images from Docker Hub, build locally with Docker Compose, or manually build custom images. If you want to build locally, follow the instructions below. If you only want to use pre-built images, follow the [inference guide](inference.md).
 
 ### Prerequisites
 
@@ -115,7 +115,7 @@ API_PORT=8080            # API server port
 UV_VERSION=0.8.15        # UV package manager version
 ```
 
-The comand will build the image and run the container. You can access the WebUI at `http://localhost:7860` and the API server at `http://localhost:8080`.
+The command will build the image and run the container. You can access the WebUI at `http://localhost:7860` and the API server at `http://localhost:8080`.
 
 ### Manual Docker Build
 

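The native multi-speaker generation documented in `docs/en/index.md` above routes each voice through a `<|speaker:i|>` token. A sketch of tagging a multi-turn script with that convention: the token format is from the docs, while the helper function and the exact concatenation (no extra separators between turns) are illustrative assumptions.

```python
# Sketch: tagging a multi-turn dialogue with S2 speaker tokens, per the
# "Native Multi-Speaker Generation" section above. The <|speaker:i|> token
# format is from the docs; everything else here is a hypothetical sketch.

def tag_dialogue(turns):
    """Prefix each (speaker_id, text) turn with its <|speaker:i|> token."""
    return "".join(f"<|speaker:{i}|>{text}" for i, text in turns)

script = tag_dialogue([
    (0, "Did you catch the launch this morning?"),
    (1, "I did! The stream cut out right at liftoff though."),
    (0, "Classic. Let's write it up for the newsletter."),
])
print(script)
```

With a single multi-speaker reference audio uploaded, the speaker IDs in the script select which cloned voice renders each turn, so one generation covers the whole conversation.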
+ 74 - 37
docs/ja/index.md

@@ -1,7 +1,7 @@
 <div align="center">
 <h1>Fish Speech</h1>
 
-[English](../en/) | [简体中文](../zh/) | [Portuguese](../pt/) | **日本語** | [한국어](../ko/) | [العربية](../ar/) <br>
+<p><a href="../en/">English</a> | <a href="../zh/">简体中文</a> | <a href="../pt/">Portuguese</a> | <strong>日本語</strong> | <a href="../ko/">한국어</a> | <a href="../ar/">العربية</a></p>
 
 <a href="https://www.producthunt.com/products/fish-speech?embed=true&utm_source=badge-top-post-badge&utm_medium=badge&utm_source=badge-fish&#0045;audio&#0045;s1" target="_blank"><img src="https://api.producthunt.com/widgets/embed-image/v1/top-post-badge.svg?post_id=1023740&theme=light&period=daily&t=1761164814710" alt="Fish&#0032;Audio&#0032;S1 - Expressive&#0032;Voice&#0032;Cloning&#0032;and&#0032;Text&#0045;to&#0045;Speech | Product Hunt" style="width: 250px; height: 54px;" width="250" height="54" /></a>
 <a href="https://trendshift.io/repositories/7014" target="_blank">
@@ -30,14 +30,14 @@
 </div>
 
 <div align="center">
-    <a target="_blank" href="https://huggingface.co/spaces/TTS-AGI/TTS-Arena-V2">
-      <img alt="TTS-Arena2 Score" src="https://img.shields.io/badge/TTS_Arena2-Rank_%231-gold?style=flat-square&logo=trophy&logoColor=white">
+    <a target="_blank" href="https://huggingface.co/fishaudio/s2">
+        <img alt="HuggingFace Model" src="https://img.shields.io/badge/🤗%20-models-orange"/>
     </a>
-    <a target="_blank" href="https://huggingface.co/spaces/fishaudio/fish-speech-1">
-        <img alt="Huggingface" src="https://img.shields.io/badge/🤗%20-space%20demo-yellow"/>
+    <a target="_blank" href="https://fish.audio/blog/fish-audio-open-sources-s2/">
+        <img alt="Fish Audio Blog" src="https://img.shields.io/badge/Blog-Fish_Audio_S2-1f7a8c?style=flat-square&logo=readme&logoColor=white"/>
     </a>
-    <a target="_blank" href="https://huggingface.co/fishaudio/s2-pro">
-        <img alt="HuggingFace Model" src="https://img.shields.io/badge/🤗%20-models-orange"/>
+    <a target="_blank" href="https://github.com/fishaudio/fish-speech/blob/main/FishAudioS2TecReport.pdf">
+        <img alt="Paper | Technical Report" src="https://img.shields.io/badge/Paper-Technical_Report-b31b1b?style=flat-square"/>
     </a>
 </div>
 
@@ -47,80 +47,118 @@
 !!! warning "法的免責事項"
     私たちは、コードベースのいかなる違法な使用に対しても責任を負いません。DMCA およびその他の関連法に関する現地の規制を参照してください。
 
-## ここから始める
+## クイックスタート
+
+### まずはドキュメントから
 
-これは Fish Speech の公式ドキュメントです。説明に従って簡単に使い始めることができます。
+Fish Audio S2 の公式ドキュメントです。以下からすぐに始められます。
 
-- [インストール](install.md)
-- [コマンドライン推論](inference.md)
-- [WebUI 推論](inference.md)
-- [サーバー推論](server.md)
-- [Docker セットアップ](install.md)
+- [インストール](https://speech.fish.audio/ja/install/)
+- [コマンドライン推論](https://speech.fish.audio/ja/inference/)
+- [WebUI 推論](https://speech.fish.audio/ja/inference/)
+- [サーバー推論](https://speech.fish.audio/ja/server/)
+- [Docker セットアップ](https://speech.fish.audio/ja/install/)
 
-!!! note
-    SGLang サーバーについては [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md) を参照してください。
+!!! warning
+    **SGLang サーバーについては [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md) を参照してください。**
 
 ### LLM Agent 向け
 
-```text
+```
 https://speech.fish.audio/ja/install/ の手順に従って、Fish Audio S2 をインストール・設定してください。
 ```
 
 ## Fish Audio S2
-**オープンソースおよびクローズドソースの中で最高峰のテキスト読み上げシステム**
-
-Fish Audio S2 は [Fish Audio](https://fish.audio/) によって開発された最新のモデルで、自然でリアル、かつ感情豊かな音声を生成するように設計されています。機械的でも平坦でもなく、スタジオスタイルの朗読に限定されません。
+**オープンソースおよびクローズドソースの中で最も優れたテキスト読み上げシステム**
 
-Fish Audio S2 は日常会話に焦点を当てており、ネイティブなマルチ話者およびマルチターン生成をサポートしています。また、指示制御もサポートしています。
+Fish Audio S2 は [Fish Audio](https://fish.audio/) が開発した最新モデルです。約 50 言語・1,000 万時間超の音声データで学習され、強化学習アラインメントと Dual-Autoregressive アーキテクチャを組み合わせることで、自然でリアルかつ感情表現豊かな音声を生成します。
 
-S2 シリーズには複数のモデルが含まれており、オープンソースモデルは S2-Pro で、シリーズの中で最も強力なモデルです。
+S2 は `[laugh]`、`[whispers]`、`[super happy]` といった自然言語タグで、韻律や感情を文中の任意位置で細かく制御できます。さらに、マルチスピーカー生成とマルチターン生成にもネイティブ対応しています。
 
-リアルタイム体験については、[Fish Audio Webサイト](https://fish.audio/) をご覧ください。
+ライブデモは [Fish Audio ウェブサイト](https://fish.audio/) から、詳細は [ブログ記事](https://fish.audio/blog/fish-audio-open-sources-s2/) をご覧ください。
 
 ### モデルバリアント
 
 | モデル | サイズ | 利用可能性 | 説明 |
 |------|------|-------------|-------------|
-| S2-Pro | 4B パラメータ | [huggingface](https://huggingface.co/fishaudio/s2-pro) | 最高品質と安定性を備えたフル機能のフラッグシップモデル |
+| S2-Pro | 4B パラメータ | [HuggingFace](https://huggingface.co/fishaudio/s2-pro) | 品質と安定性を最大化したフル機能のフラッグシップモデル |
 
-モデルの詳細については、技術レポートを参照してください。
+モデルの詳細は[技術レポート](https://arxiv.org/abs/2411.01156)をご参照ください。
+
+## ベンチマーク結果
+
+| ベンチマーク | Fish Audio S2 |
+|------|------|
+| Seed-TTS Eval — WER(中国語) | **0.54%**(全体最良) |
+| Seed-TTS Eval — WER(英語) | **0.99%**(全体最良) |
+| Audio Turing Test(指示あり) | **0.515** 事後平均値 |
+| EmergentTTS-Eval — 勝率 | **81.88%**(全体最高) |
+| Fish Instruction Benchmark — TAR | **93.3%** |
+| Fish Instruction Benchmark — 品質 | **4.51 / 5.0** |
+| 多言語(MiniMax Testset)— 最良 WER | **24 言語中 11 言語** |
+| 多言語(MiniMax Testset)— 最良 SIM | **24 言語中 17 言語** |
+
+Seed-TTS Eval では、S2 はクローズドソースを含む全評価モデルの中で最小 WER を達成しました:Qwen3-TTS(0.77/1.24)、MiniMax Speech-02(0.99/1.90)、Seed-TTS(1.12/2.25)。Audio Turing Test では 0.515 を記録し、Seed-TTS(0.417)比で 24%、MiniMax-Speech(0.387)比で 33% 上回りました。EmergentTTS-Eval では、副言語情報(91.61%)、疑問文(84.41%)、統語的複雑性(83.39%)で特に高い成績を示しています。
 
 ## ハイライト
 
 <img src="../assets/totalability.png" width=200%>
 
-### 自然言語制御
+### 自然言語による細粒度インライン制御
+
+Fish Audio S2 では、テキスト内の特定の単語やフレーズ位置に自然言語の指示を直接埋め込むことで、音声生成を局所的に制御できます。固定の事前定義タグに依存するのではなく、S2 は [whisper in small voice]、[professional broadcast tone]、[pitch up] のような自由形式のテキスト記述を受け付け、単語レベルで表現をオープンエンドに制御できます。
+
+### 二重自己回帰(Dual-Autoregressive)アーキテクチャ
 
-Fish Audio S2 では、ユーザーが自然言語を使用して各文のパフォーマンス、副言語情報、感情、その他の音声特性を制御できます。短いタグを使用してモデルのパフォーマンスを曖昧に制御するだけでなく、生成されるコンテンツ全体の品質を大幅に向上させます。
+S2 はデコーダー専用 Transformer と RVQ ベースの音声コーデック(10 codebooks、約 21 Hz)を組み合わせています。Dual-AR は生成を 2 段階に分割します。
+
+- **Slow AR** は時間軸方向に動作し、主となる semantic codebook を予測。
+- **Fast AR** は各時刻で残り 9 個の residual codebook を生成し、細かな音響ディテールを復元。
+
+この非対称設計(時間軸 4B パラメータ、深さ軸 400M パラメータ)により、音質を保ちながら推論効率を高めています。
+
+### 強化学習アラインメント
+
+S2 は後学習アラインメントに Group Relative Policy Optimization(GRPO)を採用しています。学習データのフィルタリングとアノテーションに使った同一モデル群を、そのまま RL の報酬モデルとして再利用することで、事前学習データ分布と事後学習目的のミスマッチを抑制しています。報酬信号には、意味的正確性、指示追従性、音響的選好スコア、音色類似度が含まれます。
+
+### SGLang による本番向けストリーミング
+
+Dual-AR は構造的に標準的な自己回帰 LLM と同型のため、S2 は SGLang の LLM 向け最適化をそのまま活用できます。たとえば continuous batching、paged KV cache、CUDA graph replay、RadixAttention ベースの prefix caching です。
+
+単一の NVIDIA H200 GPU での実測:
+
+- **RTF(Real-Time Factor):** 0.195
+- **初回音声出力までの時間:** 約 100 ms
+- **スループット:** RTF 0.5 未満を維持しつつ 3,000+ acoustic tokens/s
 
 ### 多言語サポート
 
-Fish Audio S2 は、音素や特定の言語のプリプロセスを必要とせず、高品質な多言語テキスト読み上げをサポートしています。以下を含みます:
+Fish Audio S2 は、音素や言語固有の前処理を必要とせずに、高品質な多言語テキスト読み上げをサポートします。以下を含みます:
 
 **英語、中国語、日本語、韓国語、アラビア語、ドイツ語、フランス語...**
 
-**さらに追加予定!**
+**その他多数!**
 
 リストは常に拡大しています。最新のリリースについては [Fish Audio](https://fish.audio/) を確認してください。
 
-### ネイティブマルチ話者生成
+### ネイティブなマルチスピーカー生成
 
 <img src="../assets/chattemplate.png" width=200%>
 
-Fish Audio S2 では、ユーザーが複数の話者を含むリファレンスオーディオをアップロードでき、モデルは `<|speaker:i|>` トークンを通じて各話者の特徴を処理します。その後、話者 ID トークンを介してモデルのパフォーマンスを制御し、1 回の生成で複数の話者を実現できます。話者ごとに個別にリファレンスオーディオをアップロードして音声を生成する必要はもうありません。
+Fish Audio S2 では、ユーザーが複数のスピーカーを含む参照オーディオをアップロードでき、モデルは `<|speaker:i|>` トークンを介して各スピーカーの特徴を処理します。その後、スピーカーIDトークンを使用してモデルのパフォーマンスを制御し、1回の生成で複数のスピーカーを含めることができます。以前のように各スピーカーに対して個別に参照オーディオをアップロードして音声を生成する必要はもうありません。
 
 ### マルチターン対話生成
 
-モデルのコンテキストの拡張により、以前のコンテキストの情報を使用して、その後に生成されるコンテンツの表現力を向上させ、コンテンツの自然度を高めることができるようになりました。
+モデルのコンテキストの拡張により、以前の情報を使用して後続の生成されたコンテンツの表現力を向上させ、コンテンツの自然さを高めることができるようになりました。
 
-### 高速音声クローン
+### 高速音声クローニング
 
-Fish Audio S2 は、短いリファレンスサンプル(通常 10〜30 秒)を使用した正確な音声クローンをサポートしています。モデルは音色、話し方、感情的な傾向を捉えることができ、追加の微調整なしでリアルで一貫したクローン音声を生成できます。
+Fish Audio S2 は、短い参照サンプル(通常10〜30秒)を使用した正確な音声クローニングをサポートしています。モデルは音色、話し方、感情的な傾向を捉え、追加の微調整なしでリアルで一貫したクローン音声を生成します。
 SGLang サーバーの利用については https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md を参照してください。
 
 ---
 
-## 謝辞
+## クレジット
 
 - [VITS2 (daniilrobnikov)](https://github.com/daniilrobnikov/vits2)
 - [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2)
@@ -130,8 +168,7 @@ SGLang サーバーの利用については https://github.com/sgl-project/sglan
 - [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS)
 - [Qwen3](https://github.com/QwenLM/Qwen3)
 
-## 技術報告
-
+## 技術レポート
 ```bibtex
 @misc{fish-speech-v1.4,
       title={Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis},

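The serving figures quoted in the index pages above can be tied together with a little arithmetic: at 10 codebooks and a roughly 21 Hz frame rate, one second of audio costs about 210 acoustic tokens, so the quoted 3,000+ tokens/s throughput corresponds to generating audio well above real time. All constants below are taken from the surrounding text; the calculation is only illustrative and ignores batching overheads.

```python
# Relating the codec spec quoted above (10 codebooks, ~21 Hz frame rate)
# to the quoted throughput of 3,000+ acoustic tokens/s on an H200.
# Constants come from the docs; the arithmetic is illustrative only.

CODEBOOKS = 10
FRAME_RATE_HZ = 21                                   # approximate codec frame rate
TOKENS_PER_AUDIO_SECOND = CODEBOOKS * FRAME_RATE_HZ  # 210 tokens per second of audio

def realtime_speedup(token_throughput):
    """Seconds of audio produced per wall-clock second at a given token rate."""
    return token_throughput / TOKENS_PER_AUDIO_SECOND

print(round(realtime_speedup(3000), 1))  # about 14.3x faster than real time
```

Note that this batch-throughput figure is a different measurement from the single-stream RTF of 0.195 also quoted above, which describes latency for one request rather than aggregate token rate.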
+ 76 - 39
docs/ko/index.md

@@ -1,7 +1,7 @@
 <div align="center">
 <h1>Fish Speech</h1>
 
-[English](../en/) | [简体中文](../zh/) | [Portuguese](../pt/) | [日本語](../ja/) | **한국어** | [العربية](../ar/) <br>
+<p><a href="../en/">English</a> | <a href="../zh/">简体中文</a> | <a href="../pt/">Portuguese</a> | <a href="../ja/">日本語</a> | <strong>한국어</strong> | <a href="../ar/">العربية</a></p>
 
 <a href="https://www.producthunt.com/products/fish-speech?embed=true&utm_source=badge-top-post-badge&utm_medium=badge&utm_source=badge-fish&#0045;audio&#0045;s1" target="_blank"><img src="https://api.producthunt.com/widgets/embed-image/v1/top-post-badge.svg?post_id=1023740&theme=light&period=daily&t=1761164814710" alt="Fish&#0032;Audio&#0032;S1 - Expressive&#0032;Voice&#0032;Cloning&#0032;and&#0032;Text&#0045;to&#0045;Speech | Product Hunt" style="width: 250px; height: 54px;" width="250" height="54" /></a>
 <a href="https://trendshift.io/repositories/7014" target="_blank">
@@ -30,14 +30,14 @@
 </div>
 
 <div align="center">
-    <a target="_blank" href="https://huggingface.co/spaces/TTS-AGI/TTS-Arena-V2">
-      <img alt="TTS-Arena2 Score" src="https://img.shields.io/badge/TTS_Arena2-Rank_%231-gold?style=flat-square&logo=trophy&logoColor=white">
+    <a target="_blank" href="https://huggingface.co/fishaudio/s2">
+        <img alt="HuggingFace Model" src="https://img.shields.io/badge/🤗%20-models-orange"/>
     </a>
-    <a target="_blank" href="https://huggingface.co/spaces/fishaudio/fish-speech-1">
-        <img alt="Huggingface" src="https://img.shields.io/badge/🤗%20-space%20demo-yellow"/>
+    <a target="_blank" href="https://fish.audio/blog/fish-audio-open-sources-s2/">
+        <img alt="Fish Audio Blog" src="https://img.shields.io/badge/Blog-Fish_Audio_S2-1f7a8c?style=flat-square&logo=readme&logoColor=white"/>
     </a>
-    <a target="_blank" href="https://huggingface.co/fishaudio/s2-pro">
-        <img alt="HuggingFace Model" src="https://img.shields.io/badge/🤗%20-models-orange"/>
+    <a target="_blank" href="https://github.com/fishaudio/fish-speech/blob/main/FishAudioS2TecReport.pdf">
+        <img alt="Paper | Technical Report" src="https://img.shields.io/badge/Paper-Technical_Report-b31b1b?style=flat-square"/>
     </a>
 </div>
 
@@ -47,80 +47,118 @@
 !!! warning "법적 면책 조항"
     코드베이스의 불법적인 사용에 대해 당사는 어떠한 책임도 지지 않습니다. DMCA 및 기타 관련 법률에 관한 현지 규정을 참조하십시오.
 
-## 시작하기
+## 빠른 시작
+
+### 문서로 바로 시작하기
 
-Fish Speech의 공식 문서입니다. 지침에 따라 쉽게 시작할 수 있습니다.
+Fish Audio S2 공식 문서입니다. 아래 링크에서 바로 시작할 수 있습니다.
 
-- [설치](install.md)
-- [커맨드라인 추론](inference.md)
-- [WebUI 추론](inference.md)
-- [서버 추론](server.md)
-- [Docker 설정](install.md)
+- [설치](https://speech.fish.audio/ko/install/)
+- [커맨드라인 추론](https://speech.fish.audio/ko/inference/)
+- [WebUI 추론](https://speech.fish.audio/ko/inference/)
+- [서버 추론](https://speech.fish.audio/ko/server/)
+- [Docker 설정](https://speech.fish.audio/ko/install/)
 
-!!! note
-    SGLang 서버는 [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md)를 참고하세요.
+!!! warning
+    **SGLang 서버는 [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md)를 참고하세요.**
 
 ### LLM Agent 가이드
 
-```text
+```
 https://speech.fish.audio/ko/install/ 문서를 따라 Fish Audio S2를 설치하고 구성하세요.
 ```
 
 ## Fish Audio S2
-**오픈 소스 및 클로즈드 소스 중 최고봉의 텍스트 음성 변환 시스템**
-
-Fish Audio S2는 [Fish Audio](https://fish.audio/)에서 개발한 최신 모델로, 자연스럽고 사실적이며 감정이 풍부한 음성을 생성하도록 설계되었습니다. 기계적이거나 평면적이지 않으며, 스튜디오 스타일의 낭독에 국한되지 않습니다.
+**오픈 소스와 클로즈드 소스 모두에서 가장 뛰어난 텍스트 음성 변환 시스템**
 
-Fish Audio S2는 일상 대화에 중점을 두고 있으며, 네이티브 다중 화자 및 다중 턴 생성을 지원합니다. 또한 명령 제어를 지원합니다.
+Fish Audio S2는 [Fish Audio](https://fish.audio/)가 개발한 최신 모델입니다. 약 50개 언어, 1,000만 시간 이상의 오디오 데이터로 학습되었고, 강화학습 정렬과 Dual-Autoregressive 아키텍처를 결합해 자연스럽고 사실적이며 감정 표현이 풍부한 음성을 생성합니다.
 
-S2 시리즈에는 여러 모델이 포함되어 있으며, 오픈 소스 모델은 S2-Pro로, 시리즈 중에서 가장 강력한 모델입니다.
+S2는 `[laugh]`, `[whispers]`, `[super happy]` 같은 자연어 태그를 사용해 운율과 감정을 문장 내부에서 세밀하게 제어할 수 있으며, 멀티 화자/멀티 턴 생성도 네이티브로 지원합니다.
 
-실시간 체험은 [Fish Audio 웹사이트](https://fish.audio/)를 방문해 주세요.
+실시간 데모는 [Fish Audio 웹사이트](https://fish.audio/)에서, 자세한 내용은 [블로그 글](https://fish.audio/blog/fish-audio-open-sources-s2/)에서 확인할 수 있습니다.
 
 ### 모델 변형
 
 | 모델 | 크기 | 가용성 | 설명 |
 |------|------|-------------|-------------|
-| S2-Pro | 4B 매개변수 | [huggingface](https://huggingface.co/fishaudio/s2-pro) | 최고의 품질과 안정성을 갖춘 풀 기능 플래그십 모델 |
+| S2-Pro | 4B 매개변수 | [HuggingFace](https://huggingface.co/fishaudio/s2-pro) | 최고 수준의 품질과 안정성을 제공하는 풀기능 플래그십 모델 |
 
-모델에 대한 자세한 내용은 기술 보고서를 참조하십시오.
+모델 상세는 [기술 보고서](https://arxiv.org/abs/2411.01156)를 참고하세요.
 
-## 하이라이트
+## 벤치마크 결과
+
+| 벤치마크 | Fish Audio S2 |
+|------|------|
+| Seed-TTS Eval — WER (중국어) | **0.54%** (전체 최고) |
+| Seed-TTS Eval — WER (영어) | **0.99%** (전체 최고) |
+| Audio Turing Test (지시 포함) | **0.515** 사후 평균 |
+| EmergentTTS-Eval — 승률 | **81.88%** (전체 최고) |
+| Fish Instruction Benchmark — TAR | **93.3%** |
+| Fish Instruction Benchmark — 품질 | **4.51 / 5.0** |
+| 다국어 (MiniMax Testset) — 최고 WER | **24개 언어 중 11개** |
+| 다국어 (MiniMax Testset) — 최고 SIM | **24개 언어 중 17개** |
+
+Seed-TTS Eval에서 S2는 클로즈드 소스 시스템을 포함한 전체 비교 모델 중 가장 낮은 WER를 기록했습니다: Qwen3-TTS (0.77/1.24), MiniMax Speech-02 (0.99/1.90), Seed-TTS (1.12/2.25). Audio Turing Test에서는 0.515를 기록해 Seed-TTS (0.417) 대비 24%, MiniMax-Speech (0.387) 대비 33% 높았습니다. EmergentTTS-Eval에서는 파라언어 표현(91.61%), 의문문(84.41%), 구문 복잡도(83.39%)에서 특히 강한 성능을 보였습니다.
+
+## 주요 특징
 
 <img src="../assets/totalability.png" width=200%>
 
-### 자연어 제어
+### 자연어 기반 세밀한 인라인 제어
+
+Fish Audio S2는 텍스트의 특정 단어 또는 구문 위치에 자연어 지시를 직접 삽입해 음성 생성을 국소적으로 제어할 수 있습니다. 고정된 사전 정의 태그에 의존하는 대신, S2는 [whisper in small voice], [professional broadcast tone], [pitch up] 같은 자유 형식 텍스트 설명을 받아 단어 수준의 개방형 표현 제어를 지원합니다.
+
+### Dual-Autoregressive 아키텍처
 
-Fish Audio S2를 사용하면 사용자가 자연어를 사용하여 각 문장의 퍼포먼스, 부언어 정보, 감정 및 기타 음성 특성을 제어할 수 있습니다. 짧은 태그를 사용하여 모델의 퍼포먼스를 모호하게 제어하는 것뿐만 아니라 생성된 콘텐츠 전체의 품질을 크게 향상시킵니다.
+S2는 decoder-only Transformer와 RVQ 기반 오디오 코덱(10 codebooks, 약 21 Hz 프레임레이트)을 결합합니다. Dual-AR은 생성 과정을 두 단계로 나눕니다.
+
+- **Slow AR**: 시간축을 따라 동작하며 주 semantic codebook을 예측
+- **Fast AR**: 각 시점에서 나머지 9개 residual codebook을 생성해 세밀한 음향 디테일을 복원
+
+이 비대칭 설계(시간축 4B 파라미터, 깊이축 400M 파라미터)는 음질을 유지하면서 추론 효율을 높입니다.
+
+### 강화학습 정렬
+
+S2는 후학습 정렬을 위해 Group Relative Policy Optimization(GRPO)을 사용합니다. 학습 데이터 필터링/라벨링에 쓰인 동일한 모델을 RL 보상 모델로 재사용해, 사전학습 데이터 분포와 후학습 목표 간의 분포 불일치를 줄였습니다. 보상 신호는 의미 정확도, 지시 준수도, 음향 선호 점수, 음색 유사도를 함께 반영합니다.
+
+### SGLang 기반 프로덕션 스트리밍
+
+Dual-AR 구조는 표준 자기회귀 LLM과 구조적으로 동형이기 때문에, S2는 SGLang의 LLM 서빙 최적화를 그대로 활용합니다. 예: continuous batching, paged KV cache, CUDA graph replay, RadixAttention 기반 prefix caching.
+
+NVIDIA H200 단일 GPU 기준:
+
+- **실시간 계수(RTF):** 0.195
+- **첫 오디오 출력까지 시간:** 약 100 ms
+- **처리량:** RTF 0.5 미만 유지 시 3,000+ acoustic tokens/s
 
 ### 다국어 지원
 
-Fish Audio S2는 음소나 특정 언어의 전처리 없이도 고품질의 다국어 텍스트 음성 변환을 지원합니다. 다음을 포함합니다:
+Fish Audio S2는 음소나 언어별 전처리 없이 고품질 다국어 텍스트 음성 변환을 지원합니다. 포함 사항:
 
 **영어, 중국어, 일본어, 한국어, 아랍어, 독일어, 프랑스어...**
 
-**그리고 더욱 추가될 예정입니다!**
+**그리고 더 많이!**
 
-목록은 지속적으로 확대되고 있으며, 최신 릴리스는 [Fish Audio](https://fish.audio/)를 확인하십시오.
+목록은 계속 확장되고 있습니다. 최신 릴리스는 [Fish Audio](https://fish.audio/)를 확인하세요.
 
-### 네이티브 다중 화자 생성
+### 네이티브 멀티 화자 생성
 
 <img src="../assets/chattemplate.png" width=200%>
 
-Fish Audio S2를 사용하면 사용자가 여러 화자가 포함된 참조 오디오를 업로드할 수 있으며, 모델은 `<|speaker:i|>` 토큰을 통해 각 화자의 특성을 처리합니다. 이후 화자 ID 토큰을 통해 모델의 퍼포먼스를 제어하여 한 번의 생성으로 여러 화자를 구현할 수 있습니다. 화자마다 개별적으로 참조 오디오를 업로드하고 음성을 생성할 필요가 더 이상 없습니다.
+Fish Audio S2는 사용자가 여러 화자가 포함된 참조 오디오를 업로드할 수 있도록 하며, 모델은 `<|speaker:i|>` 토큰을 통해 각 화자의 특징을 처리합니다. 그런 다음 화자 ID 토큰으로 모델의 성능을 제어하여 한 번의 생성으로 여러 화자를 포함할 수 있습니다. 이전처럼 각 화자마다 별도로 참조 오디오를 업로드하고 음성을 생성할 필요가 없습니다.
 
-### 다중 턴 대화 생성
+### 멀티 턴 대화 생성
 
-모델 컨텍스트의 확장 덕분에, 이전 컨텍스트의 정보를 사용하여 이후에 생성되는 콘텐츠의 표현력을 개선하고 콘텐츠의 자연스러움을 높일 수 있게 되었습니다.
+모델 컨텍스트의 확장 덕분에 이제 이전 정보를 활용하여 후속 생성 콘텐츠의 표현력을 높이고 콘텐츠의 자연스러움을 향상시킬 수 있습니다.
 
-### 빠른 음성 클로닝
+### 빠른 음성 복제
 
-Fish Audio S2는 짧은 참조 샘플(보통 10~30초)을 사용한 정확한 음성 클로닝을 지원합니다. 모델은 음색, 말하기 스타일 및 감정적 경향을 포착할 수 있으며, 추가 미세 조정 없이도 사실적이고 일관된 클로닝 음성을 생성할 수 있습니다.
+Fish Audio S2는 짧은 참조 샘플(일반적으로 10-30초)을 사용하여 정확한 음성 복제를 지원합니다. 모델은 음색, 말하기 스타일 및 감정적 경향을 캡처하여 추가 미세 조정 없이 사실적이고 일관된 복제 음성을 생성합니다.
 SGLang 서버 사용은 https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md 를 참고하세요.
 
 ---
 
-## 감사의 인사
+## 크레딧
 
 - [VITS2 (daniilrobnikov)](https://github.com/daniilrobnikov/vits2)
 - [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2)
@@ -131,7 +169,6 @@ SGLang 서버 사용은 https://github.com/sgl-project/sglang-omni/blob/main/sgl
 - [Qwen3](https://github.com/QwenLM/Qwen3)
 
 ## 기술 보고서
-
 ```bibtex
 @misc{fish-speech-v1.4,
       title={Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis},

+ 74 - 37
docs/pt/index.md

@@ -1,7 +1,7 @@
 <div align="center">
 <h1>Fish Speech</h1>
 
-[English](../en/) | [简体中文](../zh/) | **Portuguese** | [日本語](../ja/) | [한국어](../ko/) | [العربية](../ar/) <br>
+<p><a href="../en/">English</a> | <a href="../zh/">简体中文</a> | <strong>Portuguese</strong> | <a href="../ja/">日本語</a> | <a href="../ko/">한국어</a> | <a href="../ar/">العربية</a></p>
 
 <a href="https://www.producthunt.com/products/fish-speech?embed=true&utm_source=badge-top-post-badge&utm_medium=badge&utm_source=badge-fish&#0045;audio&#0045;s1" target="_blank"><img src="https://api.producthunt.com/widgets/embed-image/v1/top-post-badge.svg?post_id=1023740&theme=light&period=daily&t=1761164814710" alt="Fish&#0032;Audio&#0032;S1 - Expressive&#0032;Voice&#0032;Cloning&#0032;and&#0032;Text&#0045;to&#0045;Speech | Product Hunt" style="width: 250px; height: 54px;" width="250" height="54" /></a>
 <a href="https://trendshift.io/repositories/7014" target="_blank">
@@ -30,14 +30,14 @@
 </div>
 
 <div align="center">
-    <a target="_blank" href="https://huggingface.co/spaces/TTS-AGI/TTS-Arena-V2">
-      <img alt="TTS-Arena2 Score" src="https://img.shields.io/badge/TTS_Arena2-Rank_%231-gold?style=flat-square&logo=trophy&logoColor=white">
+    <a target="_blank" href="https://huggingface.co/fishaudio/s2">
+        <img alt="HuggingFace Model" src="https://img.shields.io/badge/🤗%20-models-orange"/>
     </a>
-    <a target="_blank" href="https://huggingface.co/spaces/fishaudio/fish-speech-1">
-        <img alt="Huggingface" src="https://img.shields.io/badge/🤗%20-space%20demo-yellow"/>
+    <a target="_blank" href="https://fish.audio/blog/fish-audio-open-sources-s2/">
+        <img alt="Fish Audio Blog" src="https://img.shields.io/badge/Blog-Fish_Audio_S2-1f7a8c?style=flat-square&logo=readme&logoColor=white"/>
     </a>
-    <a target="_blank" href="https://huggingface.co/fishaudio/s2-pro">
-        <img alt="HuggingFace Model" src="https://img.shields.io/badge/🤗%20-models-orange"/>
+    <a target="_blank" href="https://github.com/fishaudio/fish-speech/blob/main/FishAudioS2TecReport.pdf">
+        <img alt="Paper | Technical Report" src="https://img.shields.io/badge/Paper-Tecnical_Report-b31b1b?style=flat-square"/>
     </a>
 </div>
 
@@ -47,80 +47,118 @@
 !!! warning "Isenção de Responsabilidade Legal"
     Não nos responsabilizamos por qualquer uso ilegal da base de códigos. Consulte as regulamentações locais sobre DMCA e outras leis relacionadas.
 
-## Começar
+## Início Rápido
+
+### Comece pela documentação
 
-Esta é a documentação oficial do Fish Speech. Siga as instruções para começar facilmente.
+Esta é a documentação oficial do Fish Audio S2. Você pode começar por aqui:
 
-- [Instalação](install.md)
-- [Inferência por Linha de Comando](inference.md)
-- [Inferência WebUI](inference.md)
-- [Inferência via Servidor](server.md)
-- [Configuração Docker](install.md)
+- [Instalação](https://speech.fish.audio/pt/install/)
+- [Inferência por Linha de Comando](https://speech.fish.audio/pt/inference/)
+- [Inferência WebUI](https://speech.fish.audio/pt/inference/)
+- [Inferência via Servidor](https://speech.fish.audio/pt/server/)
+- [Configuração Docker](https://speech.fish.audio/pt/install/)
 
-!!! note
-    Para servidor com SGLang, consulte o [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md).
+!!! note
+    **Para servidor com SGLang, consulte o [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md).**
 
 ### Guia para agentes LLM
 
-```text
+```
 Instale e configure o Fish Audio S2 seguindo as instruções em https://speech.fish.audio/pt/install/ .
 ```
 
 ## Fish Audio S2
-**O melhor sistema de texto para fala em código aberto e código fechado**
-
-O Fish Audio S2 é o modelo mais recente desenvolvido pela [Fish Audio](https://fish.audio/), projetado para gerar fala que soe natural, autêntica e emocionalmente rica — não mecânica, monótona ou confinada à leitura em estúdio.
+**O melhor sistema de conversão de texto em fala entre código aberto e código fechado**
 
-O Fish Audio S2 foca em conversas cotidianas, suportando geração nativa de múltiplos locutores e múltiplos turnos. Também suporta controle por instruções.
+O Fish Audio S2 é o modelo mais recente da [Fish Audio](https://fish.audio/). Treinado com mais de 10 milhões de horas de áudio em cerca de 50 idiomas, o S2 combina alinhamento por reforço com uma arquitetura Dual-Autoregressive para gerar fala natural, realista e emocionalmente expressiva.
 
-A série S2 inclui vários modelos. O modelo de código aberto é o S2-Pro, que é o modelo mais poderoso da série.
+O S2 permite controle fino de prosódia e emoção dentro da própria frase com tags em linguagem natural, como `[laugh]`, `[whispers]` e `[super happy]`, além de oferecer suporte nativo a múltiplos falantes e múltiplos turnos.
 
-Visite o [site da Fish Audio](https://fish.audio/) para uma experiência em tempo real.
+Acesse o [site da Fish Audio](https://fish.audio/) para testar ao vivo e leia o [post no blog](https://fish.audio/blog/fish-audio-open-sources-s2/) para mais detalhes.
 
 ### Variantes do Modelo
 
 | Modelo | Tamanho | Disponibilidade | Descrição |
 |------|------|-------------|-------------|
-| S2-Pro | 4B Parâmetros | [huggingface](https://huggingface.co/fishaudio/s2-pro) | Modelo emblemático completo com a mais alta qualidade e estabilidade |
+| S2-Pro | 4B parâmetros | [HuggingFace](https://huggingface.co/fishaudio/s2-pro) | Modelo carro-chefe completo com máxima qualidade e estabilidade |
 
-Para mais detalhes sobre os modelos, consulte o relatório técnico.
+Mais detalhes podem ser encontrados no [relatório técnico](https://github.com/fishaudio/fish-speech/blob/main/FishAudioS2TecReport.pdf).
+
+## Resultados de Benchmark
+
+| Benchmark | Fish Audio S2 |
+|------|------|
+| Seed-TTS Eval — WER (Chinês) | **0.54%** (melhor geral) |
+| Seed-TTS Eval — WER (Inglês) | **0.99%** (melhor geral) |
+| Audio Turing Test (com instrução) | **0.515** média a posteriori |
+| EmergentTTS-Eval — Taxa de vitória | **81.88%** (maior geral) |
+| Fish Instruction Benchmark — TAR | **93.3%** |
+| Fish Instruction Benchmark — Qualidade | **4.51 / 5.0** |
+| Multilíngue (MiniMax Testset) — Melhor WER | **11 de 24** idiomas |
+| Multilíngue (MiniMax Testset) — Melhor SIM | **17 de 24** idiomas |
+
+No Seed-TTS Eval, o S2 obteve o menor WER entre todos os modelos avaliados, incluindo sistemas fechados: Qwen3-TTS (0.77/1.24), MiniMax Speech-02 (0.99/1.90) e Seed-TTS (1.12/2.25). No Audio Turing Test, o valor 0.515 supera o Seed-TTS (0.417) em 24% e o MiniMax-Speech (0.387) em 33%. No EmergentTTS-Eval, o S2 se destacou especialmente em paralinguística (91.61%), perguntas (84.41%) e complexidade sintática (83.39%).
 
 ## Destaques
 
 <img src="../assets/totalability.png" width=200%>
 
-### Controle por Linguagem Natural
+### Controle Inline Refinado via Linguagem Natural
+
+O Fish Audio S2 permite controle localizado da geração de fala ao incorporar instruções em linguagem natural diretamente em posições específicas de palavras ou frases no texto. Em vez de depender de um conjunto fixo de tags predefinidas, o S2 aceita descrições textuais livres, como `[whisper in small voice]`, `[professional broadcast tone]` ou `[pitch up]`, permitindo controle de expressão aberto no nível da palavra.
+
+### Arquitetura Dual-Autoregressive
 
-O Fish Audio S2 permite que os usuários usem linguagem natural para controlar o desempenho, informações paralinguísticas, emoções e outras características de voz de cada frase, em vez de usar apenas tags curtas para controlar vagamente o desempenho do modelo. Isso aumenta muito a qualidade geral do conteúdo gerado.
+O S2 é baseado em um transformer apenas decodificador, combinado com um codec de áudio RVQ (10 codebooks, ~21 Hz de taxa de quadros). A arquitetura Dual-AR divide a geração em duas etapas:
+
+- **Slow AR** opera no eixo temporal e prevê o codebook semântico principal.
+- **Fast AR** gera os 9 codebooks residuais restantes em cada passo de tempo, reconstruindo detalhes acústicos finos.
+
+Esse desenho assimétrico (4B parâmetros no eixo temporal e 400M no eixo de profundidade) mantém a inferência eficiente sem sacrificar fidelidade de áudio.
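Em pseudocódigo Python, o fluxo de duas etapas acima pode ser esboçado assim (os nomes `slow_ar` e `fast_ar`, bem como os valores de token, são hipotéticos e servem apenas para ilustrar a separação entre os eixos temporal e de profundidade; não é a implementação real do S2):

```python
# Esboço ilustrativo do laço Dual-AR (slow_ar e fast_ar são stubs
# hipotéticos apenas para mostrar o fluxo de duas etapas).

NUM_CODEBOOKS = 10  # 1 codebook semântico + 9 residuais (RVQ)


def slow_ar(history):
    # Etapa lenta (eixo temporal): prevê o token do codebook semântico.
    # Na prática, um transformer de ~4B parâmetros condicionado no histórico.
    return len(history) % 1024


def fast_ar(semantic_token):
    # Etapa rápida (eixo de profundidade): gera os 9 tokens residuais
    # do mesmo passo de tempo. Na prática, um módulo de ~400M parâmetros.
    return [(semantic_token + k) % 1024 for k in range(1, NUM_CODEBOOKS)]


def generate(num_frames):
    frames = []
    for _ in range(num_frames):
        semantic = slow_ar(frames)
        residuals = fast_ar(semantic)
        frames.append([semantic] + residuals)
    return frames  # cada quadro: 10 tokens; ~21 quadros ≈ 1 s de áudio


frames = generate(21)
assert all(len(frame) == NUM_CODEBOOKS for frame in frames)
```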
+
+### Alinhamento por Reforço
+
+O S2 usa Group Relative Policy Optimization (GRPO) no pós-treinamento. Os mesmos modelos usados para filtrar e anotar dados de treino são reutilizados diretamente como modelos de recompensa no RL, eliminando o desalinhamento de distribuição entre os dados de pré-treinamento e os objetivos de pós-treinamento. O sinal de recompensa combina precisão semântica, aderência à instrução, preferência acústica e similaridade de timbre.
+
+### Streaming em Produção com SGLang
+
+Como a arquitetura Dual-AR é estruturalmente isomórfica a LLMs autoregressivos padrão, o S2 herda diretamente as otimizações nativas de serving do SGLang, incluindo continuous batching, paged KV cache, CUDA graph replay e prefix caching com RadixAttention.
+
+Em uma única NVIDIA H200:
+
+- **RTF (Real-Time Factor):** 0.195
+- **Tempo até o primeiro áudio:** ~100 ms
+- **Throughput:** mais de 3.000 acoustic tokens/s mantendo RTF abaixo de 0.5
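A partir desses números publicados, dá para estimar a capacidade agregada. O cálculo abaixo é apenas uma aproximação e assume, como descrito acima, 21 quadros/s × 10 codebooks por segundo de áudio:

```python
# Estimativa simples a partir dos números acima; assume que 1 s de áudio
# corresponde a 21 quadros x 10 codebooks = 210 acoustic tokens.
frame_rate_hz = 21
num_codebooks = 10
tokens_per_audio_second = frame_rate_hz * num_codebooks  # 210

throughput = 3000  # tokens/s reportados em uma única H200
rtf = 0.195        # tempo de geração / duração do áudio gerado

# Fluxos em tempo real que o throughput agregado comporta:
concurrent_streams = throughput / tokens_per_audio_second
print(round(concurrent_streams, 1))  # 14.3

# Com RTF 0.195, gerar 10 s de áudio leva cerca de:
print(round(rtf * 10, 2))  # 1.95 segundos
```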
 
 ### Suporte Multilíngue
 
-O Fish Audio S2 suporta conversão de texto em fala multilíngue de alta qualidade sem a necessidade de fonemas ou pré-processamento específico por idioma. Incluindo:
+O Fish Audio S2 oferece suporte a conversão de texto em fala multilíngue de alta qualidade sem a necessidade de fonemas ou processamento específico de idioma. Incluindo:
 
 **Inglês, Chinês, Japonês, Coreano, Árabe, Alemão, Francês...**
 
-**E muito mais!**
+**E MUITO MAIS!**
 
-A lista está em constante expansão. Verifique a [Fish Audio](https://fish.audio/) para os lançamentos mais recentes.
+A lista está em constante expansão, verifique o [Fish Audio](https://fish.audio/) para os lançamentos mais recentes.
 
-### Geração Nativa de Múltiplos Locutores
+### Geração Nativa de Múltiplos Falantes
 
 <img src="../assets/chattemplate.png" width=200%>
 
-O Fish Audio S2 permite que os usuários carreguem áudio de referência contendo múltiplos locutores, e o modelo processará as características de cada locutor por meio do token `<|speaker:i|>`. Você pode então controlar o desempenho do modelo por meio de tokens de ID de locutor, alcançando múltiplos locutores em uma única geração. Não há mais necessidade de carregar áudio de referência e gerar fala para cada locutor individualmente.
+O Fish Audio S2 permite enviar um áudio de referência com vários falantes; o modelo associa as características de cada voz ao token `<|speaker:i|>`. Depois, você controla o comportamento do modelo com o token de ID do falante, permitindo incluir várias vozes em uma única geração. Assim, não é mais necessário enviar um áudio de referência separado e gerar a fala de cada falante individualmente.
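Um esboço hipotético de como um diálogo multi-falante pode ser montado com esse token (o template real de prompt pode diferir; consulte a documentação de inferência):

```python
# Formato hipotético de prompt multi-falante usando o token <|speaker:i|>
# (apenas ilustrativo; o template real está na documentação de inferência).
def build_dialogue(turns):
    """turns: lista de (speaker_id, texto) -> diálogo em uma única string."""
    return "".join(f"<|speaker:{sid}|>{text}" for sid, text in turns)


prompt = build_dialogue([
    (0, "Oi, tudo bem?"),
    (1, "Tudo ótimo! [laugh] E você?"),
    (0, "Também, obrigado."),
])
print(prompt.startswith("<|speaker:0|>"))  # True
```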
 
-### Geração de Diálogos em Múltiplos Turnos
+### Geração de Múltiplos Turnos
 
-Graças à expansão do contexto do modelo, nosso modelo agora pode usar as informações das partes anteriores do diálogo para melhorar a expressividade do conteúdo gerado subsequentemente, aumentando assim a naturalidade do conteúdo.
+Graças à extensão do contexto do modelo, nosso modelo agora pode usar informações anteriores para melhorar a expressividade e a naturalidade dos conteúdos gerados subsequentemente.
 
 ### Clonagem de Voz Rápida
 
-O Fish Audio S2 suporta clonagem de voz precisa usando amostras de referência curtas (geralmente de 10 a 30 segundos). O modelo pode capturar timbre, estilo de fala e tendência emocional, gerando vozes clonadas realistas e consistentes sem ajuste fino adicional.
+O Fish Audio S2 suporta clonagem de voz precisa usando uma pequena amostra de referência (tipicamente de 10 a 30 segundos). O modelo captura o timbre, o estilo de fala e as tendências emocionais, produzindo vozes clonadas realistas e consistentes sem ajuste fino adicional.
 Para usar o servidor SGLang, consulte https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md .
 
 ---
 
-## Agradecimentos
+## Créditos
 
 - [VITS2 (daniilrobnikov)](https://github.com/daniilrobnikov/vits2)
 - [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2)
@@ -131,7 +169,6 @@ Para usar o servidor SGLang, consulte https://github.com/sgl-project/sglang-omni
 - [Qwen3](https://github.com/QwenLM/Qwen3)
 
 ## Relatório Técnico
-
 ```bibtex
 @misc{fish-speech-v1.4,
       title={Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis},

+ 65 - 27
docs/zh/index.md

@@ -1,7 +1,7 @@
 <div align="center">
 <h1>Fish Speech</h1>
 
-[English](../README.md) | **简体中文** | [Portuguese](../README.pt-BR.md) | [日本語](../README.ja.md) | [한국어](../README.ko.md) | [العربية](../README.ar.md) <br>
+<p><a href="../en/">English</a> | <strong>简体中文</strong> | <a href="../pt/">Portuguese</a> | <a href="../ja/">日本語</a> | <a href="../ko/">한국어</a> | <a href="../ar/">العربية</a></p>
 
 <a href="https://www.producthunt.com/products/fish-speech?embed=true&utm_source=badge-top-post-badge&utm_medium=badge&utm_source=badge-fish&#0045;audio&#0045;s1" target="_blank"><img src="https://api.producthunt.com/widgets/embed-image/v1/top-post-badge.svg?post_id=1023740&theme=light&period=daily&t=1761164814710" alt="Fish&#0032;Audio&#0032;S1 - Expressive&#0032;Voice&#0032;Cloning&#0032;and&#0032;Text&#0045;to&#0045;Speech | Product Hunt" style="width: 250px; height: 54px;" width="250" height="54" /></a>
 <a href="https://trendshift.io/repositories/7014" target="_blank">
@@ -30,14 +30,14 @@
 </div>
 
 <div align="center">
-    <a target="_blank" href="https://huggingface.co/spaces/TTS-AGI/TTS-Arena-V2">
-      <img alt="TTS-Arena2 Score" src="https://img.shields.io/badge/TTS_Arena2-Rank_%231-gold?style=flat-square&logo=trophy&logoColor=white">
+    <a target="_blank" href="https://huggingface.co/fishaudio/s2">
+        <img alt="HuggingFace Model" src="https://img.shields.io/badge/🤗%20-models-orange"/>
     </a>
-    <a target="_blank" href="https://huggingface.co/spaces/fishaudio/fish-speech-1">
-        <img alt="Huggingface" src="https://img.shields.io/badge/🤗%20-space%20demo-yellow"/>
+    <a target="_blank" href="https://fish.audio/blog/fish-audio-open-sources-s2/">
+        <img alt="Fish Audio Blog" src="https://img.shields.io/badge/Blog-Fish_Audio_S2-1f7a8c?style=flat-square&logo=readme&logoColor=white"/>
     </a>
-    <a target="_blank" href="https://huggingface.co/fishaudio/openaudio-s1-mini">
-        <img alt="HuggingFace Model" src="https://img.shields.io/badge/🤗%20-models-orange"/>
+    <a target="_blank" href="https://github.com/fishaudio/fish-speech/blob/main/FishAudioS2TecReport.pdf">
+        <img alt="Paper | Technical Report" src="https://img.shields.io/badge/Paper-Tecnical_Report-b31b1b?style=flat-square"/>
     </a>
 </div>
 
@@ -47,51 +47,89 @@
 !!! warning "法律免责声明"
     我们不对代码库的任何非法使用承担责任。请参考您当地关于 DMCA 和其他相关法律的法规。
 
-## 从这里开始
+## 快速开始
+
+### 文档入口
 
-这里是 Fish Speech 的官方文档,请按照说明轻松入门。
+这里是 Fish Audio S2 的官方文档,请按照说明轻松入门。
 
-- [安装](install.md)
-- [命令行推理](inference.md)
-- [WebUI 推理](inference.md)
-- [服务端推理](server.md)
-- [Docker 部署](install.md)
+- [安装](https://speech.fish.audio/zh/install/)
+- [命令行推理](https://speech.fish.audio/zh/inference/)
+- [WebUI 推理](https://speech.fish.audio/zh/inference/)
+- [服务端推理](https://speech.fish.audio/zh/server/)
+- [Docker 部署](https://speech.fish.audio/zh/install/)
 
-!!! note
-    如需 SGLang Server,请参考 [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md)。
+!!! note
+    **如需使用 SGLang Server,请参考 [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md)。**
 
 ### LLM Agent 指南
 
-```text
+```
 请先阅读 https://speech.fish.audio/zh/install/ ,并按文档安装和配置 Fish Audio S2。
 ```
 
 ## Fish Audio S2
-**开源和闭源中最出色的文本转语音系统**
-
-Fish Audio S2 是由 [Fish Audio](https://fish.audio/) 开发的最新模型,旨在生成听起来自然、真实且情感丰富的语音——不机械、不平淡,也不局限于录音室风格的朗读。
+**在开源与闭源方案中都处于领先水平的文本转语音系统**
 
-Fish Audio S2 专注于日常对话,支持原生多说话人和多轮生成。同时支持指令控制
+Fish Audio S2 是由 [Fish Audio](https://fish.audio/) 开发的最新模型。S2 在约 50 种语言、超过 1000 万小时音频数据上完成训练,并结合强化学习对齐与双自回归架构,能够生成自然、真实且情感丰富的语音。
 
-S2 系列包含多个模型,开源模型为 S2-Pro,是该系列中性能最强的模型
+S2 支持通过自然语言标签(如 `[laugh]`、`[whispers]`、`[super happy]`)对韵律和情绪进行细粒度行内控制,同时原生支持多说话人和多轮生成
 
-请访问 [Fish Audio 网站](https://fish.audio/) 以获取实时体验。
+请访问 [Fish Audio 网站](https://fish.audio/) 体验在线演示,并阅读[博客文章](https://fish.audio/blog/fish-audio-open-sources-s2/)了解更多细节
 
 ### 模型变体
 
 | 模型 | 大小 | 可用性 | 描述 |
 |------|------|-------------|-------------|
-| S2-Pro | 4B 参数 | [huggingface](https://huggingface.co/fishaudio/s2-pro) | 功能齐全的旗舰模型,具有最高质量和稳定性 |
+| S2-Pro | 4B 参数 | [HuggingFace](https://huggingface.co/fishaudio/s2-pro) | 功能齐全的旗舰模型,具有最高质量和稳定性 |
 
-有关模型的更多详情,请参见技术报告。
+有关模型的更多详情,请参见[技术报告](https://github.com/fishaudio/fish-speech/blob/main/FishAudioS2TecReport.pdf)。
+
+## 基准测试结果
+
+| 基准 | Fish Audio S2 |
+|------|------|
+| Seed-TTS Eval — WER(中文) | **0.54%**(总体最佳) |
+| Seed-TTS Eval — WER(英文) | **0.99%**(总体最佳) |
+| Audio Turing Test(含指令) | **0.515** 后验均值 |
+| EmergentTTS-Eval — 胜率 | **81.88%**(总体最高) |
+| Fish Instruction Benchmark — TAR | **93.3%** |
+| Fish Instruction Benchmark — 质量 | **4.51 / 5.0** |
+| 多语言(MiniMax Testset)— 最佳 WER | **24** 种语言中的 **11** 种 |
+| 多语言(MiniMax Testset)— 最佳 SIM | **24** 种语言中的 **17** 种 |
+
+在 Seed-TTS Eval 上,S2 在所有已评估模型(包括闭源系统)中实现了最低 WER:Qwen3-TTS(0.77/1.24)、MiniMax Speech-02(0.99/1.90)、Seed-TTS(1.12/2.25)。在 Audio Turing Test 上,S2 的 0.515 相比 Seed-TTS(0.417)提升 24%,相比 MiniMax-Speech(0.387)提升 33%。在 EmergentTTS-Eval 中,S2 在副语言学(91.61% 胜率)、疑问句(84.41%)和句法复杂度(83.39%)等维度表现尤为突出。
 
 ## 亮点
 
 <img src="../assets/totalability.png" width=200%>
 
-### 自然语言控制
+### 通过自然语言进行细粒度行内控制
+
+Fish Audio S2 支持在文本中的特定词或短语位置直接嵌入自然语言指令,从而对语音生成进行局部控制。与依赖固定预设标签不同,S2 接受自由形式的文本描述,例如 `[whisper in small voice]`、`[professional broadcast tone]` 或 `[pitch up]`,实现词级别的开放式表达控制。
+
+### 双自回归架构(Dual-Autoregressive)
+
+S2 基于仅解码器 Transformer,并结合 RVQ 音频编解码器(10 个码本,约 21 Hz 帧率)。Dual-AR 架构将生成拆分为两个阶段:
+
+- **Slow AR** 沿时间轴运行,预测主语义码本。
+- **Fast AR** 在每个时间步生成剩余 9 个残差码本,用于重建细粒度声学细节。
+
+这种非对称设计(时间轴 4B 参数、深度轴 400M 参数)在保持音频保真度的同时,提高了推理效率。
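上面的两阶段流程可以用如下 Python 伪代码示意(`slow_ar`、`fast_ar` 及其中的 token 取值均为假设性示意,并非 S2 的真实实现):

```python
# Dual-AR 生成循环的示意性伪代码(slow_ar 与 fast_ar 均为假设的桩函数,
# 仅用于展示两阶段流程)。

NUM_CODEBOOKS = 10  # 1 个语义码本 + 9 个残差码本(RVQ)


def slow_ar(history):
    # 慢速阶段(时间轴):预测当前时间步的语义码本 token。
    # 实际实现为 ~4B 参数、以历史帧为条件的 transformer。
    return len(history) % 1024


def fast_ar(semantic_token):
    # 快速阶段(深度轴):为同一时间步生成剩余 9 个残差 token。
    # 实际实现为 ~400M 参数的深度轴模块。
    return [(semantic_token + k) % 1024 for k in range(1, NUM_CODEBOOKS)]


def generate(num_frames):
    frames = []
    for _ in range(num_frames):
        semantic = slow_ar(frames)
        residuals = fast_ar(semantic)
        frames.append([semantic] + residuals)
    return frames  # 每帧 10 个 token;约 21 帧对应 1 秒音频


frames = generate(21)
assert all(len(frame) == NUM_CODEBOOKS for frame in frames)
```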
+
+### 强化学习对齐
+
+S2 使用 Group Relative Policy Optimization(GRPO)进行后训练对齐。用于过滤和标注训练数据的同一批模型被直接复用为 RL 的奖励模型,从而避免了预训练数据分布与后训练目标之间的不匹配。奖励信号综合了语义准确性、指令遵循、声学偏好评分与音色相似度。
+
+### 基于 SGLang 的生产级流式推理
+
+由于 Dual-AR 架构在结构上与标准自回归 LLM 同构,S2 可以直接继承 SGLang 提供的 LLM 原生服务优化能力,包括连续批处理、分页 KV Cache、CUDA Graph Replay 与基于 RadixAttention 的前缀缓存。
+
+在单张 NVIDIA H200 GPU 上:
 
-Fish Audio S2 允许用户使用自然语言去控制每一句内容的表现,副语言信息,情绪以及更多语音特征,而不单单局限于使用简短的标签去模糊地控制模型的表现,这极大的提高了生成内容整体的质量。
+- **实时因子(RTF):** 0.195
+- **首音频延迟:** 约 100 ms
+- **吞吐:** 在 RTF 低于 0.5 的情况下达到 3,000+ acoustic tokens/s
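基于上述公开数字,可以粗略估算聚合容量(假设每秒音频对应 21 帧 × 10 码本 = 210 个 token,下面仅是近似计算):

```python
# 基于上文数字的粗略估算;假设 1 秒音频对应
# 21 帧 x 10 码本 = 210 个 acoustic token。
frame_rate_hz = 21
num_codebooks = 10
tokens_per_audio_second = frame_rate_hz * num_codebooks  # 210

throughput = 3000  # 单张 H200 上报告的 token/s
rtf = 0.195        # 生成耗时 / 音频时长

# 聚合吞吐可支撑的实时并发流数:
concurrent_streams = throughput / tokens_per_audio_second
print(round(concurrent_streams, 1))  # 14.3

# RTF 为 0.195 时,生成 10 秒音频约需:
print(round(rtf * 10, 2))  # 1.95 秒
```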
 
 ### 多语言支持
 

+ 158 - 11
docs/zh/install.md

@@ -1,14 +1,14 @@
 ## 系统要求
 
-- GPU 内存:24GB(推理)
+- GPU 显存:24GB(用于推理)
 - 系统:Linux、WSL
 
 ## 系统设置
 
-FishAudio S2支持多种安装方式,请选择最适合您开发环境的方法
+Fish Audio S2 支持多种安装方式。请选择最适合你当前开发环境的方案
 
-**先决条件**:安装用于音频处理的系统依赖项
-``` bash
+**前置依赖**:先安装音频处理所需的系统依赖
+```bash
 apt install portaudio19-dev libsox-dev ffmpeg
 ```
 
@@ -17,26 +17,173 @@ apt install portaudio19-dev libsox-dev ffmpeg
 ```bash
 conda create -n fish-speech python=3.12
 conda activate fish-speech
+
+# GPU 安装(选择 CUDA 版本:cu126、cu128、cu129)
+pip install -e .[cu129]
+
+# 仅 CPU 安装
+pip install -e .[cpu]
+
+# 默认安装(使用 PyTorch 默认索引)
 pip install -e .
-# 如果你没有安装上文的前两个依赖,这里会因为pyaudio无法安装而报错,可以考虑使用下面这一行指令。
-# conda install pyaudio 
-# 随后再次运行pip install -e .即可
+
+# 如果因 pyaudio 导致安装报错,可以先执行:
+# conda install pyaudio
+# 然后重新执行 pip install -e .
 ```
 
 ### UV
 
-UV 提供了更快的依赖解析和安装速度
+UV 可以更快地完成依赖解析与安装
 
 ```bash
-# GPU 安装 (选择您的 CUDA 版本: cu126, cu128, cu129)
+# GPU 安装(选择 CUDA 版本:cu126、cu128、cu129)
 uv sync --python 3.12 --extra cu129
 
 # 仅 CPU 安装
 uv sync --python 3.12 --extra cpu
 ```
 
+### Intel Arc XPU 支持
+
+如果你使用 Intel Arc GPU,可按以下方式安装 XPU 支持:
+
+```bash
+conda create -n fish-speech python=3.12
+conda activate fish-speech
+
+# 安装必需的 C++ 标准库
+conda install libstdcxx -c conda-forge
+
+# 安装支持 Intel XPU 的 PyTorch
+pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/xpu
+
+# 安装 Fish Speech
+pip install -e .
+```
+
+!!! warning
+    `compile` 选项暂不支持 Windows 和 macOS。若你希望启用 compile,请手动安装 Triton。
+
 ## Docker 设置
 
-Fish Audio系列模型提供了多种 Docker 部署选项以满足不同需求。您可以使用 Docker Hub 上的预构建镜像,通过 Docker Compose 在本地构建,或手动构建自定义镜像。
+Fish Audio S2 系列模型提供多种 Docker 部署方式,适配不同场景。你可以直接使用 Docker Hub 预构建镜像,也可以用 Docker Compose 本地构建,或手动构建自定义镜像。
+
+我们提供 WebUI 与 API Server 的 GPU(默认 CUDA 12.6)和 CPU 镜像。你可以直接使用 Docker Hub 镜像,也可以在本地构建。如果只想使用预构建镜像,请参考[推理指南](inference.md)。
+
+### 前置条件
+
+- 已安装 Docker 和 Docker Compose
+- (GPU 场景)已安装 NVIDIA Docker runtime
+- CUDA 推理建议至少 24GB 显存
+
+### 使用 Docker Compose
+
+如果你需要开发或自定义,推荐使用 Docker Compose 在本地构建并运行:
+
+```bash
+# 先克隆仓库
+git clone https://github.com/fishaudio/fish-speech.git
+cd fish-speech
+
+# 使用 CUDA 启动 WebUI
+docker compose --profile webui up
+
+# 启用 compile 优化启动 WebUI
+COMPILE=1 docker compose --profile webui up
+
+# 启动 API Server
+docker compose --profile server up
+
+# 启用 compile 优化启动 API Server
+COMPILE=1 docker compose --profile server up
+
+# 仅 CPU 部署
+BACKEND=cpu docker compose --profile webui up
+```
+
+#### Docker Compose 环境变量
+
+你可以通过环境变量定制部署参数:
+
+```bash
+# .env 文件示例
+BACKEND=cuda              # 或 cpu
+COMPILE=1                 # 启用 compile 优化
+GRADIO_PORT=7860          # WebUI 端口
+API_PORT=8080             # API Server 端口
+UV_VERSION=0.8.15         # UV 包管理器版本
+```
+
+命令执行后会自动构建镜像并启动容器。你可以通过 `http://localhost:7860` 访问 WebUI,通过 `http://localhost:8080` 访问 API Server。
+
+### 手动 Docker 构建
+
+如果你需要更细粒度的构建控制,可以手动构建:
+
+```bash
+# 构建支持 CUDA 的 WebUI 镜像
+docker build \
+    --platform linux/amd64 \
+    -f docker/Dockerfile \
+    --build-arg BACKEND=cuda \
+    --build-arg CUDA_VER=12.6.0 \
+    --build-arg UV_EXTRA=cu126 \
+    --target webui \
+    -t fish-speech-webui:cuda .
+
+# 构建支持 CUDA 的 API Server 镜像
+docker build \
+    --platform linux/amd64 \
+    -f docker/Dockerfile \
+    --build-arg BACKEND=cuda \
+    --build-arg CUDA_VER=12.6.0 \
+    --build-arg UV_EXTRA=cu126 \
+    --target server \
+    -t fish-speech-server:cuda .
+
+# 构建仅 CPU 镜像(支持多平台)
+docker build \
+    --platform linux/amd64,linux/arm64 \
+    -f docker/Dockerfile \
+    --build-arg BACKEND=cpu \
+    --target webui \
+    -t fish-speech-webui:cpu .
+
+# 构建开发镜像
+docker build \
+    --platform linux/amd64 \
+    -f docker/Dockerfile \
+    --build-arg BACKEND=cuda \
+    --target dev \
+    -t fish-speech-dev:cuda .
+```
+
+#### 构建参数
+
+- `BACKEND`:`cuda` 或 `cpu`(默认:`cuda`)
+- `CUDA_VER`:CUDA 版本(默认:`12.6.0`)
+- `UV_EXTRA`:UV 的 CUDA 扩展(默认:`cu126`)
+- `UBUNTU_VER`:Ubuntu 版本(默认:`24.04`)
+- `PY_VER`:Python 版本(默认:`3.12`)
+
+### 卷挂载
+
+两种方法都需要挂载以下目录:
+
+- `./checkpoints:/app/checkpoints` - 模型权重目录
+- `./references:/app/references` - 参考音频目录
+
+### 环境变量
+
+- `COMPILE=1` - 启用 `torch.compile`,可提升推理速度(约 10 倍)
+- `GRADIO_SERVER_NAME=0.0.0.0` - WebUI 服务地址
+- `GRADIO_SERVER_PORT=7860` - WebUI 服务端口
+- `API_SERVER_NAME=0.0.0.0` - API 服务地址
+- `API_SERVER_PORT=8080` - API 服务端口
+
+!!! note
+    Docker 容器默认从 `/app/checkpoints` 读取模型权重。启动容器前请先下载好所需权重。
 
-未完待续。
+!!! warning
+    GPU 支持需要 NVIDIA Docker runtime。若仅使用 CPU,请移除 `--gpus all` 并使用 CPU 镜像。

+ 3 - 3
mkdocs.yml

@@ -1,4 +1,4 @@
-site_name: OpenAudio
+site_name: Fish Audio
 site_description: Targeting SOTA TTS solutions.
 site_url: https://speech.fish.audio
 
@@ -12,7 +12,7 @@ copyright: Copyright &copy; 2023-2025 by Fish Audio
 
 theme:
   name: material
-  favicon: assets/openaudio.png
+  favicon: assets/logo.svg
   language: en
   features:
     - content.action.edit
@@ -25,7 +25,7 @@ theme:
     - search.highlight
     - search.share
     - content.code.copy
-  logo: assets/openaudio.png
+  logo: assets/logo.svg
 
   palette:
     # Palette toggle for automatic mode