
Fix several small bugs. (#1172)

* Fix compile bugs and report typos.

* Fix URL link.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
PoTaTo committed 1 month ago
Commit 2d37d218cb

BIN
FishAudioS2TecReport.pdf


+ 1 - 1
README.md

@@ -156,7 +156,7 @@ Thanks to the expansion of the model context, our model can now use previous inf
 ### Rapid Voice Cloning
 
 Fish Audio S2 supports accurate voice cloning using a short reference sample (typically 10–30 seconds). The model captures timbre, speaking style, and emotional tendencies, producing realistic and consistent cloned voices without additional fine-tuning.
-Please refer to https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md to use the SGLang server.
+Please refer to [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md) to use the SGLang server.
 ---
 
 ## Credits

+ 1 - 1
docs/README.ar.md

@@ -156,7 +156,7 @@ Fish Audio S2 هو أحدث نموذج من [Fish Audio](https://fish.audio/). 
 ### استنساخ صوت سريع
 
 يدعم Fish Audio S2 استنساخ الصوت بدقة باستخدام عينة مرجعية قصيرة (عادةً 10-30 ثانية). يلتقط النموذج نبرة الصوت، وأسلوب التحدث، والميول العاطفية، مما ينتج أصواتاً مستنسخة واقعية ومتسقة دون الحاجة إلى ضبط دقيق إضافي.
-لاستخدام خادم SGLang، راجع https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md .
+لاستخدام خادم SGLang، راجع [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md) .
 
 ---
 

+ 1 - 1
docs/README.ja.md

@@ -156,7 +156,7 @@ Fish Audio S2 では、ユーザーが複数のスピーカーを含む参照オ
 ### 高速音声クローニング
 
 Fish Audio S2 は、短い参照サンプル(通常10〜30秒)を使用した正確な音声クローニングをサポートしています。モデルは音色、話し方、感情的な傾向を捉え、追加の微調整なしでリアルで一貫したクローン音声を生成します。
-SGLang サーバーの利用については https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md を参照してください。
+SGLang サーバーの利用については [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md) を参照してください。
 
 ---
 

+ 1 - 1
docs/README.ko.md

@@ -156,7 +156,7 @@ Fish Audio S2는 사용자가 여러 화자가 포함된 참조 오디오를 업
 ### 빠른 음성 복제
 
 Fish Audio S2는 짧은 참조 샘플(일반적으로 10-30초)을 사용하여 정확한 음성 복제를 지원합니다. 모델은 음색, 말하기 스타일 및 감정적 경향을 캡처하여 추가 미세 조정 없이 사실적이고 일관된 복제 음성을 생성합니다.
-SGLang 서버 사용은 https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md 를 참고하세요.
+SGLang 서버 사용은 [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md) 를 참고하세요.
 
 ---
 

+ 1 - 1
docs/README.pt-BR.md

@@ -156,7 +156,7 @@ Graças à extensão do contexto do modelo, nosso modelo agora pode usar informa
 ### Clonagem de Voz Rápida
 
 O Fish Audio S2 suporta clonagem de voz precisa usando uma pequena amostra de referência (tipicamente de 10 a 30 segundos). O modelo captura o timbre, o estilo de fala e as tendências emocionais, produzindo vozes clonadas realistas e consistentes sem ajuste fino adicional.
-Para usar o servidor SGLang, consulte https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md .
+Para usar o servidor SGLang, consulte [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md) .
 
 ---
 

+ 1 - 1
docs/README.zh.md

@@ -157,7 +157,7 @@ Fish Audio S2 允许用户上传包含多个说话人的参考音频,模型将
 ### 快速语音克隆
 
 Fish Audio S2 支持使用短参考样本(通常为 10-30 秒)进行准确的语音克隆。模型可以捕捉音色、说话风格和情感倾向,无需额外微调即可生成逼真且一致的克隆语音。
-如需使用 SGLang Server,请参考 https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md 。
+如需使用 SGLang Server,请参考 [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md)
 
 ---
 

+ 1 - 1
docs/ar/index.md

@@ -154,7 +154,7 @@ Fish Audio S2 هو أحدث نموذج من [Fish Audio](https://fish.audio/). 
 ### استنساخ صوت سريع
 
 يدعم Fish Audio S2 استنساخ الصوت بدقة باستخدام عينة مرجعية قصيرة (عادةً 10-30 ثانية). يلتقط النموذج نبرة الصوت، وأسلوب التحدث، والميول العاطفية، مما ينتج أصواتاً مستنسخة واقعية ومتسقة دون الحاجة إلى ضبط دقيق إضافي.
-لاستخدام خادم SGLang، راجع https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md .
+لاستخدام خادم SGLang، راجع [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md) .
 
 ---
 

+ 1 - 1
docs/en/index.md

@@ -154,7 +154,7 @@ Thanks to the expansion of the model context, our model can now use previous inf
 ### Rapid Voice Cloning
 
 Fish Audio S2 supports accurate voice cloning using a short reference sample (typically 10–30 seconds). The model captures timbre, speaking style, and emotional tendencies, producing realistic and consistent cloned voices without additional fine-tuning.
-Please refer to https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md to use the SGLang server.
+Please refer to [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md) to use the SGLang server.
 ---
 
 ## Credits

+ 1 - 1
docs/ja/index.md

@@ -154,7 +154,7 @@ Fish Audio S2 では、ユーザーが複数のスピーカーを含む参照オ
 ### 高速音声クローニング
 
 Fish Audio S2 は、短い参照サンプル(通常10〜30秒)を使用した正確な音声クローニングをサポートしています。モデルは音色、話し方、感情的な傾向を捉え、追加の微調整なしでリアルで一貫したクローン音声を生成します。
-SGLang サーバーの利用については https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md を参照してください。
+SGLang サーバーの利用については [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md) を参照してください。
 
 ---
 

+ 1 - 1
docs/ko/index.md

@@ -154,7 +154,7 @@ Fish Audio S2는 사용자가 여러 화자가 포함된 참조 오디오를 업
 ### 빠른 음성 복제
 
 Fish Audio S2는 짧은 참조 샘플(일반적으로 10-30초)을 사용하여 정확한 음성 복제를 지원합니다. 모델은 음색, 말하기 스타일 및 감정적 경향을 캡처하여 추가 미세 조정 없이 사실적이고 일관된 복제 음성을 생성합니다.
-SGLang 서버 사용은 https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md 를 참고하세요.
+SGLang 서버 사용은 [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md) 를 참고하세요.
 
 ---
 

+ 1 - 1
docs/pt/index.md

@@ -154,7 +154,7 @@ Graças à extensão do contexto do modelo, nosso modelo agora pode usar informa
 ### Clonagem de Voz Rápida
 
 O Fish Audio S2 suporta clonagem de voz precisa usando uma pequena amostra de referência (tipicamente de 10 a 30 segundos). O modelo captura o timbre, o estilo de fala e as tendências emocionais, produzindo vozes clonadas realistas e consistentes sem ajuste fino adicional.
-Para usar o servidor SGLang, consulte https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md .
+Para usar o servidor SGLang, consulte [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md) .
 
 ---
 

+ 1 - 1
docs/zh/index.md

@@ -154,7 +154,7 @@ Fish Audio S2 允许用户上传包含多个说话人的参考音频,模型将
 ### 快速语音克隆
 
 Fish Audio S2 支持使用短参考样本(通常为 10-30 秒)进行准确的语音克隆。模型可以捕捉音色、说话风格和情感倾向,无需额外微调即可生成逼真且一致的克隆语音。
-如需使用 SGLang Server,请参考 https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md 。
+如需使用 SGLang Server,请参考 [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md)
 
 ---
 

+ 16 - 21
fish_speech/models/text2semantic/inference.py

@@ -40,10 +40,9 @@ from fish_speech.models.text2semantic.llama import (
 )
 
 
-def multinomial_sample_one_no_sync(
-    probs_sort,
-):  # Does multinomial sampling without a cuda synchronization
-    q = torch.empty_like(probs_sort).exponential_(1)
+def multinomial_sample_one_no_sync(probs_sort):
+    q = torch.rand_like(probs_sort)
+    q = -torch.log(q)
     return torch.argmax(probs_sort / q, dim=-1, keepdim=True).to(dtype=torch.int)
 
 
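The rewritten sampler draws its Exponential(1) noise as `-log(U)` for uniform `U` instead of `exponential_(1)`; the two are equivalent, and dividing the probabilities by exponential noise and taking the argmax samples an index proportionally to its probability. A minimal pure-Python sketch of that trick, with no torch dependency (the function name is mine):

```python
import random

def sample_via_exponential_race(probs, rng):
    # q_i ~ Exponential(1); the index maximizing probs[i] / q_i is
    # distributed according to probs, because q_i / probs[i] is
    # Exponential(rate=probs[i]) and the smallest of independent
    # exponentials falls on index i with probability probs[i] / sum(probs).
    qs = [rng.expovariate(1.0) for _ in probs]
    return max(range(len(probs)), key=lambda i: probs[i] / qs[i])

# Empirical check that the trick matches the target distribution.
rng = random.Random(0)
probs = [0.5, 0.3, 0.2]
n = 20000
counts = [0] * len(probs)
for _ in range(n):
    counts[sample_via_exponential_race(probs, rng)] += 1
freqs = [c / n for c in counts]
```

The point of doing it this way in the original code is that no data-dependent CUDA synchronization is needed, unlike `torch.multinomial`.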
@@ -56,19 +55,22 @@ def logits_to_probs(
     logits,
     temperature: torch.Tensor,
     top_p: torch.Tensor,
-    top_k: torch.Tensor,
+    top_k: int,  # Note: this is passed in as an int, which is crucial here
 ) -> torch.Tensor:
-    # Sort and compute top-p mask
     sorted_logits, sorted_indices = torch.sort(logits, descending=True)
     cum_probs = torch.cumsum(torch.nn.functional.softmax(sorted_logits, dim=-1), dim=-1)
-    sorted_indices_to_remove = cum_probs > top_p
-    # top-k mask
-    sorted_indices_to_remove[top_k:] = True
-    sorted_indices_to_remove[0] = False  # keep at least one option
+
+    indices = torch.arange(sorted_logits.shape[-1], device=sorted_logits.device)
+    top_k_mask = indices >= top_k
+    sorted_indices_to_remove = (cum_probs > top_p) | top_k_mask
+    sorted_indices_to_remove[0] = False  # a single-element write is fine here, or write it as | (indices != 0)
+
     indices_to_remove = sorted_indices_to_remove.scatter(
         dim=-1, index=sorted_indices, src=sorted_indices_to_remove
     )
-    logits = logits.masked_fill(indices_to_remove, -float("Inf"))
+    logits = torch.where(
+        indices_to_remove, float("-Inf"), logits
+    )  # likewise, replace masked_fill_ with torch.where
     logits = logits / torch.clip(temperature, min=1e-5)
 
     probs = torch.nn.functional.softmax(logits, dim=-1)
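The new mask combines the top-p cutoff and the integer top-k bound into one boolean expression rather than in-place slice writes, which compile-friendly code prefers. A pure-Python sketch of the keep/remove logic on an already-sorted (descending) distribution, mirroring the diff's semantics (the function name is mine):

```python
def top_p_top_k_keep(sorted_probs, top_p, top_k):
    # Mirror of the diff's combined mask: remove a token once the
    # cumulative probability exceeds top_p OR its rank reaches top_k,
    # but always keep the rank-0 (highest-probability) token.
    keep, cum = [], 0.0
    for rank, p in enumerate(sorted_probs):
        cum += p
        remove = (cum > top_p) or (rank >= top_k)
        keep.append(rank == 0 or not remove)
    return keep
```

Note that, as in the diff, the token whose cumulative probability first crosses `top_p` is itself removed; some samplers instead keep that boundary token.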
@@ -143,19 +145,12 @@ def decode_one_token_ar(
 
     codebooks = [main_token_normal]
 
-    # Only clear cache for fast_layers, avoid clearing main model cache
-    for layer in model.fast_layers:
-        if hasattr(layer, "attention") and hasattr(layer.attention, "kv_cache"):
-            layer.attention.kv_cache.k_cache.fill_(0)
-            layer.attention.kv_cache.v_cache.fill_(0)
-
     input_pos = torch.tensor([0], device=hidden_states.device, dtype=torch.long)
     model.forward_generate_fast(hidden_states, input_pos)
 
-    # [MODIFIED] Access config instead of tokenizer
     a = codebooks[0] - model.config.semantic_begin_id
-    a[a < 0] = 0
-    a[a >= model.config.codebook_size] = 0
+    a = torch.clamp(a, min=0, max=model.config.codebook_size - 1)
+
     hidden_states = model.fast_embeddings(a)
     codebooks.append(a)
 
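`torch.clamp` replaces the two in-place boolean-mask writes; one subtlety is that it pins ids at or above `codebook_size` to `codebook_size - 1`, whereas the removed lines reset them to 0. A pure-Python sketch of the new behavior (the function name is mine):

```python
def clamp_ids(ids, codebook_size):
    # Equivalent of torch.clamp(a, min=0, max=codebook_size - 1):
    # negative ids become 0, and ids >= codebook_size become
    # codebook_size - 1 (the removed masking set both cases to 0).
    return [min(max(i, 0), codebook_size - 1) for i in ids]
```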
@@ -390,7 +385,7 @@ def init_model(checkpoint_path, device, precision, compile=False):
         decode_one_token = torch.compile(
             decode_one_token,
             backend="inductor" if torch.cuda.is_available() else "aot_eager",
-            mode="reduce-overhead" if torch.cuda.is_available() else None,
+            mode="default" if torch.cuda.is_available() else None,
             fullgraph=True,
         )