
Fix several small bugs. (#1172)

* Fix compile bugs and report typos.

* Fix URL link.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
PoTaTo 1 month ago
parent
commit
2d37d218cb

binary
FishAudioS2TecReport.pdf


+ 1 - 1
README.md

@@ -156,7 +156,7 @@ Thanks to the expansion of the model context, our model can now use previous inf
 ### Rapid Voice Cloning
 
 Fish Audio S2 supports accurate voice cloning using a short reference sample (typically 10–30 seconds). The model captures timbre, speaking style, and emotional tendencies, producing realistic and consistent cloned voices without additional fine-tuning.
-Please refer to https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md to use the SGLang server.
+Please refer to [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md) to use the SGLang server.
 ---
 
 ## Credits

+ 1 - 1
docs/README.ar.md

@@ -156,7 +156,7 @@ Fish Audio S2 هو أحدث نموذج من [Fish Audio](https://fish.audio/). 
 ### استنساخ صوت سريع
 
 يدعم Fish Audio S2 استنساخ الصوت بدقة باستخدام عينة مرجعية قصيرة (عادةً 10-30 ثانية). يلتقط النموذج نبرة الصوت، وأسلوب التحدث، والميول العاطفية، مما ينتج أصواتاً مستنسخة واقعية ومتسقة دون الحاجة إلى ضبط دقيق إضافي.
-لاستخدام خادم SGLang، راجع https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md .
+لاستخدام خادم SGLang، راجع [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md) .
 
 ---
 

+ 1 - 1
docs/README.ja.md

@@ -156,7 +156,7 @@ Fish Audio S2 では、ユーザーが複数のスピーカーを含む参照オ
 ### 高速音声クローニング
 
 Fish Audio S2 は、短い参照サンプル(通常10〜30秒)を使用した正確な音声クローニングをサポートしています。モデルは音色、話し方、感情的な傾向を捉え、追加の微調整なしでリアルで一貫したクローン音声を生成します。
-SGLang サーバーの利用については https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md を参照してください。
+SGLang サーバーの利用については [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md) を参照してください。
 
 ---
 

+ 1 - 1
docs/README.ko.md

@@ -156,7 +156,7 @@ Fish Audio S2는 사용자가 여러 화자가 포함된 참조 오디오를 업
 ### 빠른 음성 복제
 
 Fish Audio S2는 짧은 참조 샘플(일반적으로 10-30초)을 사용하여 정확한 음성 복제를 지원합니다. 모델은 음색, 말하기 스타일 및 감정적 경향을 캡처하여 추가 미세 조정 없이 사실적이고 일관된 복제 음성을 생성합니다.
-SGLang 서버 사용은 https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md 를 참고하세요.
+SGLang 서버 사용은 [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md) 를 참고하세요.
 
 ---
 

+ 1 - 1
docs/README.pt-BR.md

@@ -156,7 +156,7 @@ Graças à extensão do contexto do modelo, nosso modelo agora pode usar informa
 ### Clonagem de Voz Rápida
 
 O Fish Audio S2 suporta clonagem de voz precisa usando uma pequena amostra de referência (tipicamente de 10 a 30 segundos). O modelo captura o timbre, o estilo de fala e as tendências emocionais, produzindo vozes clonadas realistas e consistentes sem ajuste fino adicional.
-Para usar o servidor SGLang, consulte https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md .
+Para usar o servidor SGLang, consulte [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md) .
 
 ---
 

+ 1 - 1
docs/README.zh.md

@@ -157,7 +157,7 @@ Fish Audio S2 允许用户上传包含多个说话人的参考音频,模型将
 ### 快速语音克隆
 
 Fish Audio S2 支持使用短参考样本(通常为 10-30 秒)进行准确的语音克隆。模型可以捕捉音色、说话风格和情感倾向,无需额外微调即可生成逼真且一致的克隆语音。
-如需使用 SGLang Server,请参考 https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md 。
+如需使用 SGLang Server,请参考 [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md)
 
 ---
 

+ 1 - 1
docs/ar/index.md

@@ -154,7 +154,7 @@ Fish Audio S2 هو أحدث نموذج من [Fish Audio](https://fish.audio/). 
 ### استنساخ صوت سريع
 
 يدعم Fish Audio S2 استنساخ الصوت بدقة باستخدام عينة مرجعية قصيرة (عادةً 10-30 ثانية). يلتقط النموذج نبرة الصوت، وأسلوب التحدث، والميول العاطفية، مما ينتج أصواتاً مستنسخة واقعية ومتسقة دون الحاجة إلى ضبط دقيق إضافي.
-لاستخدام خادم SGLang، راجع https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md .
+لاستخدام خادم SGLang، راجع [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md) .
 
 ---
 

+ 1 - 1
docs/en/index.md

@@ -154,7 +154,7 @@ Thanks to the expansion of the model context, our model can now use previous inf
 ### Rapid Voice Cloning
 
 Fish Audio S2 supports accurate voice cloning using a short reference sample (typically 10–30 seconds). The model captures timbre, speaking style, and emotional tendencies, producing realistic and consistent cloned voices without additional fine-tuning.
-Please refer to https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md to use the SGLang server.
+Please refer to [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md) to use the SGLang server.
 ---
 
 ## Credits

+ 1 - 1
docs/ja/index.md

@@ -154,7 +154,7 @@ Fish Audio S2 では、ユーザーが複数のスピーカーを含む参照オ
 ### 高速音声クローニング
 
 Fish Audio S2 は、短い参照サンプル(通常10〜30秒)を使用した正確な音声クローニングをサポートしています。モデルは音色、話し方、感情的な傾向を捉え、追加の微調整なしでリアルで一貫したクローン音声を生成します。
-SGLang サーバーの利用については https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md を参照してください。
+SGLang サーバーの利用については [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md) を参照してください。
 
 ---
 

+ 1 - 1
docs/ko/index.md

@@ -154,7 +154,7 @@ Fish Audio S2는 사용자가 여러 화자가 포함된 참조 오디오를 업
 ### 빠른 음성 복제
 
 Fish Audio S2는 짧은 참조 샘플(일반적으로 10-30초)을 사용하여 정확한 음성 복제를 지원합니다. 모델은 음색, 말하기 스타일 및 감정적 경향을 캡처하여 추가 미세 조정 없이 사실적이고 일관된 복제 음성을 생성합니다.
-SGLang 서버 사용은 https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md 를 참고하세요.
+SGLang 서버 사용은 [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md) 를 참고하세요.
 
 ---
 

+ 1 - 1
docs/pt/index.md

@@ -154,7 +154,7 @@ Graças à extensão do contexto do modelo, nosso modelo agora pode usar informa
 ### Clonagem de Voz Rápida
 
 O Fish Audio S2 suporta clonagem de voz precisa usando uma pequena amostra de referência (tipicamente de 10 a 30 segundos). O modelo captura o timbre, o estilo de fala e as tendências emocionais, produzindo vozes clonadas realistas e consistentes sem ajuste fino adicional.
-Para usar o servidor SGLang, consulte https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md .
+Para usar o servidor SGLang, consulte [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md) .
 
 ---
 

+ 1 - 1
docs/zh/index.md

@@ -154,7 +154,7 @@ Fish Audio S2 允许用户上传包含多个说话人的参考音频,模型将
 ### 快速语音克隆
 
 Fish Audio S2 支持使用短参考样本(通常为 10-30 秒)进行准确的语音克隆。模型可以捕捉音色、说话风格和情感倾向,无需额外微调即可生成逼真且一致的克隆语音。
-如需使用 SGLang Server,请参考 https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md 。
+如需使用 SGLang Server,请参考 [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md)
 
 ---
 

+ 16 - 21
fish_speech/models/text2semantic/inference.py

@@ -40,10 +40,9 @@ from fish_speech.models.text2semantic.llama import (
 )
 
 
-def multinomial_sample_one_no_sync(
-    probs_sort,
-):  # Does multinomial sampling without a cuda synchronization
-    q = torch.empty_like(probs_sort).exponential_(1)
+def multinomial_sample_one_no_sync(probs_sort):
+    q = torch.rand_like(probs_sort)
+    q = -torch.log(q)
     return torch.argmax(probs_sort / q, dim=-1, keepdim=True).to(dtype=torch.int)
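The rewritten sampler above uses the exponential-race (Gumbel-style) trick: `argmax(p_i / E_i)` with `E_i ~ Exp(1)` selects index `i` with probability proportional to `p_i`, without the host-device sync that `torch.multinomial` incurs. A minimal standalone sketch — note the `clamp_min` guarding against `-log(0)`, which is an addition of this sketch, not part of the patch:

```python
import torch

def sample_no_sync(probs: torch.Tensor) -> torch.Tensor:
    # -log(U) with U ~ Uniform(0, 1) is an Exp(1) draw; clamping U away
    # from 0 avoids -log(0) = inf, which the rand-based rewrite can hit.
    u = torch.rand_like(probs).clamp_min(1e-20)
    q = -torch.log(u)
    return torch.argmax(probs / q, dim=-1, keepdim=True).to(dtype=torch.int)

# Empirical sanity check: draw 10,000 samples from a fixed distribution.
torch.manual_seed(0)
probs = torch.tensor([0.1, 0.2, 0.7]).expand(10000, 3)
draws = sample_no_sync(probs)
counts = torch.bincount(draws.flatten().to(torch.int64), minlength=3)
```

With enough draws the empirical frequencies track the target probabilities, which is the usual quick check for this sampler.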
 
 
@@ -56,19 +55,22 @@ def logits_to_probs(
     logits,
     temperature: torch.Tensor,
     top_p: torch.Tensor,
-    top_k: torch.Tensor,
+    top_k: int,  # note: this is passed in as a plain int, which is critical here
 ) -> torch.Tensor:
-    # Sort and compute top-p mask
     sorted_logits, sorted_indices = torch.sort(logits, descending=True)
     cum_probs = torch.cumsum(torch.nn.functional.softmax(sorted_logits, dim=-1), dim=-1)
-    sorted_indices_to_remove = cum_probs > top_p
-    # top-k mask
-    sorted_indices_to_remove[top_k:] = True
-    sorted_indices_to_remove[0] = False  # keep at least one option
+
+    indices = torch.arange(sorted_logits.shape[-1], device=sorted_logits.device)
+    top_k_mask = indices >= top_k
+    sorted_indices_to_remove = (cum_probs > top_p) | top_k_mask
+    sorted_indices_to_remove[0] = False  # a single-element write is fine; alternatively fold it into the mask as | (indices != 0)
+
     indices_to_remove = sorted_indices_to_remove.scatter(
         dim=-1, index=sorted_indices, src=sorted_indices_to_remove
     )
-    logits = logits.masked_fill(indices_to_remove, -float("Inf"))
+    logits = torch.where(
+        indices_to_remove, float("-Inf"), logits
+    )  # likewise, masked_fill_ replaced with torch.where
     logits = logits / torch.clip(temperature, min=1e-5)
 
     probs = torch.nn.functional.softmax(logits, dim=-1)
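The combined top-k / top-p mask in the hunk above is easier to follow as a standalone function. This is an illustrative re-implementation, not the project's code; the default parameter values are arbitrary:

```python
import torch

def filter_probs(logits, temperature=1.0, top_p=0.9, top_k=2):
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    cum_probs = torch.cumsum(torch.softmax(sorted_logits, dim=-1), dim=-1)
    positions = torch.arange(sorted_logits.shape[-1])
    # One boolean mask covers both criteria: cumulative mass past top_p,
    # or rank at/after top_k in the sorted order.
    remove_sorted = (cum_probs > top_p) | (positions >= top_k)
    remove_sorted[0] = False  # always keep the most likely token
    # Map the mask from sorted order back to the original token order.
    remove = remove_sorted.scatter(dim=-1, index=sorted_indices, src=remove_sorted)
    logits = torch.where(remove, torch.full_like(logits, float("-inf")), logits)
    return torch.softmax(logits / max(temperature, 1e-5), dim=-1)

probs = filter_probs(torch.tensor([2.0, 1.0, 0.5, 0.1]))
# with top_k=2, only the two highest-logit tokens keep nonzero probability
```

Using `torch.where` instead of in-place `masked_fill_` keeps the op functional, which tends to trace more cleanly under `torch.compile`.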
@@ -143,19 +145,12 @@ def decode_one_token_ar(
 
     codebooks = [main_token_normal]
 
-    # Only clear cache for fast_layers, avoid clearing main model cache
-    for layer in model.fast_layers:
-        if hasattr(layer, "attention") and hasattr(layer.attention, "kv_cache"):
-            layer.attention.kv_cache.k_cache.fill_(0)
-            layer.attention.kv_cache.v_cache.fill_(0)
-
     input_pos = torch.tensor([0], device=hidden_states.device, dtype=torch.long)
     model.forward_generate_fast(hidden_states, input_pos)
 
-    # [MODIFIED] Access config instead of tokenizer
     a = codebooks[0] - model.config.semantic_begin_id
-    a[a < 0] = 0
-    a[a >= model.config.codebook_size] = 0
+    a = torch.clamp(a, min=0, max=model.config.codebook_size - 1)
+
     hidden_states = model.fast_embeddings(a)
     codebooks.append(a)
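One behavioral note on the clamp rewrite above: the removed boolean-mask assignments sent out-of-range ids to 0, while `torch.clamp` saturates them at `codebook_size - 1`. A toy comparison of the two (`codebook_size = 4` is an illustrative value, not the model's real setting):

```python
import torch

codebook_size = 4  # illustrative, not the model's actual codebook size
a = torch.tensor([-2, 0, 3, 7])

# Old behavior: two mask assignments, both sending out-of-range ids to 0.
masked = a.clone()
masked[masked < 0] = 0
masked[masked >= codebook_size] = 0

# New behavior: clamp saturates high ids at codebook_size - 1 instead,
# and avoids the data-dependent boolean indexing that hinders torch.compile.
clamped = torch.clamp(a, min=0, max=codebook_size - 1)
```

Both variants keep ids in range; they only disagree on where overflowing ids land.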
 
@@ -390,7 +385,7 @@ def init_model(checkpoint_path, device, precision, compile=False):
         decode_one_token = torch.compile(
             decode_one_token,
             backend="inductor" if torch.cuda.is_available() else "aot_eager",
-            mode="reduce-overhead" if torch.cuda.is_available() else None,
+            mode="default" if torch.cuda.is_available() else None,
             fullgraph=True,
         )