Finish inference.md (#987)

* Finish inference.md

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix index not found error.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
PoTaTo 10 months ago
Parent
Commit
b8369249e3

+ 108 - 0
docs/en/index.md

@@ -0,0 +1,108 @@
+# Inference
+
+Because the vocoder model has changed, inference now needs more VRAM than before; 12 GB is recommended for smooth inference.
+
+We support command-line, HTTP API, and WebUI inference; choose whichever method you prefer.
+
+## Download Weights
+
+First you need to download the model weights:
+
+```bash
+huggingface-cli download fishaudio/openaudio-s1-mini --local-dir checkpoints/openaudio-s1-mini
+```
+
+## Command Line Inference
+
+!!! note
+    If you plan to let the model randomly choose a voice timbre, you can skip this step.
+
+### 1. Get VQ tokens from reference audio
+
+```bash
+python fish_speech/models/dac/inference.py \
+    -i "ref_audio_name.wav" \
+    --checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth"
+```
+
+You should get a `fake.npy` and a `fake.wav`.
+
+### 2. Generate semantic tokens from text:
+
+```bash
+python fish_speech/models/text2semantic/inference.py \
+    --text "The text you want to convert" \
+    --prompt-text "Your reference text" \
+    --prompt-tokens "fake.npy" \
+    --checkpoint-path "checkpoints/openaudio-s1-mini" \
+    --num-samples 2 \
+    --compile # if you want a faster speed
+```
+
+This command will create a `codes_N` file in the working directory, where N is an integer starting from 0.
+
+!!! note
+    You may want to use `--compile` to fuse CUDA kernels for faster inference (~30 tokens/second -> ~500 tokens/second).
+    Conversely, if you do not need the acceleration, simply omit the `--compile` flag.
+
+!!! info
+    For GPUs that do not support bf16, you may need to use the `--half` parameter.
+
+### 3. Generate vocals from semantic tokens:
+
+#### VQGAN Decoder
+
+!!! warning "Future Warning"
+    We have kept the interface accessible from the original path (tools/vqgan/inference.py), but this interface may be removed in subsequent releases, so please change your code as soon as possible.
+
+```bash
+python fish_speech/models/dac/inference.py \
+    -i "codes_0.npy" \
+    --checkpoint-path "checkpoints/openaudiio-s1-mini/codec.pth"
+```
+
+## HTTP API Inference
+
+We provide an HTTP API for inference. You can use the following command to start the server:
+
+```bash
+python -m tools.api_server \
+    --listen 0.0.0.0:8080 \
+    --llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
+    --decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
+    --decoder-config-name modded_dac_vq
+```
+
+> If you want to speed up inference, you can add the `--compile` parameter.
+
+After that, you can view and test the API at http://127.0.0.1:8080/.
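Once the server is up, you can call it from a script. The sketch below only builds the request; the `/v1/tts` route and the JSON payload shape are assumptions for illustration — confirm the actual schema against the API docs served at the URL above.

```python
import json
import urllib.request


def build_tts_request(text: str, base_url: str = "http://127.0.0.1:8080") -> urllib.request.Request:
    # NOTE: the /v1/tts route and the payload fields are assumptions;
    # check the live API docs for the real schema before using this.
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/tts",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending the request (requires the server to be running):
# with urllib.request.urlopen(build_tts_request("Hello!")) as resp:
#     audio_bytes = resp.read()
```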
+
+## GUI Inference
+
+[Download client](https://github.com/AnyaCoder/fish-speech-gui/releases)
+
+## WebUI Inference
+
+You can start the WebUI using the following command:
+
+```bash
+python -m tools.run_webui \
+    --llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
+    --decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
+    --decoder-config-name modded_dac_vq
+```
+
+Or simply
+
+```bash
+python -m tools.run_webui
+```
+> If you want to speed up inference, you can add the `--compile` parameter.
+
+
+!!! note
+    You can save the label file and reference audio file in advance to the `references` folder in the main directory (which you need to create yourself), so that you can directly call them in the WebUI.
+
+!!! note
+    You can use Gradio environment variables, such as `GRADIO_SHARE`, `GRADIO_SERVER_PORT`, `GRADIO_SERVER_NAME` to configure WebUI.
+
+Enjoy!

+ 0 - 0
docs/en/inference.md


+ 107 - 0
docs/ja/index.md

@@ -0,0 +1,107 @@
+# 推論
+
+ボコーダーモデルが変更されたため、以前よりも多くのVRAMが必要です。スムーズな推論には12GBを推奨します。
+
+推論には、コマンドライン、HTTP API、WebUIをサポートしており、お好きな方法を選択できます。
+
+## 重みのダウンロード
+
+まず、モデルの重みをダウンロードする必要があります:
+
+```bash
+huggingface-cli download fishaudio/openaudio-s1-mini --local-dir checkpoints/openaudio-s1-mini
+```
+
+## コマンドライン推論
+
+!!! note
+    モデルにランダムに音色を選択させる場合は、この手順をスキップできます。
+
+### 1. 参照音声からVQトークンを取得
+
+```bash
+python fish_speech/models/dac/inference.py \
+    -i "ref_audio_name.wav" \
+    --checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth"
+```
+
+`fake.npy` と `fake.wav` が得られるはずです。
+
+### 2. テキストからセマンティックトークンを生成:
+
+```bash
+python fish_speech/models/text2semantic/inference.py \
+    --text "変換したいテキスト" \
+    --prompt-text "参照テキスト" \
+    --prompt-tokens "fake.npy" \
+    --checkpoint-path "checkpoints/openaudio-s1-mini" \
+    --num-samples 2 \
+    --compile # より高速化を求める場合
+```
+
+このコマンドは、作業ディレクトリに `codes_N` ファイルを作成します(Nは0から始まる整数)。
+
+!!! note
+    より高速な推論のために `--compile` を使用してCUDAカーネルを融合することができます(約30トークン/秒 -> 約500トークン/秒)。
+    対応して、加速を使用しない場合は、`--compile` パラメータをコメントアウトできます。
+
+!!! info
+    bf16をサポートしないGPUの場合、`--half` パラメータの使用が必要かもしれません。
+
+### 3. セマンティックトークンから音声を生成:
+
+#### VQGANデコーダー
+
+!!! warning "将来の警告"
+    元のパス(tools/vqgan/inference.py)からアクセス可能なインターフェースを維持していますが、このインターフェースは後続のリリースで削除される可能性があるため、できるだけ早くコードを変更してください。
+
+```bash
+python fish_speech/models/dac/inference.py \
+    -i "codes_0.npy" \
+    --checkpoint-path "checkpoints/openaudiio-s1-mini/codec.pth"
+```
+
+## HTTP API推論
+
+推論用のHTTP APIを提供しています。以下のコマンドでサーバーを開始できます:
+
+```bash
+python -m tools.api_server \
+    --listen 0.0.0.0:8080 \
+    --llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
+    --decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
+    --decoder-config-name modded_dac_vq
+```
+
+> 推論を高速化したい場合は、`--compile` パラメータを追加できます。
+
+その後、http://127.0.0.1:8080/ でAPIを表示・テストできます。
+
+## GUI推論 
+[クライアントをダウンロード](https://github.com/AnyaCoder/fish-speech-gui/releases)
+
+## WebUI推論
+
+以下のコマンドでWebUIを開始できます:
+
+```bash
+python -m tools.run_webui \
+    --llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
+    --decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
+    --decoder-config-name modded_dac_vq
+```
+
+または単純に
+
+```bash
+python -m tools.run_webui
+```
+> 推論を高速化したい場合は、`--compile` パラメータを追加できます。
+
+!!! note
+    ラベルファイルと参照音声ファイルをメインディレクトリの `references` フォルダに事前に保存することができます(自分で作成する必要があります)。これにより、WebUIで直接呼び出すことができます。
+
+!!! note
+    `GRADIO_SHARE`、`GRADIO_SERVER_PORT`、`GRADIO_SERVER_NAME` などのGradio環境変数を使用してWebUIを設定できます。
+
+お楽しみください!

+ 0 - 0
docs/ja/inference.md


+ 107 - 0
docs/ko/index.md

@@ -0,0 +1,107 @@
+# 추론
+
+보코더 모델이 변경되어 이전보다 더 많은 VRAM이 필요하며, 원활한 추론을 위해 12GB를 권장합니다.
+
+추론을 위해 명령줄, HTTP API, WebUI를 지원하며, 원하는 방법을 선택할 수 있습니다.
+
+## 가중치 다운로드
+
+먼저 모델 가중치를 다운로드해야 합니다:
+
+```bash
+huggingface-cli download fishaudio/openaudio-s1-mini --local-dir checkpoints/openaudio-s1-mini
+```
+
+## 명령줄 추론
+
+!!! note
+    모델이 임의로 음색을 선택하도록 하려면 이 단계를 건너뛸 수 있습니다.
+
+### 1. 참조 오디오에서 VQ 토큰 얻기
+
+```bash
+python fish_speech/models/dac/inference.py \
+    -i "ref_audio_name.wav" \
+    --checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth"
+```
+
+`fake.npy`와 `fake.wav`를 얻을 수 있습니다.
+
+### 2. 텍스트에서 의미 토큰 생성:
+
+```bash
+python fish_speech/models/text2semantic/inference.py \
+    --text "변환하고 싶은 텍스트" \
+    --prompt-text "참조 텍스트" \
+    --prompt-tokens "fake.npy" \
+    --checkpoint-path "checkpoints/openaudio-s1-mini" \
+    --num-samples 2 \
+    --compile # 더 빠른 속도를 원한다면
+```
+
+이 명령은 작업 디렉토리에 `codes_N` 파일을 생성합니다. 여기서 N은 0부터 시작하는 정수입니다.
+
+!!! note
+    더 빠른 추론을 위해 `--compile`을 사용하여 CUDA 커널을 융합할 수 있습니다(약 30 토큰/초 -> 약 500 토큰/초).
+    이에 따라 가속을 사용하지 않으려면 `--compile` 매개변수를 주석 처리할 수 있습니다.
+
+!!! info
+    bf16을 지원하지 않는 GPU의 경우 `--half` 매개변수를 사용해야 할 수 있습니다.
+
+### 3. 의미 토큰에서 음성 생성:
+
+#### VQGAN 디코더
+
+!!! warning "향후 경고"
+    원래 경로(tools/vqgan/inference.py)에서 액세스 가능한 인터페이스를 유지하고 있지만, 이 인터페이스는 향후 릴리스에서 제거될 수 있으므로 가능한 한 빨리 코드를 변경해 주세요.
+
+```bash
+python fish_speech/models/dac/inference.py \
+    -i "codes_0.npy" \
+    --checkpoint-path "checkpoints/openaudiio-s1-mini/codec.pth"
+```
+
+## HTTP API 추론
+
+추론을 위한 HTTP API를 제공합니다. 다음 명령으로 서버를 시작할 수 있습니다:
+
+```bash
+python -m tools.api_server \
+    --listen 0.0.0.0:8080 \
+    --llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
+    --decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
+    --decoder-config-name modded_dac_vq
+```
+
+> 추론을 가속화하려면 `--compile` 매개변수를 추가할 수 있습니다.
+
+그 후 http://127.0.0.1:8080/ 에서 API를 보고 테스트할 수 있습니다.
+
+## GUI 추론 
+[클라이언트 다운로드](https://github.com/AnyaCoder/fish-speech-gui/releases)
+
+## WebUI 추론
+
+다음 명령으로 WebUI를 시작할 수 있습니다:
+
+```bash
+python -m tools.run_webui \
+    --llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
+    --decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
+    --decoder-config-name modded_dac_vq
+```
+
+또는 간단히
+
+```bash
+python -m tools.run_webui
+```
+> 추론을 가속화하려면 `--compile` 매개변수를 추가할 수 있습니다.
+
+!!! note
+    라벨 파일과 참조 오디오 파일을 메인 디렉토리의 `references` 폴더에 미리 저장할 수 있습니다(직접 생성해야 함). 이렇게 하면 WebUI에서 직접 호출할 수 있습니다.
+
+!!! note
+    `GRADIO_SHARE`, `GRADIO_SERVER_PORT`, `GRADIO_SERVER_NAME`과 같은 Gradio 환경 변수를 사용하여 WebUI를 구성할 수 있습니다.
+
+즐기세요!

+ 0 - 0
docs/ko/inference.md


+ 107 - 0
docs/pt/index.md

@@ -0,0 +1,107 @@
+# Inferência
+
+Como o modelo vocoder foi alterado, você precisa de mais VRAM do que antes, sendo recomendado 12GB para inferência fluente.
+
+Suportamos linha de comando, API HTTP e WebUI para inferência, você pode escolher qualquer método que preferir.
+
+## Baixar Pesos
+
+Primeiro você precisa baixar os pesos do modelo:
+
+```bash
+huggingface-cli download fishaudio/openaudio-s1-mini --local-dir checkpoints/openaudio-s1-mini
+```
+
+## Inferência por Linha de Comando
+
+!!! note
+    Se você planeja deixar o modelo escolher aleatoriamente um timbre de voz, pode pular esta etapa.
+
+### 1. Obter tokens VQ do áudio de referência
+
+```bash
+python fish_speech/models/dac/inference.py \
+    -i "ref_audio_name.wav" \
+    --checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth"
+```
+
+Você deve obter um `fake.npy` e um `fake.wav`.
+
+### 2. Gerar tokens semânticos do texto:
+
+```bash
+python fish_speech/models/text2semantic/inference.py \
+    --text "O texto que você quer converter" \
+    --prompt-text "Seu texto de referência" \
+    --prompt-tokens "fake.npy" \
+    --checkpoint-path "checkpoints/openaudio-s1-mini" \
+    --num-samples 2 \
+    --compile # se você quiser uma velocidade mais rápida
+```
+
+Este comando criará um arquivo `codes_N` no diretório de trabalho, onde N é um inteiro começando de 0.
+
+!!! note
+    Você pode querer usar `--compile` para fundir kernels CUDA para inferência mais rápida (~30 tokens/segundo -> ~500 tokens/segundo).
+    Correspondentemente, se você não planeja usar aceleração, pode comentar o parâmetro `--compile`.
+
+!!! info
+    Para GPUs que não suportam bf16, você pode precisar usar o parâmetro `--half`.
+
+### 3. Gerar vocais a partir de tokens semânticos:
+
+#### Decodificador VQGAN
+
+!!! warning "Aviso Futuro"
+    Mantivemos a interface acessível do caminho original (tools/vqgan/inference.py), mas esta interface pode ser removida em versões subsequentes, então por favor altere seu código o mais breve possível.
+
+```bash
+python fish_speech/models/dac/inference.py \
+    -i "codes_0.npy" \
+    --checkpoint-path "checkpoints/openaudiio-s1-mini/codec.pth"
+```
+
+## Inferência com API HTTP
+
+Fornecemos uma API HTTP para inferência. Você pode usar o seguinte comando para iniciar o servidor:
+
+```bash
+python -m tools.api_server \
+    --listen 0.0.0.0:8080 \
+    --llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
+    --decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
+    --decoder-config-name modded_dac_vq
+```
+
+> Se você quiser acelerar a inferência, pode adicionar o parâmetro `--compile`.
+
+Depois disso, você pode visualizar e testar a API em http://127.0.0.1:8080/.
+
+## Inferência GUI 
+[Baixar cliente](https://github.com/AnyaCoder/fish-speech-gui/releases)
+
+## Inferência WebUI
+
+Você pode iniciar o WebUI usando o seguinte comando:
+
+```bash
+python -m tools.run_webui \
+    --llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
+    --decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
+    --decoder-config-name modded_dac_vq
+```
+
+Ou simplesmente
+
+```bash
+python -m tools.run_webui
+```
+> Se você quiser acelerar a inferência, pode adicionar o parâmetro `--compile`.
+
+!!! note
+    Você pode salvar o arquivo de rótulo e o arquivo de áudio de referência antecipadamente na pasta `references` no diretório principal (que você precisa criar), para que possa chamá-los diretamente no WebUI.
+
+!!! note
+    Você pode usar variáveis de ambiente do Gradio, como `GRADIO_SHARE`, `GRADIO_SERVER_PORT`, `GRADIO_SERVER_NAME` para configurar o WebUI.
+
+Divirta-se!

+ 0 - 0
docs/pt/inference.md


+ 107 - 0
docs/zh/index.md

@@ -0,0 +1,107 @@
+# 推理
+
+由于声码器模型已更改,您需要比以前更多的显存,建议使用12GB显存以便流畅推理。
+
+我们支持命令行、HTTP API 和 WebUI 进行推理,您可以选择任何您喜欢的方法。
+
+## 下载权重
+
+首先您需要下载模型权重:
+
+```bash
+huggingface-cli download fishaudio/openaudio-s1-mini --local-dir checkpoints/openaudio-s1-mini
+```
+
+## 命令行推理
+
+!!! note
+    如果您计划让模型随机选择音色,可以跳过此步骤。
+
+### 1. 从参考音频获取VQ tokens
+
+```bash
+python fish_speech/models/dac/inference.py \
+    -i "ref_audio_name.wav" \
+    --checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth"
+```
+
+您应该会得到一个 `fake.npy` 和一个 `fake.wav`。
+
+### 2. 从文本生成语义tokens:
+
+```bash
+python fish_speech/models/text2semantic/inference.py \
+    --text "您想要转换的文本" \
+    --prompt-text "您的参考文本" \
+    --prompt-tokens "fake.npy" \
+    --checkpoint-path "checkpoints/openaudio-s1-mini" \
+    --num-samples 2 \
+    --compile # 如果您想要更快的速度
+```
+
+此命令将在工作目录中创建一个 `codes_N` 文件,其中N是从0开始的整数。
+
+!!! note
+    您可能想要使用 `--compile` 来融合CUDA内核以获得更快的推理速度(约30 tokens/秒 -> 约500 tokens/秒)。
+    相应地,如果您不打算使用加速,可以去掉 `--compile` 参数。
+
+!!! info
+    对于不支持bf16的GPU,您可能需要使用 `--half` 参数。
+
+### 3. 从语义tokens生成人声:
+
+#### VQGAN 解码器
+
+!!! warning "未来警告"
+    我们保留了从原始路径(tools/vqgan/inference.py)访问的接口,但此接口可能在后续版本中被移除,请尽快更改您的代码。
+
+```bash
+python fish_speech/models/dac/inference.py \
+    -i "codes_0.npy" \
+    --checkpoint-path "checkpoints/openaudiio-s1-mini/codec.pth"
+```
+
+## HTTP API 推理
+
+我们提供HTTP API进行推理。您可以使用以下命令启动服务器:
+
+```bash
+python -m tools.api_server \
+    --listen 0.0.0.0:8080 \
+    --llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
+    --decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
+    --decoder-config-name modded_dac_vq
+```
+
+> 如果您想要加速推理,可以添加 `--compile` 参数。
+
+之后,您可以在 http://127.0.0.1:8080/ 查看和测试API。
+
+## GUI 推理 
+[下载客户端](https://github.com/AnyaCoder/fish-speech-gui/releases)
+
+## WebUI 推理
+
+您可以使用以下命令启动WebUI:
+
+```bash
+python -m tools.run_webui \
+    --llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
+    --decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
+    --decoder-config-name modded_dac_vq
+```
+
+或者简单地
+
+```bash
+python -m tools.run_webui
+```
+> 如果您想要加速推理,可以添加 `--compile` 参数。
+
+!!! note
+    您可以提前将标签文件和参考音频文件保存到主目录的 `references` 文件夹中(需要自己创建),这样就可以在WebUI中直接调用它们。
+
+!!! note
+    您可以使用Gradio环境变量,如 `GRADIO_SHARE`、`GRADIO_SERVER_PORT`、`GRADIO_SERVER_NAME` 来配置WebUI。
+
+尽情享受吧!

+ 0 - 0
docs/zh/inference.md


+ 6 - 20
mkdocs.yml

@@ -56,11 +56,8 @@ theme:
         code: Roboto Mono
 
 nav:
-  - Introduction: index.md
-  - Finetune: finetune.md
-  - Inference: inference.md
-  - Start Agent: start_agent.md
-  - Samples: samples.md
+  - Installation: en/install.md
+  - Inference: en/inference.md
 
 # Plugins
 plugins:
@@ -83,37 +80,26 @@ plugins:
           name: 简体中文
           build: true
           nav:
-            - 介绍: zh/index.md
-            - 微调: zh/finetune.md
+            - 安装: zh/install.md
             - 推理: zh/inference.md
-            - 启动Agent: zh/start_agent.md
-            - 例子: zh/samples.md
         - locale: ja
           name: 日本語
           build: true
           nav:
-            - Fish Speech の紹介: ja/index.md
-            - 微調整: ja/finetune.md
+            - インストール: ja/install.md
             - 推論: ja/inference.md
-            - スタートエージェント: ja/start_agent.md
-            - サンプル: ja/samples.md
         - locale: pt
           name: Português (Brasil)
           build: true
           nav:
-            - Introdução: pt/index.md
-            - Ajuste Fino: pt/finetune.md
+            - Instalação: pt/install.md
             - Inferência: pt/inference.md
-            - Agente inicial: pt/start_agent.md
-            - Amostras: pt/samples.md
         - locale: ko
           name: 한국어
           build: true
           nav:
-            - 소개: ko/index.md
-            - 파인튜닝: ko/finetune.md
+            - 설치: ko/install.md
             - 추론: ko/inference.md
-            - 샘플: ko/samples.md
 
 markdown_extensions:
   - pymdownx.highlight:

+ 2 - 2
tools/download_models.py

@@ -22,7 +22,7 @@ def check_and_download_files(repo_id, file_list, local_dir):
 
 
 # 1st
-repo_id_1 = "fishaudio/fish-speech-1.5"
+repo_id_1 = "fishaudio/openaudio-s1-mini"
 local_dir_1 = "./checkpoints/openaudio-s1-mini"
 files_1 = [
     ".gitattributes",
@@ -31,7 +31,7 @@ files_1 = [
     "special_tokens.json",
     "tokenizer.tiktoken",
     "config.json",
-    "firefly-gan-vq-fsq-8x1024-21hz-generator.pth",
+    "codec.pth",
 ]
 
 # 3rd

+ 0 - 169
tools/llama/build_dataset.py

@@ -1,169 +0,0 @@
-import itertools
-import os
-import re
-from collections import defaultdict
-from functools import partial
-from multiprocessing import Pool
-from pathlib import Path
-
-import click
-import numpy as np
-from loguru import logger
-from tqdm import tqdm
-
-from fish_speech.datasets.protos.text_data_pb2 import Semantics, Sentence, TextData
-from fish_speech.datasets.protos.text_data_stream import pack_pb_stream
-from fish_speech.utils.file import load_filelist
-
-# To avoid CPU overload
-os.environ["MKL_NUM_THREADS"] = "1"
-os.environ["OMP_NUM_THREADS"] = "1"
-
-
-def task_generator_folder(root: Path, text_extension: str):
-    files = list(tqdm(Path(root).rglob("*.npy"), desc=f"Loading {root}"))
-    files = sorted(files)
-
-    grouped_files = defaultdict(list)
-    for file in tqdm(files, desc=f"Grouping {root}"):
-        p = str(file.parent)
-        speaker = file.parent.name
-
-        try:
-            if isinstance(text_extension, str):
-                texts = [file.with_suffix(text_extension).read_text(encoding="utf-8")]
-            else:
-                texts = [
-                    file.with_suffix(ext).read_text(encoding="utf-8")
-                    for ext in text_extension
-                ]
-        except Exception as e:
-            logger.error(f"Failed to read text {file}: {e}")
-            continue
-
-        grouped_files[p].append((speaker, file, texts))
-
-    logger.info(
-        f"Found {len(grouped_files)} groups in {root}, {list(grouped_files.keys())[:5]}..."
-    )
-
-    for i in grouped_files.values():
-        subset = [(f, t) for _, f, t in i]
-        yield i[0][0], subset, "folder"
-
-
-def task_generator_filelist(filelist):
-    grouped_files = defaultdict(list)
-    for filename, speaker, _, text in load_filelist(filelist):
-        grouped_files[speaker].append((Path(filename), [text]))
-
-    logger.info(f"Found {len(grouped_files)} groups in {filelist}")
-    for speaker, values in grouped_files.items():
-        yield speaker, values, "filelist"
-
-
-def run_task(task):
-    name, subset, source = task
-
-    # Parse the files
-    sentences = []
-    for file, texts in subset:
-        np_file = file.with_suffix(".npy")
-        if np_file.exists() is False:
-            logger.warning(f"Can't find {np_file}")
-            continue
-
-        new_texts = []
-
-        for text in texts:
-            # Simple cleaning: replace { xxx } and < xxx > with space
-            text = re.sub(r"\{.*?\}", " ", text)
-            text = re.sub(r"<.*?>", " ", text)
-            text = re.sub(r"\s+", " ", text)
-            new_texts.append(text)
-
-        try:
-            semantics = np.load(np_file)
-        except Exception as e:
-            logger.error(f"Failed to parse {file}: {e}")
-            continue
-
-        if isinstance(semantics, np.ndarray):
-            semantics = semantics.tolist()
-
-        sentences.append(
-            Sentence(
-                texts=new_texts,
-                semantics=[Semantics(values=s) for s in semantics],
-            )
-        )
-
-    # Pack the sentences
-    return pack_pb_stream(
-        TextData(
-            source=source,
-            name=name,
-            sentences=sentences,
-        )
-    )
-
-
-@click.command()
-@click.option(
-    "--input",
-    type=click.Path(path_type=Path),
-    required=True,
-    help="A folder containing the dataset or a filelist",
-    multiple=True,
-)
-@click.option(
-    "--output", type=click.Path(path_type=Path), default="data/quantized-dataset-ft"
-)
-@click.option("--num-workers", type=int, default=16)
-@click.option("--text-extension", type=str, default=[".txt"], multiple=True)
-@click.option(
-    "--shard-size", type=int, default=10, help="The maximum size of each shard in mb"
-)
-def main(input, output, num_workers, text_extension, shard_size):
-    generator_fns = []
-
-    for f in input:
-        assert f.exists(), f"{f} not found"
-
-        if f.is_dir():
-            generator_fn = task_generator_folder(f, text_extension)
-        else:
-            generator_fn = task_generator_filelist(f)
-
-        generator_fns.append(generator_fn)
-
-    generator_fn = itertools.chain(*generator_fns)
-    output.mkdir(parents=True, exist_ok=True)
-
-    dataset_fp = None
-    tar_idx = 0
-    written_size = 0
-
-    with Pool(num_workers) as p:
-        for result in tqdm(p.imap_unordered(run_task, generator_fn)):
-            if dataset_fp is None:
-                dataset_fp = open(Path(output) / f"{tar_idx:08d}.protos", "wb")
-
-            dataset_fp.write(result)
-            written_size += len(result)
-
-            if written_size > shard_size * 1024 * 1024:
-                logger.info(f"Finished writing {tar_idx} shards to {output}")
-                dataset_fp.close()
-                dataset_fp = None
-                written_size = 0
-                tar_idx += 1
-
-    if dataset_fp is not None:
-        dataset_fp.close()
-
-    logger.info(f"Finished writing {tar_idx + 1} shards to {output}")
-
-
-if __name__ == "__main__":
-    main()

+ 0 - 171
tools/llama/eval_in_context.py

@@ -1,171 +0,0 @@
-import pyrootutils
-import torch
-import torch.nn.functional as F
-from matplotlib import pyplot as plt
-from transformers import AutoTokenizer
-
-# register eval resolver and root
-pyrootutils.setup_root(__file__, indicator=".project-root", pythonpath=True)
-
-from torch.utils.data import DataLoader
-
-from fish_speech.datasets.semantic import AutoAugTextDataset, TextDataCollator
-from fish_speech.models.text2semantic.inference import load_model
-
-
-def smooth(
-    scalars: list[float], weight: float
-) -> list[float]:  # Weight between 0 and 1
-    last = scalars[0]  # First value in the plot (first timestep)
-    smoothed = list()
-    for point in scalars:
-        smoothed_val = last * weight + (1 - weight) * point  # Calculate smoothed value
-        smoothed.append(smoothed_val)  # Save it
-        last = smoothed_val  # Anchor the last smoothed value
-
-    return smoothed
-
-
-@torch.inference_mode()
-def analyze_one_model(loader, config, weight, max_length):
-    device = "cuda" if torch.cuda.is_available() else "cpu"
-    model = load_model(
-        config,
-        weight,
-        device,
-        torch.bfloat16,
-        max_length,
-        compile=False,
-    )[0]
-
-    current_step = 0
-    model.eval()
-
-    semantic_loss_sum = torch.zeros(
-        max_length,
-        dtype=torch.float32,
-        device=device,
-    )
-    counter = torch.zeros(
-        max_length,
-        dtype=torch.long,
-        device=device,
-    )
-
-    for batch in loader:
-        batch = {k: v.to(device) for k, v in batch.items()}
-
-        labels = batch["labels"]
-        outputs = model(
-            inp=batch["inputs"],
-            key_padding_mask=batch["attention_masks"],
-        )
-
-        token_logits = outputs.token_logits
-        codebook_logits = outputs.codebook_logits
-
-        # Generate labels
-        base_loss = F.cross_entropy(
-            token_logits.reshape(-1, token_logits.size(-1)),
-            labels[:, 0].reshape(-1),
-            ignore_index=-100,
-            reduction="none",
-        )
-
-        codebook_labels = labels[:, 1 : 1 + model.config.num_codebooks].mT
-        semantic_loss = F.cross_entropy(
-            codebook_logits.reshape(-1, codebook_logits.size(-1)),
-            codebook_labels.reshape(-1),
-            ignore_index=-100,
-            reduction="none",
-        )
-
-        base_loss = base_loss.reshape(labels[:, 0].shape)
-        semantic_loss = semantic_loss.reshape(codebook_labels.shape)
-
-        semantic_loss_frame = semantic_loss.mean(-1)
-        pad_pos = codebook_labels.sum(-1) == -100 * model.config.num_codebooks
-
-        for loss_sample, pad in zip(semantic_loss_frame, pad_pos):
-            semantic_loss_sum[~pad] += loss_sample[~pad]
-            counter[~pad] += 1
-
-        current_step += 1
-        if current_step == 10:
-            break
-
-    semantic_loss = semantic_loss.cpu()
-    counter = counter.cpu()
-    xs, ys = [], []
-
-    for i, (loss, count) in enumerate(zip(semantic_loss_sum, counter)):
-        if count > 0:
-            xs.append(i)
-            ys.append((loss / count).item())  # for better loss visualization
-
-    smoothed_ys = smooth(ys, 0.95)
-
-    # Unload model
-    del model
-    torch.cuda.empty_cache()
-
-    return xs, ys, smoothed_ys
-
-
-def main():
-    tokenizer = AutoTokenizer.from_pretrained("fishaudio/fish-speech-1")
-    max_length = 4096
-
-    ds = AutoAugTextDataset(
-        ["data/protos/sft/云天河"],
-        tokenizer=tokenizer,
-        use_speaker=False,
-        interactive_prob=1.0,
-        max_length=max_length,
-    )
-
-    loader = DataLoader(
-        ds,
-        batch_size=8,
-        collate_fn=TextDataCollator(tokenizer, max_length=max_length),
-        num_workers=0,
-        shuffle=False,
-    )
-
-    plt.figure(figsize=(10, 5), dpi=200)
-
-    plt.xlabel("Frame")
-    plt.ylabel("Loss")
-    plt.yscale("log")
-    plt.title("Semantic Loss")
-    plt.grid(which="both", axis="both")
-    plt.xlim(0, max_length)
-
-    tests = [
-        (
-            "pertrain-medium",
-            "dual_ar_2_codebook_medium",
-            "checkpoints/text2semantic-pretrain-medium-2k-v1.pth",
-        ),
-        (
-            "sft-medium",
-            "dual_ar_2_codebook_medium",
-            "checkpoints/text2semantic-sft-medium-v1.1-4k.pth",
-        ),
-        (
-            "sft-large",
-            "dual_ar_2_codebook_large",
-            "checkpoints/text2semantic-sft-large-v1.1-4k.pth",
-        ),
-    ]
-
-    for name, config, weight in tests:
-        xs, _, smoothed_ys = analyze_one_model(loader, config, weight, max_length)
-        plt.plot(xs, smoothed_ys, label=name)
-
-    plt.legend()
-    plt.savefig("semantic_loss.png")
-
-
-if __name__ == "__main__":
-    main()

+ 0 - 96
tools/llama/merge_lora.py

@@ -1,96 +0,0 @@
-import shutil
-from copy import deepcopy
-from pathlib import Path
-
-import click
-import hydra
-import torch
-from hydra import compose, initialize
-from hydra.utils import instantiate
-from loguru import logger
-
-from fish_speech.models.text2semantic.llama import BaseTransformer
-from fish_speech.models.text2semantic.lora import get_merged_state_dict
-
-
-@click.command()
-@click.option("--lora-config", type=str, default="r_8_alpha_16")
-@click.option("--base-weight", type=str, default="checkpoints/fish-speech-1.4")
-@click.option("--lora-weight", type=str, required=True)
-@click.option("--output", type=str, required=True)
-def merge(lora_config, base_weight, lora_weight, output):
-    output = Path(output)
-    logger.info(
-        f"Merging {base_weight} and {lora_weight} into {output} with {lora_config}"
-    )
-
-    with initialize(version_base="1.3", config_path="../../fish_speech/configs/lora"):
-        cfg = compose(config_name=lora_config)
-
-    lora_config = instantiate(cfg)
-    logger.info(f"Loaded lora model with config {lora_config}")
-
-    llama_model = BaseTransformer.from_pretrained(
-        path=base_weight,
-        load_weights=True,
-        lora_config=lora_config,
-    )
-    logger.info(f"Loaded llama model")
-
-    llama_state_dict = llama_model.state_dict()
-    llama_state_dict = {k: v for k, v in llama_state_dict.items() if "lora" not in k}
-    llama_state_dict_copy = deepcopy(llama_state_dict)
-    lora_state_dict = torch.load(lora_weight, map_location="cpu", weights_only=False)
-
-    if "state_dict" in llama_state_dict:
-        llama_state_dict = llama_state_dict["state_dict"]
-
-    if "state_dict" in lora_state_dict:
-        lora_state_dict = lora_state_dict["state_dict"]
-
-    # remove prefix model.
-    if any(k.startswith("model.") for k in llama_state_dict.keys()):
-        llama_state_dict = {
-            k.replace("model.", ""): v
-            for k, v in llama_state_dict.items()
-            if k.startswith("model.")
-        }
-    if any(k.startswith("model.") for k in lora_state_dict.keys()):
-        lora_state_dict = {
-            k.replace("model.", ""): v
-            for k, v in lora_state_dict.items()
-            if k.startswith("model.")
-        }
-
-    logger.info(f"Found {len(llama_state_dict)} keys in llama model")
-    logger.info(f"Found {len(lora_state_dict)} keys in lora model")
-
-    merged_state_dict = llama_state_dict | lora_state_dict
-    llama_model.load_state_dict(merged_state_dict, strict=True)
-    logger.info(f"Merged model loaded")
-
-    # Trigger eval mode to merge lora
-    llama_model.eval()
-    llama_model.save_pretrained(output, drop_lora=True)
-    logger.info(f"Saved merged model to {output}, validating")
-
-    new_state_dict = torch.load(output / "model.pth", map_location="cpu")
-    original_keys = set(llama_state_dict_copy.keys())
-
-    tolerance = 1e-5
-    for key in original_keys:
-        diff_l1 = (new_state_dict[key] - llama_state_dict_copy[key]).abs().sum().item()
-        if diff_l1 > tolerance:
-            logger.info(f"Significant difference found in key: {key}")
-            break
-
-    if diff_l1 <= tolerance:
-        logger.warning(
-            "Merged model seems identical to the original model. Further validation might be needed."
-        )
-    else:
-        logger.info("Merged model is different from the original model, check passed")
-
-
-if __name__ == "__main__":
-    merge()