2 years ago · f2c7eedf6f
--- a/docs/en/index.md
+++ b/docs/en/index.md
@@ -18,7 +18,7 @@ We assume no responsibility for any illegal use of the codebase. Please refer to
 
				 This codebase is released under the `BSD-3-Clause` license, and all models are released under the CC-BY-NC-SA-4.0 license.
			
 
				 
			
 
				 <p align="center">
			
 
				-<img src="/docs/assets/figs/diagram.png" width="75%">
			
 
				+   <img src="/docs/assets/figs/diagram.png" width="75%">
			
 
				 </p>
			
 
				 
			
 
				 ## Requirements
			
--- a/docs/en/inference.md
+++ b/docs/en/inference.md
@@ -13,7 +13,7 @@ Inference support command line, HTTP API and web UI.
 
				 ## Command Line Inference
			
 
				 
			
 
				 Download the required `vqgan` and `llama` models from our Hugging Face repository.
			
 
				-    
			
 
				+
			
 
				 ```bash
			
 
				 huggingface-cli download fishaudio/fish-speech-1.2 --local-dir checkpoints/fish-speech-1.2
			
 
				 ```
			
@@ -28,9 +28,11 @@ python tools/vqgan/inference.py \
 
				     -i "paimon.wav" \
			
 
				     --checkpoint-path "checkpoints/fish-speech-1.2/firefly-gan-vq-fsq-4x1024-42hz-generator.pth"
			
 
				 ```
			
 
				+
			
 
				 You should get a `fake.npy` file.
			
 
				 
			
 
				 ### 2. Generate semantic tokens from text:
			
 
				+
			
 
				 ```bash
			
 
				 python tools/llama/generate.py \
			
 
				     --text "The text you want to convert" \
			
@@ -53,6 +55,7 @@ This command will create a `codes_N` file in the working directory, where N is a
 
				 ### 3. Generate vocals from semantic tokens:
			
 
				 
			
 
				 #### VQGAN Decoder (not recommended)
			
 
				+
			
 
				 ```bash
			
 
				 python tools/vqgan/inference.py \
			
 
				     -i "codes_0.npy" \
			
@@ -69,10 +72,69 @@ python -m tools.api \
 
				     --llama-checkpoint-path "checkpoints/fish-speech-1.2" \
			
 
				     --decoder-checkpoint-path "checkpoints/fish-speech-1.2/firefly-gan-vq-fsq-4x1024-42hz-generator.pth" \
			
 
				     --decoder-config-name firefly_gan_vq
			
 
				+```
			
 
				 
			
 
				 If you want to speed up inference, you can add the `--compile` parameter.
			
 
				 
			
 
				-After that, you can view and test the API at http://127.0.0.1:8000/.  
			
 
				+After that, you can view and test the API at http://127.0.0.1:8000/.
			
 
				+
			
 
				+Below is an example of sending a request using `tools/post_api.py`.
			
 
				+
			
 
				+```bash
			
 
				+python -m tools.post_api \
			
 
				+    --text "Text to be input" \
			
 
				+    --reference_audio "Path to reference audio" \
			
 
				+    --reference_text "Text content of the reference audio" \
			
 
				+    --streaming True
			
 
				+```
			
 
				+
			
 
				+The command above synthesizes the desired audio based on the reference audio and streams the result back.
			
 
				+
			
 
				+If you need to randomly select reference audio based on `{SPEAKER}` and `{EMOTION}`, configure it according to the following steps:
			
 
				+
			
 
				+### 1. Create a `ref_data` folder in the root directory of the project.
			
 
				+
			
 
				+### 2. Create a directory structure similar to the following within the `ref_data` folder.
			
 
				+
			
 
				+```
			
 
				+.
			
 
				+├── SPEAKER1
			
 
				+│    ├──EMOTION1
			
 
				+│    │    ├── 21.15-26.44.lab
			
 
				+│    │    ├── 21.15-26.44.wav
			
 
				+│    │    ├── 27.51-29.98.lab
			
 
				+│    │    ├── 27.51-29.98.wav
			
 
				+│    │    ├── 30.1-32.71.lab
			
 
				+│    │    └── 30.1-32.71.flac
			
 
				+│    └──EMOTION2
			
 
				+│         ├── 30.1-32.71.lab
			
 
				+│         └── 30.1-32.71.mp3
			
 
				+└── SPEAKER2
			
 
				+    └─── EMOTION3
			
 
				+          ├── 30.1-32.71.lab
			
 
				+          └── 30.1-32.71.mp3
			
 
				+```
			
 
				+
			
 
				+That is, first place `{SPEAKER}` folders in `ref_data`, then place `{EMOTION}` folders under each speaker, and place any number of `audio-text pairs` under each emotion folder.
			
 
				+
			
 
				+### 3. Run the following command in the virtual environment to generate the reference directory.
			
 
				+
			
 
				+```bash
			
 
				+python tools/gen_ref.py
			
 
				+
			
 
				+```
			
 
				+
			
 
				+### 4. Call the API.
			
 
				+
			
 
				+```bash
			
 
				+python -m tools.post_api \
			
 
				+    --text "Text to be input" \
			
 
				+    --speaker "${SPEAKER1}" \
			
 
				+    --emotion "${EMOTION1}" \
			
 
				+    --streaming True
			
 
				+```
			
 
				+
			
 
				+The above example is for testing purposes only.
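Review note: the `ref_data` layout from step 2 can be sanity-checked before running `tools/gen_ref.py`. The sketch below is only an illustration. The helper name `collect_pairs`, the set of accepted audio extensions, and the pairing rule (an audio file and a `.lab` file sharing the same stem) are assumptions read off the example tree, not behavior taken from `tools/gen_ref.py`.

```python
from pathlib import Path

# Assumed audio extensions, taken from the example tree (.wav, .flac, .mp3).
AUDIO_EXTS = {".wav", ".flac", ".mp3"}


def collect_pairs(ref_root: Path) -> dict[tuple[str, str], list[tuple[Path, Path]]]:
    """Map (speaker, emotion) to the (audio, label) paths that share a stem."""
    pairs: dict[tuple[str, str], list[tuple[Path, Path]]] = {}
    for speaker_dir in sorted(p for p in ref_root.iterdir() if p.is_dir()):
        for emotion_dir in sorted(p for p in speaker_dir.iterdir() if p.is_dir()):
            found = []
            for audio in sorted(emotion_dir.iterdir()):
                if audio.suffix.lower() not in AUDIO_EXTS:
                    continue
                # "30.1-32.71.flac" -> "30.1-32.71.lab": only the last suffix changes.
                label = audio.with_suffix(".lab")
                if label.exists():
                    found.append((audio, label))
            pairs[(speaker_dir.name, emotion_dir.name)] = found
    return pairs
```

Calling `collect_pairs(Path("ref_data"))` would list, per speaker and emotion, which audio files have a matching transcript; audio files without a `.lab` partner are silently skipped in this sketch.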
			
 
				 
			
 
				 ## WebUI Inference
			
 
				 
			
--- a/docs/ja/index.md
+++ b/docs/ja/index.md
@@ -13,24 +13,24 @@
 
				 </div>
			
 
				 
			
 
				 !!! warning
			
 
				-私たちは、コードベースの違法な使用について一切の責任を負いません。お住まいの地域のDMCA（デジタルミレニアム著作権法）およびその他の関連法については、現地の法律を参照してください。
			
 
				+私たちは、コードベースの違法な使用について一切の責任を負いません。お住まいの地域の DMCA（デジタルミレニアム著作権法）およびその他の関連法については、現地の法律を参照してください。
			
 
				 
			
 
				 このコードベースは `BSD-3-Clause` ライセンスの下でリリースされており、すべてのモデルは CC-BY-NC-SA-4.0 ライセンスの下でリリースされています。
			
 
				 
			
 
				 <p align="center">
			
 
				-<img src="/docs/assets/figs/diagram.png" width="75%">
			
 
				+   <img src="/docs/assets/figs/diagram.png" width="75%">
			
 
				 </p>
			
 
				 
			
 
				 ## 要件
			
 
				 
			
 
				-- GPUメモリ: 4GB（推論用）、16GB（微調整用）
			
 
				+- GPU メモリ: 4GB（推論用）、16GB（微調整用）
			
 
				 - システム: Linux、Windows
			
 
				 
			
 
				-## Windowsセットアップ
			
 
				+## Windows セットアップ
			
 
				 
			
 
				-Windowsのプロユーザーは、コードベースを実行するためにWSL2またはDockerを検討することができます。
			
 
				+Windows のプロユーザーは、コードベースを実行するために WSL2 または Docker を検討することができます。
			
 
				 
			
 
				-非プロのWindowsユーザーは、Linux環境なしでコードベースを実行するために以下の方法を検討することができます（モデルコンパイル機能付き、つまり `torch.compile`）：
			
 
				+非プロの Windows ユーザーは、Linux 環境なしでコードベースを実行するために以下の方法を検討することができます（モデルコンパイル機能付き、つまり `torch.compile`）：
			
 
				 
			
 
				 <ol>
			
 
				    <li>プロジェクトパッケージを解凍します。</li>
			
@@ -88,7 +88,7 @@ Windowsのプロユーザーは、コードベースを実行するためにWSL2
 
				    <li>（オプション）<code>run_cmd.bat</code>をダブルクリックして、このプロジェクトのconda/pythonコマンドライン環境に入ります。</li>
			
 
				 </ol>
			
 
				 
			
 
				-## Linuxセットアップ
			
 
				+## Linux セットアップ
			
 
				 
			
 
				 ```bash
			
 
				 # python 3.10仮想環境を作成します。virtualenvも使用できます。
			
@@ -107,15 +107,15 @@ apt install libsox-dev
 
				 
			
 
				 ## 変更履歴
			
 
				 
			
 
				-- 2024/07/02: Fish-Speechを1.2バージョンに更新し、VITSデコーダーを削除し、ゼロショット能力を大幅に強化しました。
			
 
				-- 2024/05/10: Fish-Speechを1.1バージョンに更新し、VITSデコーダーを実装してWERを減少させ、音色の類似性を向上させました。
			
 
				-- 2024/04/22: Fish-Speech 1.0バージョンを完成させ、VQGANおよびLLAMAモデルを大幅に修正しました。
			
 
				+- 2024/07/02: Fish-Speech を 1.2 バージョンに更新し、VITS デコーダーを削除し、ゼロショット能力を大幅に強化しました。
			
 
				+- 2024/05/10: Fish-Speech を 1.1 バージョンに更新し、VITS デコーダーを実装して WER を減少させ、音色の類似性を向上させました。
			
 
				+- 2024/04/22: Fish-Speech 1.0 バージョンを完成させ、VQGAN および LLAMA モデルを大幅に修正しました。
			
 
				 - 2023/12/28: `lora`微調整サポートを追加しました。
			
 
				 - 2023/12/27: `gradient checkpointing`、`causal sampling`、および`flash-attn`サポートを追加しました。
			
 
				-- 2023/12/19: webuiおよびHTTP APIを更新しました。
			
 
				+- 2023/12/19: webui および HTTP API を更新しました。
			
 
				 - 2023/12/18: 微調整ドキュメントおよび関連例を更新しました。
			
 
				 - 2023/12/17: `text2semantic`モデルを更新し、音素フリーモードをサポートしました。
			
 
				-- 2023/12/13: ベータ版をリリースし、VQGANモデルおよびLLAMAに基づく言語モデル（音素のみサポート）を含みます。
			
 
				+- 2023/12/13: ベータ版をリリースし、VQGAN モデルおよび LLAMA に基づく言語モデル（音素のみサポート）を含みます。
			
 
				 
			
 
				 ## 謝辞
			
 
				 
			
--- a/docs/ja/inference.md
+++ b/docs/ja/inference.md
@@ -1,6 +1,6 @@
 
				 # 推論
			
 
				 
			
 
				-推論は、コマンドライン、HTTP API、およびWeb UIをサポートしています。
			
 
				+推論は、コマンドライン、HTTP API、および Web UI をサポートしています。
			
 
				 
			
 
				 !!! note
			
 
				     全体として、推論は次のいくつかの部分で構成されています：
			
@@ -12,8 +12,8 @@
 
				 
			
 
				 ## コマンドライン推論
			
 
				 
			
 
				-必要な`vqgan`および`llama`モデルをHugging Faceリポジトリからダウンロードします。
			
 
				-    
			
 
				+必要な`vqgan`および`llama`モデルを Hugging Face リポジトリからダウンロードします。
			
 
				+
			
 
				 ```bash
			
 
				 huggingface-cli download fishaudio/fish-speech-1.2 --local-dir checkpoints/fish-speech-1.2
			
 
				 ```
			
@@ -28,9 +28,11 @@ python tools/vqgan/inference.py \
 
				     -i "paimon.wav" \
			
 
				     --checkpoint-path "checkpoints/fish-speech-1.2/firefly-gan-vq-fsq-4x1024-42hz-generator.pth"
			
 
				 ```
			
 
				+
			
 
				 `fake.npy`ファイルが生成されるはずです。
			
 
				 
			
 
				 ### 2. テキストからセマンティックトークンを生成する：
			
 
				+
			
 
				 ```bash
			
 
				 python tools/llama/generate.py \
			
 
				     --text "変換したいテキスト" \
			
@@ -41,27 +43,28 @@ python tools/llama/generate.py \
 
				     --compile
			
 
				 ```
			
 
				 
			
 
				-このコマンドは、作業ディレクトリに`codes_N`ファイルを作成します。ここで、Nは0から始まる整数です。
			
 
				+このコマンドは、作業ディレクトリに`codes_N`ファイルを作成します。ここで、N は 0 から始まる整数です。
			
 
				 
			
 
				 !!! note
			
 
				-    `--compile`を使用してCUDAカーネルを融合し、より高速な推論を実現することができます（約30トークン/秒 -> 約500トークン/秒）。
			
 
				+    `--compile`を使用して CUDA カーネルを融合し、より高速な推論を実現することができます（約 30 トークン/秒 -> 約 500 トークン/秒）。
			
 
				     それに対応して、加速を使用しない場合は、`--compile`パラメータをコメントアウトできます。
			
 
				 
			
 
				 !!! info
			
 
				-    bf16をサポートしていないGPUの場合、`--half`パラメータを使用する必要があるかもしれません。
			
 
				+    bf16 をサポートしていない GPU の場合、`--half`パラメータを使用する必要があるかもしれません。
			
 
				 
			
 
				 ### 3. セマンティックトークンから音声を生成する：
			
 
				 
			
 
				-#### VQGANデコーダー（推奨されません）
			
 
				+#### VQGAN デコーダー（推奨されません）
			
 
				+
			
 
				 ```bash
			
 
				 python tools/vqgan/inference.py \
			
 
				     -i "codes_0.npy" \
			
 
				     --checkpoint-path "checkpoints/fish-speech-1.2/firefly-gan-vq-fsq-4x1024-42hz-generator.pth"
			
 
				 ```
			
 
				 
			
 
				-## HTTP API推論
			
 
				+## HTTP API 推論
			
 
				 
			
 
				-推論のためのHTTP APIを提供しています。次のコマンドを使用してサーバーを起動できます：
			
 
				+推論のための HTTP API を提供しています。次のコマンドを使用してサーバーを起動できます：
			
 
				 
			
 
				 ```bash
			
 
				 python -m tools.api \
			
@@ -69,14 +72,75 @@ python -m tools.api \
 
				     --llama-checkpoint-path "checkpoints/fish-speech-1.2" \
			
 
				     --decoder-checkpoint-path "checkpoints/fish-speech-1.2/firefly-gan-vq-fsq-4x1024-42hz-generator.pth" \
			
 
				     --decoder-config-name firefly_gan_vq
			
 
				+```
			
 
				+
			
 
				+推論を高速化したい場合は、--compile パラメータを追加できます。
			
 
				+
			
 
				+その後、`http://127.0.0.1:8000/`で API を表示およびテストできます。
			
 
				+
			
 
				+以下は、`tools/post_api.py` を使用してリクエストを送信する例です。
			
 
				+
			
 
				+```bash
			
 
				+python -m tools.post_api \
				+    --text "入力するテキスト" \
				+    --reference_audio "参照音声のパス" \
				+    --reference_text "参照音声のテキスト内容" \
				+    --streaming True
			
 
				+```
			
 
				+
			
 
				+上記のコマンドは、参照音声の情報に基づいて必要な音声を合成し、ストリーミング方式で返すことを示しています。
			
 
				+
			
 
				+`{SPEAKER}`と`{EMOTION}`に基づいて参照音声をランダムに選択する必要がある場合は、以下の手順に従って設定します：
			
 
				+
			
 
				+### 1. プロジェクトのルートディレクトリに`ref_data`フォルダを作成します。
			
 
				 
			
 
				-推論を高速化したい場合は、--compileパラメータを追加できます。
			
 
				+### 2. `ref_data`フォルダ内に次のような構造のディレクトリを作成します。
			
 
				+
			
 
				+```
			
 
				+.
			
 
				+├── SPEAKER1
			
 
				+│    ├──EMOTION1
			
 
				+│    │    ├── 21.15-26.44.lab
			
 
				+│    │    ├── 21.15-26.44.wav
			
 
				+│    │    ├── 27.51-29.98.lab
			
 
				+│    │    ├── 27.51-29.98.wav
			
 
				+│    │    ├── 30.1-32.71.lab
			
 
				+│    │    └── 30.1-32.71.flac
			
 
				+│    └──EMOTION2
			
 
				+│         ├── 30.1-32.71.lab
			
 
				+│         └── 30.1-32.71.mp3
			
 
				+└── SPEAKER2
			
 
				+    └─── EMOTION3
			
 
				+          ├── 30.1-32.71.lab
			
 
				+          └── 30.1-32.71.mp3
			
 
				+
			
 
				+```
			
 
				+
			
 
				+つまり、まず`ref_data`に`{SPEAKER}`フォルダを配置し、各スピーカーの下に`{EMOTION}`フォルダを配置し、各感情フォルダの下に任意の数の音声-テキストペアを配置します。
			
 
				+
			
 
				+### 3. 仮想環境で以下のコマンドを入力します。
			
 
				+
			
 
				+```bash
			
 
				+python tools/gen_ref.py
			
 
				+
			
 
				+```
			
 
				+
			
 
				+参照ディレクトリを生成します。
			
 
				+
			
 
				+### 4. API を呼び出します。
			
 
				+
			
 
				+```bash
			
 
				+python -m tools.post_api \
			
 
				+    --text "入力するテキスト" \
			
 
				+    --speaker "${SPEAKER1}" \
			
 
				+    --emotion "${EMOTION1}" \
			
 
				+    --streaming True
			
 
				+
			
 
				+```
			
 
				 
			
 
				-その後、http://127.0.0.1:8000/でAPIを表示およびテストできます。
			
 
				+上記の例はテスト目的のみです。
			
 
				 
			
 
				-## WebUI推論
			
 
				+## WebUI 推論
			
 
				 
			
 
				-次のコマンドを使用してWebUIを起動できます：
			
 
				+次のコマンドを使用して WebUI を起動できます：
			
 
				 
			
 
				 ```bash
			
 
				 python -m tools.webui \
			
@@ -86,6 +150,6 @@ python -m tools.webui \
 
				 ```
			
 
				 
			
 
				 !!! note
			
 
				-    Gradio環境変数（`GRADIO_SHARE`、`GRADIO_SERVER_PORT`、`GRADIO_SERVER_NAME`など）を使用してWebUIを構成できます。
			
 
				+    Gradio 環境変数（`GRADIO_SHARE`、`GRADIO_SERVER_PORT`、`GRADIO_SERVER_NAME`など）を使用して WebUI を構成できます。
			
 
				 
			
 
				 お楽しみください！
			
--- a/docs/zh/index.md
+++ b/docs/zh/index.md
@@ -18,7 +18,7 @@
 
				 此代码库根据 `BSD-3-Clause` 许可证发布, 所有模型根据 CC-BY-NC-SA-4.0 许可证发布.
			
 
				 
			
 
				 <p align="center">
			
 
				-<img src="/docs/assets/figs/diagram.png" width="75%">
			
 
				+   <img src="/docs/assets/figs/diagram.png" width="75%">
			
 
				 </p>
			
 
				 
			
 
				 ## 要求
			
@@ -107,7 +107,7 @@ apt install libsox-dev
 
				 
			
 
				 ## 更新日志
			
 
				 
			
 
				-- 2024/07/02: 更新了 Fish-Speech 到 1.2 版本，移除 VITS Decoder，同时极大幅度提升zero-shot能力.
			
 
				+- 2024/07/02: 更新了 Fish-Speech 到 1.2 版本，移除 VITS Decoder，同时极大幅度提升 zero-shot 能力.
			
 
				 - 2024/05/10: 更新了 Fish-Speech 到 1.1 版本，引入了 VITS Decoder 来降低口胡和提高音色相似度.
			
 
				 - 2024/04/22: 完成了 Fish-Speech 1.0 版本, 大幅修改了 VQGAN 和 LLAMA 模型.
			
 
				 - 2023/12/28: 添加了 `lora` 微调支持.
			
--- a/docs/zh/inference.md
+++ b/docs/zh/inference.md
@@ -1,30 +1,30 @@
 
				 # 推理
			
 
				 
			
 
				-推理支持命令行, http api, 以及 webui 三种方式.  
			
 
				+推理支持命令行, http api, 以及 webui 三种方式.
			
 
				 
			
 
				 !!! note
			
 
				-    总的来说, 推理分为几个部分:  
			
 
				+    总的来说, 推理分为几个部分:
			
 
				 
			
 
				-    1. 给定一段 ~10 秒的语音, 将它用 VQGAN 编码.  
			
 
				-    2. 将编码后的语义 token 和对应文本输入语言模型作为例子.  
			
 
				-    3. 给定一段新文本, 让模型生成对应的语义 token.  
			
 
				-    4. 将生成的语义 token 输入 VQGAN 解码, 生成对应的语音.  
			
 
				+    1. 给定一段 ~10 秒的语音, 将它用 VQGAN 编码.
			
 
				+    2. 将编码后的语义 token 和对应文本输入语言模型作为例子.
			
 
				+    3. 给定一段新文本, 让模型生成对应的语义 token.
			
 
				+    4. 将生成的语义 token 输入 VQGAN 解码, 生成对应的语音.
			
 
				 
			
 
				 ## 命令行推理
			
 
				 
			
 
				 从我们的 huggingface 仓库下载所需的 `vqgan` 和 `llama` 模型。
			
 
				-    
			
 
				+
			
 
				 ```bash
			
 
				 huggingface-cli download fishaudio/fish-speech-1.2 --local-dir checkpoints/fish-speech-1.2
			
 
				 ```
			
 
				 
			
 
				-对于中国大陆用户，可使用mirror下载。
			
 
				+对于中国大陆用户，可使用 mirror 下载。
			
 
				 
			
 
				 ```bash
			
 
				 HF_ENDPOINT=https://hf-mirror.com huggingface-cli download fishaudio/fish-speech-1.2 --local-dir checkpoints/fish-speech-1.2
			
 
				 ```
			
 
				 
			
 
				-### 1. 从语音生成 prompt: 
			
 
				+### 1. 从语音生成 prompt:
			
 
				 
			
 
				 !!! note
			
 
				     如果你打算让模型随机选择音色, 你可以跳过这一步.
			
@@ -34,9 +34,11 @@ python tools/vqgan/inference.py \
 
				     -i "paimon.wav" \
			
 
				     --checkpoint-path "checkpoints/fish-speech-1.2/firefly-gan-vq-fsq-4x1024-42hz-generator.pth"
			
 
				 ```
			
 
				+
			
 
				 你应该能得到一个 `fake.npy` 文件.
			
 
				 
			
 
				-### 2. 从文本生成语义 token: 
			
 
				+### 2. 从文本生成语义 token:
			
 
				+
			
 
				 ```bash
			
 
				 python tools/llama/generate.py \
			
 
				     --text "要转换的文本" \
			
@@ -56,10 +58,10 @@ python tools/llama/generate.py \
 
				 !!! info
			
 
				     对于不支持 bf16 的 GPU, 你可能需要使用 `--half` 参数.
			
 
				 
			
 
				+### 3. 从语义 token 生成人声:
			
 
				 
			
 
				-### 3. 从语义 token 生成人声: 
			
 
				+#### VQGAN 解码
			
 
				 
			
 
				-#### VQGAN 解码 
			
 
				 ```bash
			
 
				 python tools/vqgan/inference.py \
			
 
				     -i "codes_0.npy" \
			
@@ -85,6 +87,65 @@ HF_ENDPOINT=https://hf-mirror.com python -m ...
 
				 
			
 
				 随后, 你可以在 `http://127.0.0.1:8000/` 中查看并测试 API.
			
 
				 
			
 
				+下面是使用`tools/post_api.py`发送请求的示例。
			
 
				+
			
 
				+```bash
			
 
				+python -m tools.post_api \
			
 
				+    --text "要输入的文本" \
			
 
				+    --reference_audio "参考音频路径" \
			
 
				+    --reference_text "参考音频的文本内容" \
			
 
				+    --streaming True
			
 
				+```
			
 
				+
			
 
				+上面的命令表示按照参考音频的信息，合成所需的音频并流式返回.
			
 
				+
			
 
				+如果需要通过`{说话人}`和`{情绪}`随机选择参考音频，那么就根据下列步骤配置：
			
 
				+
			
 
				+### 1. 在项目根目录创建`ref_data`文件夹.
			
 
				+
			
 
				+### 2. 在`ref_data`文件夹内创建类似如下结构的目录.
			
 
				+
			
 
				+```
			
 
				+.
			
 
				+├── SPEAKER1
			
 
				+│    ├──EMOTION1
			
 
				+│    │    ├── 21.15-26.44.lab
			
 
				+│    │    ├── 21.15-26.44.wav
			
 
				+│    │    ├── 27.51-29.98.lab
			
 
				+│    │    ├── 27.51-29.98.wav
			
 
				+│    │    ├── 30.1-32.71.lab
			
 
				+│    │    └── 30.1-32.71.flac
			
 
				+│    └──EMOTION2
			
 
				+│         ├── 30.1-32.71.lab
			
 
				+│         └── 30.1-32.71.mp3
			
 
				+└── SPEAKER2
			
 
				+    └─── EMOTION3
			
 
				+          ├── 30.1-32.71.lab
			
 
				+          └── 30.1-32.71.mp3
			
 
				+```
			
 
				+
			
 
				+也就是`ref_data`里先放`{说话人}`文件夹, 每个说话人下再放`{情绪}`文件夹, 每个情绪文件夹下放任意个`音频-文本对`。
			
 
				+
			
 
				+### 3. 在虚拟环境里输入
			
 
				+
			
 
				+```bash
			
 
				+python tools/gen_ref.py
			
 
				+```
			
 
				+
			
 
				+生成参考目录.
			
 
				+
			
 
				+### 4. 调用 api.
			
 
				+
			
 
				+```bash
			
 
				+python -m tools.post_api \
			
 
				+    --text "要输入的文本" \
			
 
				+    --speaker "说话人1" \
			
 
				+    --emotion "情绪1" \
			
 
				+    --streaming True
			
 
				+```
			
 
				+
			
 
				+以上示例仅供测试.
			
 
				+
			
 
				 ## WebUI 推理
			
 
				 
			
 
				 你可以使用以下命令来启动 WebUI:
			
--- a/install_env.bat
+++ b/install_env.bat
@@ -263,11 +263,11 @@ if not "!install_packages!"=="" (
 
				 
			
 
				         if "!USE_MIRROR!"=="false" (
			
 
				             if "%%p"=="torch" (
			
 
				-                %PIP_CMD% install torch --index-url https://download.pytorch.org/whl/nightly/cu121 --no-warn-script-location
			
 
				+                %PIP_CMD% install torch --index-url https://download.pytorch.org/whl/cu121 --no-warn-script-location
			
 
				             ) else if "%%p"=="torchvision" (
			
 
				-                %PIP_CMD% install torchvision --index-url https://download.pytorch.org/whl/nightly/cu121 --no-warn-script-location
			
 
				+                %PIP_CMD% install torchvision --index-url https://download.pytorch.org/whl/cu121 --no-warn-script-location
			
 
				             ) else if "%%p"=="torchaudio" (
			
 
				-                %PIP_CMD% install torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121 --no-warn-script-location
			
 
				+                %PIP_CMD% install torchaudio --index-url https://download.pytorch.org/whl/cu121 --no-warn-script-location
			
 
				             ) else if "%%p"=="openai-whisper" (
			
 
				                 %PIP_CMD% install openai-whisper --no-warn-script-location
			
 
				             ) else if "%%p"=="fish-speech" (
			
--- a/tools/llama/generate.py
+++ b/tools/llama/generate.py
@@ -606,6 +606,7 @@ def launch_thread_safe_queue(
 
				     type=click.Path(path_type=Path, exists=True),
			
 
				     default="checkpoints/fish-speech-1.2",
			
 
				 )
			
 
				+@click.option("--device", type=str, default="cuda")
			
 
				 @click.option("--compile/--no-compile", default=False)
			
 
				 @click.option("--seed", type=int, default=42)
			
 
				 @click.option("--half/--no-half", default=False)
			
@@ -621,13 +622,13 @@ def main(
 
				     repetition_penalty: float,
			
 
				     temperature: float,
			
 
				     checkpoint_path: Path,
			
 
				+    device: str,
			
 
				     compile: bool,
			
 
				     seed: int,
			
 
				     half: bool,
			
 
				     iterative_prompt: bool,
			
 
				     chunk_length: int,
			
 
				 ) -> None:
			
 
				-    device = "cuda"
			
 
				 
			
 
				     precision = torch.half if half else torch.bfloat16
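Review note: with `--device` exposed as an option (still defaulting to `"cuda"`), a caller on a machine without CUDA has to pass the device explicitly. A minimal sketch of a fallback is shown below. The helper `resolve_device` is hypothetical and is not part of `tools/llama/generate.py`; it only illustrates one way to pick a usable device string.

```python
def resolve_device(requested: str, cuda_available: bool) -> str:
    """Return the device to use, falling back to CPU when CUDA is requested but missing."""
    if requested.startswith("cuda") and not cuda_available:
        return "cpu"
    return requested


if __name__ == "__main__":
    try:
        import torch  # optional here; only used to probe for CUDA

        available = torch.cuda.is_available()
    except ImportError:
        available = False
    print(resolve_device("cuda", available))
```

Passing the availability flag in as an argument keeps the helper testable without a GPU or a PyTorch install.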
			
 
				 
			
--- a/tools/llama/quantize.py
+++ b/tools/llama/quantize.py
@@ -464,7 +464,8 @@ def quantize(checkpoint_path: Path, mode: str, groupsize: int, timestamp: str) -
 
				         dir_name = checkpoint_path
			
 
				         dst_name = Path(f"checkpoints/fs-1.2-int8-{now}")
			
 
				         shutil.copytree(str(dir_name.resolve()), str(dst_name.resolve()))
			
 
				-        (dst_name / vq_model).unlink()
			
 
				+        if (dst_name / vq_model).exists():
			
 
				+            (dst_name / vq_model).unlink()
			
 
				         quantize_path = dst_name / "model.pth"
			
 
				 
			
 
				     elif mode == "int4":
			
@@ -477,7 +478,8 @@ def quantize(checkpoint_path: Path, mode: str, groupsize: int, timestamp: str) -
 
				         dir_name = checkpoint_path
			
 
				         dst_name = Path(f"checkpoints/fs-1.2-int4-g{groupsize}-{now}")
			
 
				         shutil.copytree(str(dir_name.resolve()), str(dst_name.resolve()))
			
 
				-        (dst_name / vq_model).unlink()
			
 
				+        if (dst_name / vq_model).exists():
			
 
				+            (dst_name / vq_model).unlink()
			
 
				         quantize_path = dst_name / "model.pth"
			
 
				 
			
 
				     else:
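Review note: the `exists()` check followed by `unlink()` added in both branches can also be written with `Path.unlink(missing_ok=True)`, available since Python 3.8. This is an alternative, not what the commit does. The helper name `remove_if_present` and the file name in the demo are placeholders.

```python
import tempfile
from pathlib import Path


def remove_if_present(path: Path) -> None:
    """Delete `path` if it exists; do nothing otherwise (Python 3.8+)."""
    path.unlink(missing_ok=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "vq_model.pth"  # placeholder file name
        target.write_bytes(b"")
        remove_if_present(target)  # file exists: removed
        remove_if_present(target)  # file already gone: no FileNotFoundError
        print(target.exists())
```

Besides being shorter, the single call avoids the small window between the existence check and the delete.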
			
--- a/tools/post_api.py
+++ b/tools/post_api.py
@@ -68,6 +68,7 @@ if __name__ == "__main__":
 
				     parser.add_argument(
			
 
				         "--speaker", type=str, default=None, help="Speaker ID for voice synthesis"
			
 
				     )
			
 
				+    parser.add_argument("--emotion", type=str, default=None, help="Speaker's Emotion")
			
 
				     parser.add_argument("--format", type=str, default="wav", help="Audio format")
			
 
				     parser.add_argument(
			
 
				         "--streaming", type=bool, default=False, help="Enable streaming response"
			
@@ -89,6 +90,7 @@ if __name__ == "__main__":
 
				         "repetition_penalty": args.repetition_penalty,
			
 
				         "temperature": args.temperature,
			
 
				         "speaker": args.speaker,
			
 
				+        "emotion": args.emotion,
			
 
				         "format": args.format,
			
 
				         "streaming": args.streaming,
			
 
				     }
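Review note: `--streaming` is declared with `type=bool`, so argparse calls `bool()` on the raw string, and any non-empty value (including `"False"`) parses as `True`. The documented `--streaming True` therefore works, but `--streaming False` would not disable streaming. The sketch below reproduces the behavior and shows one possible converter; `str2bool` and `build_parsers` are hypothetical names, not part of `tools/post_api.py`.

```python
import argparse


def str2bool(value: str) -> bool:
    """Parse common true/false spellings; reject anything else."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes", "y"}:
        return True
    if lowered in {"false", "0", "no", "n"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parsers() -> tuple[argparse.ArgumentParser, argparse.ArgumentParser]:
    # Mirrors the declaration shown in the diff: bool("False") is True.
    naive = argparse.ArgumentParser()
    naive.add_argument("--streaming", type=bool, default=False)

    # Hypothetical alternative using an explicit converter.
    strict = argparse.ArgumentParser()
    strict.add_argument("--streaming", type=str2bool, default=False)
    return naive, strict


if __name__ == "__main__":
    naive, strict = build_parsers()
    print(naive.parse_args(["--streaming", "False"]).streaming)   # True
    print(strict.parse_args(["--streaming", "False"]).streaming)  # False
```

A plain `action="store_true"` flag would be another option, at the cost of changing the documented `--streaming True` invocation.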