Obviously, when you opened this page, you were not satisfied with the performance of the few-shot pre-trained model. You want to fine-tune a model to improve its performance on your dataset.

In the current version, you only need to fine-tune the `LLAMA` part.

## Fine-tuning LLAMA

### 1. Prepare the dataset

Make sure you have downloaded the VQGAN weights. If not, run the following command:

```bash
huggingface-cli download fishaudio/fish-speech-1.2 firefly-gan-vq-fsq-4x1024-42hz-generator.pth --local-dir checkpoints
```

You can then run the following command to extract semantic tokens:

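Conceptually, this step encodes each audio clip into an array of integer codebook indices and saves it as a `.npy` file next to the audio. The sketch below is illustrative only: `extract_semantic_tokens` and its `encode` callback are hypothetical stand-ins for the repository's extraction tool and VQGAN encoder.

```python
from pathlib import Path

import numpy as np


def extract_semantic_tokens(wav_dir, encode):
    """Encode every .wav under wav_dir and save its token ids as a sibling .npy file."""
    written = []
    for wav in sorted(Path(wav_dir).rglob("*.wav")):
        tokens = np.asarray(encode(wav), dtype=np.int64)  # codebook indices per frame
        out = wav.with_suffix(".npy")
        np.save(out, tokens)
        written.append(out)
    return written
```

In practice the repository's extraction script performs this over the entire `data` directory; the `.npy` token files are what the LLAMA stage trains on.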
By default, the model will only learn the speaker's speech patterns and not the timbre. You still need to use prompts to ensure timbre stability.

If you want to learn the timbre, you can increase the number of training steps, but this may lead to overfitting.

#### Fine-tuning with LoRA (recommended)
!!! note
    LoRA can reduce the risk of overfitting in models, but it may also lead to underfitting on large datasets.
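For intuition, here is a minimal numpy sketch of the LoRA idea (illustrative only, not Fish Speech's implementation): the pretrained weight stays frozen, and only a small low-rank pair of factors is trained on top of it, which is why far fewer parameters can overfit.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 128, 8

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, r))                # B starts at zero -> adapter initially a no-op


def lora_forward(x):
    # base path + low-rank adapter path: W x + B (A x)
    return W @ x + B @ (A @ x)


x = rng.normal(size=d_in)
assert np.allclose(lora_forward(x), W @ x)  # with B = 0, output matches the base model
```

Here only `r * (d_in + d_out) = 1536` parameters are trainable instead of the full `d_out * d_in = 8192`, which is the mechanism behind the note above: less capacity to overfit, but possibly too little for a large dataset.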
```bash
python tools/llama/merge_lora.py \
    --output checkpoints/merged.ckpt
```
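Merging folds the trained low-rank update back into the base weight, so inference needs no adapter code and runs at the original cost. A numpy sketch of the arithmetic (illustrative only, not the script's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 128, 8

W = rng.normal(size=(d_out, d_in))  # frozen base weight
A = rng.normal(size=(r, d_in))      # trained low-rank factors
B = rng.normal(size=(d_out, r))

W_merged = W + B @ A                # merge: a single dense matrix again

x = rng.normal(size=d_in)
adapter_out = W @ x + B @ (A @ x)   # base + adapter, as during training
assert np.allclose(W_merged @ x, adapter_out)  # merged weight gives the same output
```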
!!! note
    You may also try other checkpoints. We suggest using the earliest checkpoint that meets your requirements, as they often perform better on out-of-distribution (OOD) data.