1 year ago · 16ea505b2a
--- a/docs/en/finetune.md
+++ b/docs/en/finetune.md
@@ -36,7 +36,7 @@ You need to convert your dataset into the above format and place it under `data`
 Make sure you have downloaded the VQGAN weights. If not, run the following command:
 
 ```bash
-huggingface-cli download fishaudio/fish-speech-1.2 firefly-gan-vq-fsq-4x1024-42hz-generator.pth --local-dir checkpoints
+huggingface-cli download fishaudio/fish-speech-1.2 --local-dir checkpoints/fish-speech-1.2
 ```
 
 You can then run the following command to extract semantic tokens:
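Before extracting tokens, it can help to confirm that the full-repository download put the weights where later commands expect them. The bash sketch below is illustrative only: the helper `check_fish_speech_1_2` is hypothetical, and it checks just two files, `firefly-gan-vq-fsq-4x1024-42hz-generator.pth` (the decoder path used by the inference commands) and `model.pth` (fetched individually by the earlier per-file command), assuming both are still part of the `fishaudio/fish-speech-1.2` snapshot.

```bash
# Hypothetical helper: verify that a fish-speech-1.2 snapshot directory
# contains the two weight files referenced elsewhere in these docs.
# Prints each missing path to stderr and returns non-zero if any is absent.
check_fish_speech_1_2() {
    local dir="${1:-checkpoints/fish-speech-1.2}"
    local missing=0
    local f
    for f in firefly-gan-vq-fsq-4x1024-42hz-generator.pth model.pth; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_fish_speech_1_2 || echo "re-run the huggingface-cli download command"` reports what is missing before any GPU time is spent.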
			
@@ -77,25 +77,27 @@ This command will create `.npy` files in the `data` directory, as shown below:
 ```bash
 python tools/llama/build_dataset.py \
     --input "data" \
-    --output "data/quantized-dataset-ft.protos" \
+    --output "data/protos" \
     --text-extension .lab \
     --num-workers 16
 ```
 
-After the command finishes executing, you should see the `quantized-dataset-ft.protos` file in the `data` directory.
+After the command finishes, you should see the `protos` file in the `data` directory.
 
-### 4. Finally, start the fine-tuning
+### 4. Finally, fine-tune with LoRA
 
 Similarly, make sure you have downloaded the `LLAMA` weights. If not, run the following command:
 
 ```bash
-huggingface-cli download fishaudio/fish-speech-1 text2semantic-sft-medium-v1.1-4k.pth --local-dir checkpoints
+huggingface-cli download fishaudio/fish-speech-1.2 --local-dir checkpoints/fish-speech-1.2
 ```
 
 Finally, you can start the fine-tuning by running the following command:
+
 ```bash
 python fish_speech/train.py --config-name text2semantic_finetune \
-    model@model.model=dual_ar_2_codebook_medium
+    project=$project \
+    +lora@model.model.lora_config=r_8_alpha_16
 ```
 
 !!! note
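The training command above passes `project=$project`, but these steps never set the shell variable `project`, and the merge step reads checkpoints from `results/$project/checkpoints`. A minimal pre-flight sketch in bash, assuming `data/protos` is the output of `build_dataset.py` (it only tests that the path exists, whether file or directory); the function `preflight_finetune` and the run name `my-voice` are hypothetical.

```bash
# Hypothetical pre-flight check for the LoRA fine-tuning command.
# Fails if `project` is empty or the built dataset is missing; otherwise
# prints the checkpoint directory the merge step reads from.
preflight_finetune() {
    local data_root="${1:-data}"
    if [ -z "${project:-}" ]; then
        echo "project is not set; e.g. project=my-voice" >&2
        return 1
    fi
    if [ ! -e "$data_root/protos" ]; then
        echo "missing $data_root/protos; run tools/llama/build_dataset.py first" >&2
        return 1
    fi
    echo "results/$project/checkpoints"
}
```

For example: `project=my-voice; preflight_finetune && python fish_speech/train.py --config-name text2semantic_finetune project=$project +lora@model.model.lora_config=r_8_alpha_16`.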
			
@@ -110,23 +112,14 @@ After training is complete, you can refer to the [inference](inference.md) secti
     By default, the model will only learn the speaker's speech patterns and not the timbre. You still need to use prompts to ensure timbre stability.
     If you want to learn the timbre, you can increase the number of training steps, but this may lead to overfitting.
 
-#### Fine-tuning with LoRA (recommend)
-
-!!! note
-    LoRA can reduce the risk of overfitting in models, but it may also lead to underfitting on large datasets. 
-
-If you want to use LoRA, please add the following parameter: `+lora@model.lora_config=r_8_alpha_16`. 
-
 After training, you need to convert the LoRA weights to regular weights before performing inference.
 
 ```bash
 python tools/llama/merge_lora.py \
-    --llama-config dual_ar_2_codebook_medium \
-    --lora-config r_8_alpha_16 \
-    --llama-weight checkpoints/text2semantic-sft-medium-v1.1-4k.pth \
-    --lora-weight results/text2semantic-finetune-medium-lora/checkpoints/step_000000200.ckpt \
-    --output checkpoints/merged.ckpt
+    --lora-config r_8_alpha_16 \
+    --base-weight checkpoints/fish-speech-1.2 \
+    --lora-weight results/$project/checkpoints/step_000000010.ckpt \
+    --output checkpoints/fish-speech-1.2-yth-lora/
 ```
-
 !!! note
     You may also try other checkpoints. We suggest using the earliest checkpoint that meets your requirements, as they often perform better on out-of-distribution (OOD) data.
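To act on the suggestion to start from the earliest checkpoint, the saved checkpoints can be listed in step order. This bash sketch assumes checkpoints are named with zero-padded steps like `step_000000010.ckpt`, as in the merge command, so a byte-wise sort of the names matches step order; `list_checkpoints` and `earliest_checkpoint` are hypothetical helpers.

```bash
# Hypothetical helper: print LoRA checkpoints for a run, earliest step first.
# Relies on zero-padded names (step_000000010.ckpt, step_000000200.ckpt, ...),
# so a byte-wise (LC_ALL=C) sort of the paths gives numeric step order.
list_checkpoints() {
    local ckpt_dir="${1:-results/${project:-}/checkpoints}"
    find "$ckpt_dir" -maxdepth 1 -type f -name 'step_*.ckpt' | LC_ALL=C sort
}

# Earliest checkpoint: the first candidate to try as --lora-weight.
earliest_checkpoint() {
    list_checkpoints "$@" | head -n 1
}
```

With `project` set as for training, `--lora-weight "$(earliest_checkpoint)"` passes the earliest checkpoint to `merge_lora.py`; trying later ones means walking down the output of `list_checkpoints`.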
			
--- a/docs/en/inference.md
+++ b/docs/en/inference.md
@@ -15,8 +15,7 @@ Inference support command line, HTTP API and web UI.
 Download the required `vqgan` and `llama` models from our Hugging Face repository.
     
 ```bash
-huggingface-cli download fishaudio/fish-speech-1.2 firefly-gan-vq-fsq-4x1024-42hz-generator.pth --local-dir checkpoints
-huggingface-cli download fishaudio/fish-speech-1.2 model.pth --local-dir checkpoints
+huggingface-cli download fishaudio/fish-speech-1.2 --local-dir checkpoints/fish-speech-1.2
 ```
 
 ### 1. Generate prompt from voice:
			
@@ -37,8 +36,7 @@ python tools/llama/generate.py \
     --text "The text you want to convert" \
     --prompt-text "Your reference text" \
     --prompt-tokens "fake.npy" \
-    --config-name dual_ar_2_codebook_medium \
-    --checkpoint-path "checkpoints/model.pth" \
+    --checkpoint-path "checkpoints/fish-speech-1.2" \
     --num-samples 2 \
     --compile
 ```
			
@@ -71,11 +69,13 @@ We provide a HTTP API for inference. You can use the following command to start
 ```bash
 python -m tools.api \
     --listen 0.0.0.0:8000 \
-    --llama-checkpoint-path "checkpoints/model.pth" \
-    --llama-config-name dual_ar_4_codebook_medium \
+    --llama-checkpoint-path "checkpoints/fish-speech-1.2" \
     --decoder-checkpoint-path "checkpoints/fish-speech-1.2/firefly-gan-vq-fsq-4x1024-42hz-generator.pth" \
     --decoder-config-name firefly_gan_vq
+```
 
+To speed up inference, you can add the `--compile` flag.
+
 After that, you can view and test the API at http://127.0.0.1:8000/.  
 
 ## WebUI Inference
@@ -84,8 +84,7 @@ You can start the WebUI using the following command:
 
 ```bash
 python -m tools.webui \
-    --llama-checkpoint-path "checkpoints/model.pth" \
-    --llama-config-name dual_ar_4_codebook_medium \
+    --llama-checkpoint-path "checkpoints/fish-speech-1.2" \
     --decoder-checkpoint-path "checkpoints/fish-speech-1.2/firefly-gan-vq-fsq-4x1024-42hz-generator.pth" \
     --decoder-config-name firefly_gan_vq
 ```
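The `--compile` option for the HTTP API can be switched on without editing the command by building the arguments in a bash array. A minimal sketch: the `COMPILE` variable and the `api_args` helper are hypothetical, while the flags and paths are the ones from the `tools.api` command.

```bash
# Hypothetical helper: assemble the tools.api arguments into the global
# array ARGS, appending --compile only when COMPILE=1.
api_args() {
    ARGS=(
        --listen 0.0.0.0:8000
        --llama-checkpoint-path "checkpoints/fish-speech-1.2"
        --decoder-checkpoint-path "checkpoints/fish-speech-1.2/firefly-gan-vq-fsq-4x1024-42hz-generator.pth"
        --decoder-config-name firefly_gan_vq
    )
    if [ "${COMPILE:-0}" = "1" ]; then
        ARGS+=(--compile)
    fi
}
```

For example, `COMPILE=1; api_args; python -m tools.api "${ARGS[@]}"`. Expanding the array as `"${ARGS[@]}"` keeps each path a single argument even if it contains spaces.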
			
--- a/docs/zh/finetune.md
+++ b/docs/zh/finetune.md
@@ -34,13 +34,13 @@
 Make sure you have downloaded the VQGAN weights. If not, run the following command:
 
 ```bash
-huggingface-cli download fishaudio/fish-speech-1.2 firefly-gan-vq-fsq-4x1024-42hz-generator.pth --local-dir checkpoints
+huggingface-cli download fishaudio/fish-speech-1.2 --local-dir checkpoints/fish-speech-1.2
 ```
 
 Users in mainland China can download through a mirror.
 
 ```bash
-HF_ENDPOINT=https://hf-mirror.com huggingface-cli download fishaudio/fish-speech-1.2 firefly-gan-vq-fsq-4x1024-42hz-generator.pth --local-dir checkpoints
+HF_ENDPOINT=https://hf-mirror.com huggingface-cli download fishaudio/fish-speech-1.2 --local-dir checkpoints/fish-speech-1.2
 ```
 
 You can then run the following command to extract semantic tokens:
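Instead of prefixing each `huggingface-cli` call with `HF_ENDPOINT=https://hf-mirror.com`, the variable can be exported once per shell session, since child processes inherit exported variables. A small bash sketch that keeps any endpoint the user has already set:

```bash
# Keep an endpoint the user already chose; otherwise default to the mirror.
: "${HF_ENDPOINT:=https://hf-mirror.com}"
# Export it so every command started from this shell inherits it.
export HF_ENDPOINT

# Later commands then need no per-command prefix, e.g.:
# huggingface-cli download fishaudio/fish-speech-1.2 --local-dir checkpoints/fish-speech-1.2
```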
			
@@ -80,33 +80,34 @@ python tools/vqgan/extract_vq.py data \
 ```bash
 python tools/llama/build_dataset.py \
     --input "data" \
-    --output "data/quantized-dataset-ft.protos" \
+    --output "data/protos" \
     --text-extension .lab \
     --num-workers 16
 ```
 
-After the command finishes, you should see the `quantized-dataset-ft.protos` file in the `data` directory.
+After the command finishes, you should see the `protos` file in the `data` directory.
 
 
-### 4. Finally, start fine-tuning
+### 4. Finally, fine-tune with LoRA
 
 Similarly, make sure you have downloaded the `LLAMA` weights. If not, run the following command:
 
 ```bash
-huggingface-cli download fishaudio/fish-speech-1.2 model.pth --local-dir checkpoints
+huggingface-cli download fishaudio/fish-speech-1.2 --local-dir checkpoints/fish-speech-1.2
 ```
 
 Users in mainland China can download through a mirror.
 
 ```bash
-HF_ENDPOINT=https://hf-mirror.com huggingface-cli download fishaudio/fish-speech-1.2 model.pth --local-dir checkpoints
+HF_ENDPOINT=https://hf-mirror.com huggingface-cli download fishaudio/fish-speech-1.2 --local-dir checkpoints/fish-speech-1.2
 ```
 
 Finally, you can start fine-tuning by running the following command:
 
 ```bash
 python fish_speech/train.py --config-name text2semantic_finetune \
-    model@model.model=dual_ar_2_codebook_medium
+    project=$project \
+    +lora@model.model.lora_config=r_8_alpha_16
 ```
 
 !!! note
			
@@ -119,23 +120,16 @@ python fish_speech/train.py --config-name text2semantic_finetune \
 
 !!! info
     By default, the model mostly learns the speaker's pronunciation, not the timbre. You still need to use prompts to keep the timbre stable.  
-    If you want the model to learn the timbre, increase the number of training steps, but this may lead to overfitting.
-
-#### Fine-tuning with LoRA (recommended)
-!!! note
-    LoRA can reduce the risk of overfitting, but it may correspondingly lead to underfitting on large datasets.   
-
-If you want to use LoRA, add the following parameter: `+lora@model.lora_config=r_8_alpha_16`.  
+    If you want the model to learn the timbre, increase the number of training steps, but this may lead to overfitting. 
 
 After training, you need to convert the LoRA weights to regular weights before performing inference.
 
 ```bash
 python tools/llama/merge_lora.py \
-    --llama-config dual_ar_2_codebook_medium \
-    --lora-config r_8_alpha_16 \
-    --llama-weight checkpoints/text2semantic-sft-medium-v1.1-4k.pth \
-    --lora-weight results/text2semantic-finetune-medium-lora/checkpoints/step_000000200.ckpt \
-    --output checkpoints/merged.ckpt
+    --lora-config r_8_alpha_16 \
+    --base-weight checkpoints/fish-speech-1.2 \
+    --lora-weight results/$project/checkpoints/step_000000010.ckpt \
+    --output checkpoints/fish-speech-1.2-yth-lora/
 ```
 
 !!! note
			
--- a/docs/zh/inference.md
+++ b/docs/zh/inference.md
@@ -15,15 +15,13 @@
 Download the required `vqgan` and `llama` models from our Hugging Face repository.
     
 ```bash
-huggingface-cli download fishaudio/fish-speech-1.2 firefly-gan-vq-fsq-4x1024-42hz-generator.pth --local-dir checkpoints
-huggingface-cli download fishaudio/fish-speech-1.2 model.pth --local-dir checkpoints
+huggingface-cli download fishaudio/fish-speech-1.2 --local-dir checkpoints/fish-speech-1.2
 ```
 
 Users in mainland China can download through a mirror.
 
 ```bash
-HF_ENDPOINT=https://hf-mirror.com huggingface-cli download fishaudio/fish-speech-1.2 firefly-gan-vq-fsq-4x1024-42hz-generator.pth --local-dir checkpoints
-HF_ENDPOINT=https://hf-mirror.com huggingface-cli download fishaudio/fish-speech-1.2 model.pth --local-dir checkpoints
+HF_ENDPOINT=https://hf-mirror.com huggingface-cli download fishaudio/fish-speech-1.2 --local-dir checkpoints/fish-speech-1.2
 ```
 
 ### 1. Generate a prompt from voice: 
			
@@ -44,8 +42,7 @@ python tools/llama/generate.py \
     --text "The text you want to convert" \
     --prompt-text "Your reference text" \
     --prompt-tokens "fake.npy" \
-    --config-name dual_ar_2_codebook_medium \
-    --checkpoint-path "checkpoints/model.pth" \
+    --checkpoint-path "checkpoints/fish-speech-1.2" \
     --num-samples 2 \
     --compile
 ```
			
@@ -78,11 +75,12 @@ python tools/vqgan/inference.py \
 ```bash
 python -m tools.api \
     --listen 0.0.0.0:8000 \
-    --llama-checkpoint-path "checkpoints/model.pth" \
-    --llama-config-name dual_ar_4_codebook_medium \
+    --llama-checkpoint-path "checkpoints/fish-speech-1.2" \
     --decoder-checkpoint-path "checkpoints/fish-speech-1.2/firefly-gan-vq-fsq-4x1024-42hz-generator.pth" \
     --decoder-config-name firefly_gan_vq
 
+# To speed up inference, you can add the --compile flag.
+
 # Recommended for users in mainland China: start the HTTP service with the following command:
 HF_ENDPOINT=https://hf-mirror.com python -m ...
 ```
			
@@ -95,8 +93,7 @@ HF_ENDPOINT=https://hf-mirror.com python -m ...
 
 ```bash
 python -m tools.webui \
-    --llama-checkpoint-path "checkpoints/model.pth" \
-    --llama-config-name dual_ar_4_codebook_medium \
+    --llama-checkpoint-path "checkpoints/fish-speech-1.2" \
     --decoder-checkpoint-path "checkpoints/fish-speech-1.2/firefly-gan-vq-fsq-4x1024-42hz-generator.pth" \
     --decoder-config-name firefly_gan_vq
 ```