Update documents

Lengyue committed 1 year ago
parent
commit
ad7187ec11

+ 1 - 1
README.md

@@ -8,7 +8,7 @@
 <img alt="QQ" src="https://img.shields.io/badge/QQ Group-%2312B7F5?logo=tencent-qq&logoColor=white&style=flat-square"/>
 </a>
 <a target="_blank" href="https://hub.docker.com/r/lengyue233/fish-speech">
-<img alt="Docker" src="https://img.shields.io/docker/automated/lengyue233/fish-speech&style=flat-square"/>
+<img alt="Docker" src="https://img.shields.io/docker/pulls/lengyue233/fish-speech?style=flat-square&logo=docker"/>
 </a>
 </div>
 

BIN
docs/assets/figs/diagram.png


+ 28 - 37
docs/en/finetune.md

@@ -26,19 +26,19 @@ Obviously, when you opened this page, you were not satisfied with the performanc
     └── 38.79-40.85.mp3
 ```
 
-You need to format your dataset as shown above and place it under `data/demo`. Audio files can have `.mp3`, `.wav`, or `.flac` extensions.
+You need to format your dataset as shown above and place it under `data`. Audio files can have `.mp3`, `.wav`, or `.flac` extensions.
 
 ### 2. Split Training and Validation Sets
 
 ```bash
-python tools/vqgan/create_train_split.py data/demo
+python tools/vqgan/create_train_split.py data
 ```
 
-This command will create `data/demo/vq_train_filelist.txt` and `data/demo/vq_val_filelist.txt` in the `data/demo` directory, to be used for training and validation respectively.
+This command will create `data/vq_train_filelist.txt` and `data/vq_val_filelist.txt` in the `data` directory, to be used for training and validation respectively.
 
 !!! info
     For the VITS format, you can specify a file list using `--filelist xxx.list`.  
-    Please note that the audio files in `filelist` must also be located in the `data/demo` folder.
+    Please note that the audio files in `filelist` must also be located in the `data` folder.
 
 ### 3. Start Training
 
@@ -77,33 +77,38 @@ You can review `fake.wav` to assess the fine-tuning results.
     └── 38.79-40.85.mp3
 ```
 
-You need to convert your dataset into the above format and place it under `data/demo`. The audio file can have the extensions `.mp3`, `.wav`, or `.flac`, and the annotation file can have the extensions `.lab` or `.txt`.
+You need to convert your dataset into the above format and place it under `data`. The audio files can have the extensions `.mp3`, `.wav`, or `.flac`, and the annotation files should have the extension `.lab`.
+
+!!! warning
+    It's recommended to apply loudness normalization to the dataset. You can use [fish-audio-preprocess](https://github.com/fishaudio/audio-preprocess) to do this.
+
+    ```bash
+    fap loudness-norm data-raw data --clean
+    ```
 
-!!! note
-    You can modify the dataset path and mix datasets by modifying `fish_speech/configs/data/finetune.yaml`.
 
 ### 2. Batch extraction of semantic tokens
 
 Make sure you have downloaded the VQGAN weights. If not, run the following command:
 
 ```bash
-huggingface-cli download fishaudio/speech-lm-v1 vqgan-v1.pth --local-dir checkpoints
+huggingface-cli download fishaudio/fish-speech-1 vq-gan-group-fsq-2x1024.pth --local-dir checkpoints
 ```
 
 You can then run the following command to extract semantic tokens:
 
 ```bash
-python tools/vqgan/extract_vq.py data/demo \
+python tools/vqgan/extract_vq.py data \
     --num-workers 1 --batch-size 16 \
     --config-name "vqgan_pretrain" \
-    --checkpoint-path "checkpoints/vqgan-v1.pth"
+    --checkpoint-path "checkpoints/vq-gan-group-fsq-2x1024.pth"
 ```
 
 !!! note
     You can adjust `--num-workers` and `--batch-size` to increase extraction speed, but please make sure not to exceed your GPU memory limit.  
     For the VITS format, you can specify a file list using `--filelist xxx.list`.
 
-This command will create `.npy` files in the `data/demo` directory, as shown below:
+This command will create `.npy` files in the `data` directory, as shown below:
 
 ```
 .
@@ -127,8 +132,10 @@ This command will create `.npy` files in the `data/demo` directory, as shown bel
 
 ```bash
 python tools/llama/build_dataset.py \
-    --config "fish_speech/configs/data/finetune.yaml" \
-    --output "data/quantized-dataset-ft.protos"
+    --input "data" \
+    --output "data/quantized-dataset-ft.protos" \
+    --text-extension .lab \
+    --num-workers 16
 ```
 
 After the command finishes executing, you should see the `quantized-dataset-ft.protos` file in the `data` directory.
@@ -136,45 +143,29 @@ After the command finishes executing, you should see the `quantized-dataset-ft.p
 !!! info
     For the VITS format, you can specify a file list using `--filelist xxx.list`.
 
-### 4. Start the Rust data server
-
-Loading and shuffling the dataset is very slow and memory-consuming. Therefore, we use a Rust server to load and shuffle the data. This server is based on GRPC and can be installed using the following method:
-
-```bash
-cd data_server
-cargo build --release
-```
-
-After the compilation is complete, you can start the server using the following command:
-
-```bash
-export RUST_LOG=info # Optional, for debugging
-data_server/target/release/data_server \
-    --files "data/quantized-dataset-ft.protos" 
-```
-
-!!! note
-    You can specify multiple `--files` parameters to load multiple datasets.
-
-### 5. Finally, start the fine-tuning
+### 4. Finally, start the fine-tuning
 
 Similarly, make sure you have downloaded the `LLAMA` weights. If not, run the following command:
 
 ```bash
-huggingface-cli download fishaudio/speech-lm-v1 text2semantic-400m-v0.2-4k.pth --local-dir checkpoints
+huggingface-cli download fishaudio/fish-speech-1 text2semantic-large-v1-4k.pth --local-dir checkpoints
 ```
 
 Finally, you can start the fine-tuning by running the following command:
 ```bash
-python fish_speech/train.py --config-name text2semantic_finetune
+python fish_speech/train.py --config-name text2semantic_finetune \
+    model@model.model=dual_ar_2_codebook_large
 ```
 
 !!! info
-    If you want to use lora, please use `--config-name text2semantic_finetune_lora` to start fine-tuning.
+    If you want to use LoRA, please use `--config-name text2semantic_finetune_lora` to start fine-tuning (still under development).
 
 !!! note
     You can modify the training parameters such as `batch_size`, `gradient_accumulation_steps`, etc. to fit your GPU memory by modifying `fish_speech/configs/text2semantic_finetune.yaml`.
 
+!!! note
+    For Windows users, you can use `trainer.strategy.process_group_backend=gloo` to avoid `nccl` issues.
+
 After training is complete, you can refer to the [inference](inference.md) section, and use `--speaker SPK1` to generate speech.
 
 !!! info

+ 19 - 5
docs/en/index.md

@@ -1,5 +1,17 @@
 # Introduction
 
+<div>
+<a target="_blank" href="https://discord.gg/Es5qTB9BcN">
+<img alt="Discord" src="https://img.shields.io/discord/1214047546020728892?color=%23738ADB&label=Discord&logo=discord&logoColor=white&style=flat-square"/>
+</a>
+<a target="_blank" href="http://qm.qq.com/cgi-bin/qm/qr?_wv=1027&k=jCKlUP7QgSm9kh95UlBoYv6s1I-Apl1M&authKey=xI5ttVAp3do68IpEYEalwXSYZFdfxZSkah%2BctF5FIMyN2NqAa003vFtLqJyAVRfF&noverify=0&group_code=593946093">
+<img alt="QQ" src="https://img.shields.io/badge/QQ Group-%2312B7F5?logo=tencent-qq&logoColor=white&style=flat-square"/>
+</a>
+<a target="_blank" href="https://hub.docker.com/r/lengyue233/fish-speech">
+<img alt="Docker" src="https://img.shields.io/docker/pulls/lengyue233/fish-speech?style=flat-square&logo=docker"/>
+</a>
+</div>
+
 !!! warning
     We assume no responsibility for any illegal use of the codebase. Please refer to the local laws regarding DMCA (Digital Millennium Copyright Act) and other relevant laws in your area.
 
@@ -10,10 +22,10 @@ This codebase is released under the `BSD-3-Clause` license, and all models are r
 </p>
 
 ## Requirements
-- GPU Memory: 2GB (for inference), 16GB (for fine-tuning)
-- System: Linux (full functionality), Windows (inference only, no support for `torch.compile`)
+- GPU Memory: 4GB (for inference), 16GB (for fine-tuning)
+- System: Linux, Windows
 
-Therefore, we strongly recommend Windows users to use WSL2 or docker to run the codebase.
+We recommend that Windows users use WSL2 or Docker to run the codebase, or use the integrated environment developed by the community.
 
 ## Setup
 ```bash
@@ -21,8 +33,8 @@ Therefore, we strongly recommend Windows users to use WSL2 or docker to run the
 conda create -n fish-speech python=3.10
 conda activate fish-speech
 
-# Install pytorch nightly
-pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121
+# Install pytorch
+pip3 install torch torchvision torchaudio
 
 # Install fish-speech
 pip3 install -e .
@@ -30,6 +42,7 @@ pip3 install -e .
 
 ## Changelog
 
+- 2024/04/22: Finished the Fish-Speech 1.0 release; significantly reworked the VQGAN and LLAMA models.
 - 2023/12/28: Added `lora` fine-tuning support.
 - 2023/12/27: Added `gradient checkpointing`, `causal sampling`, and `flash-attn` support.
 - 2023/12/19: Updated webui and HTTP API.
@@ -44,3 +57,4 @@ pip3 install -e .
 - [MQTTS](https://github.com/b04901014/MQTTS)
 - [GPT Fast](https://github.com/pytorch-labs/gpt-fast)
 - [Transformers](https://github.com/huggingface/transformers)
+- [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS)

+ 17 - 13
docs/en/inference.md

@@ -15,8 +15,8 @@ Inference support command line, HTTP API and web UI.
 Download the required `vqgan` and `text2semantic` models from our Hugging Face repository.
     
 ```bash
-huggingface-cli download fishaudio/speech-lm-v1 vqgan-v1.pth --local-dir checkpoints
-huggingface-cli download fishaudio/speech-lm-v1 text2semantic-400m-v0.2-4k.pth --local-dir checkpoints
+huggingface-cli download fishaudio/fish-speech-1 vq-gan-group-fsq-2x1024.pth --local-dir checkpoints
+huggingface-cli download fishaudio/fish-speech-1 text2semantic-large-v1-4k.pth --local-dir checkpoints
 ```
 
 ### 1. Generate prompt from voice:
@@ -27,7 +27,7 @@ huggingface-cli download fishaudio/speech-lm-v1 text2semantic-400m-v0.2-4k.pth -
 ```bash
 python tools/vqgan/inference.py \
     -i "paimon.wav" \
-    --checkpoint-path "checkpoints/vqgan-v1.pth"
+    --checkpoint-path "checkpoints/vq-gan-group-fsq-2x1024.pth"
 ```
 You should get a `fake.npy` file.
 
@@ -37,7 +37,8 @@ python tools/llama/generate.py \
     --text "The text you want to convert" \
     --prompt-text "Your reference text" \
     --prompt-tokens "fake.npy" \
-    --checkpoint-path "checkpoints/text2semantic-400m-v0.2-4k.pth" \
+    --config-name dual_ar_2_codebook_large \
+    --checkpoint-path "checkpoints/text2semantic-large-v1-4k.pth" \
     --num-samples 2 \
     --compile
 ```
@@ -59,7 +60,7 @@ This command will create a `codes_N` file in the working directory, where N is a
 ```bash
 python tools/vqgan/inference.py \
     -i "codes_0.npy" \
-    --checkpoint-path "checkpoints/vqgan-v1.pth"
+    --checkpoint-path "checkpoints/vq-gan-group-fsq-2x1024.pth"
 ```
 
 ## HTTP API Inference
@@ -67,21 +68,24 @@ python tools/vqgan/inference.py \
 We provide an HTTP API for inference. You can use the following command to start the server:
 
 ```bash
-python -m tools.api --listen 0.0.0.0:8000
+python -m tools.api \
+    --listen 0.0.0.0:8000 \
+    --llama-checkpoint-path "checkpoints/text2semantic-large-v1-4k.pth" \
+    --llama-config-name dual_ar_2_codebook_large \
+    --vqgan-checkpoint-path "checkpoints/vq-gan-group-fsq-2x1024.pth"
 ```
 
-After that, you can view and test the API at http://127.0.0.1:8000/docs.  
-
-Generally, you need to first call PUT /v1/models/default to load the model, and then use POST /v1/models/default/invoke for inference. For specific parameters, please refer to the API documentation.
+After that, you can view and test the API at http://127.0.0.1:8000/.  
 
 ## WebUI Inference
 
-Before running the WebUI, you need to start the HTTP service as described above.
-
-Then you can start the WebUI using the following command:
+You can start the WebUI using the following command:
 
 ```bash
-python fish_speech/webui/app.py
+python -m tools.webui \
+    --llama-checkpoint-path "checkpoints/text2semantic-large-v1-4k.pth" \
+    --llama-config-name dual_ar_2_codebook_large \
+    --vqgan-checkpoint-path "checkpoints/vq-gan-group-fsq-2x1024.pth"
 ```
 
 Enjoy!

+ 1 - 1
docs/en/samples.md

@@ -1,7 +1,7 @@
 # Samples
 
 !!! note
-    Due to insufficient Japanese to English training data, we first phonemicize the text and then use it for generation.
+    Samples on this page were generated with version 0.4 and have not been updated.
 
 ## Chinese Sentence 1
 ```

+ 22 - 40
docs/zh/finetune.md

@@ -26,19 +26,19 @@
     └── 38.79-40.85.mp3
 ```
 
-You need to convert your dataset into the above format and place it under `data/demo`. Audio files can have `.mp3`, `.wav`, or `.flac` extensions.
+You need to convert your dataset into the above format and place it under `data`. Audio files can have `.mp3`, `.wav`, or `.flac` extensions.
 
 ### 2. Split Training and Validation Sets
 
 ```bash
-python tools/vqgan/create_train_split.py data/demo
+python tools/vqgan/create_train_split.py data
 ```
 
-This command will create `data/demo/vq_train_filelist.txt` and `data/demo/vq_val_filelist.txt` in the `data/demo` directory, to be used for training and validation respectively.
+This command will create `data/vq_train_filelist.txt` and `data/vq_val_filelist.txt` in the `data` directory, to be used for training and validation respectively.

 !!! info
     For the VITS format, you can specify a file list using `--filelist xxx.list`.
-    Please note that the audio files in `filelist` must also be located in the `data/demo` folder.
+    Please note that the audio files in `filelist` must also be located in the `data` folder.
 
 ### 3. Start Training
 
@@ -77,15 +77,12 @@ python tools/vqgan/inference.py -i test.wav --checkpoint-path results/vqgan_fine
     └── 38.79-40.85.mp3
 ```
 
-You need to convert your dataset into the above format and place it under `data/demo`. Audio files can have `.mp3`, `.wav`, or `.flac` extensions, and annotation files can have `.lab` or `.txt` extensions.
-
-!!! note
-    You can modify the dataset path and mix datasets by editing `fish_speech/configs/data/finetune.yaml`.
+You need to convert your dataset into the above format and place it under `data`. Audio files can have `.mp3`, `.wav`, or `.flac` extensions, and `.lab` is the recommended annotation extension.

 !!! warning
     It is recommended to apply loudness normalization to the dataset first. You can use [fish-audio-preprocess](https://github.com/fishaudio/audio-preprocess) to do this.
     ```bash
-    fap loudness-norm demo-raw demo --clean
+    fap loudness-norm data-raw data --clean
     ```
 
 ### 2. Batch extraction of semantic tokens
@@ -93,29 +90,29 @@ python tools/vqgan/inference.py -i test.wav --checkpoint-path results/vqgan_fine
 Make sure you have downloaded the vqgan weights. If not, run the following command:
 
 ```bash
-huggingface-cli download fishaudio/speech-lm-v1 vqgan-v1.pth --local-dir checkpoints
+huggingface-cli download fishaudio/fish-speech-1 vq-gan-group-fsq-2x1024.pth --local-dir checkpoints
 ```
 
 Users in mainland China can download via a mirror.
 
 ```bash
-HF_ENDPOINT=https://hf-mirror.com huggingface-cli download fishaudio/speech-lm-v1 vqgan-v1.pth --local-dir checkpoints
+HF_ENDPOINT=https://hf-mirror.com huggingface-cli download fishaudio/fish-speech-1 vq-gan-group-fsq-2x1024.pth --local-dir checkpoints
 ```
 
 Then you can run the following command to extract semantic tokens:
 
 ```bash
-python tools/vqgan/extract_vq.py data/demo \
+python tools/vqgan/extract_vq.py data \
     --num-workers 1 --batch-size 16 \
     --config-name "vqgan_pretrain" \
-    --checkpoint-path "checkpoints/vqgan-v1.pth"
+    --checkpoint-path "checkpoints/vq-gan-group-fsq-2x1024.pth"
 ```
 
 !!! note
     You can adjust `--num-workers` and `--batch-size` to increase extraction speed, but please make sure not to exceed your GPU memory limit.
     For the VITS format, you can specify a file list using `--filelist xxx.list`.
 
-This command will create `.npy` files in the `data/demo` directory, as shown below:
+This command will create `.npy` files in the `data` directory, as shown below:
 
 ```
 .
@@ -139,8 +136,9 @@ python tools/vqgan/extract_vq.py data/demo \
 
 ```bash
 python tools/llama/build_dataset.py \
-    --config "fish_speech/configs/data/finetune.yaml" \
+    --input "data" \
     --output "data/quantized-dataset-ft.protos" \
+    --text-extension .lab \
     --num-workers 16
 ```
 
@@ -149,52 +147,36 @@ python tools/llama/build_dataset.py \
 !!! note
     For the VITS format, you can specify a file list using `--filelist xxx.list`.
 
-### 4. Start the Rust data server
-
-Since loading and shuffling the dataset is very slow and memory-consuming, we use a Rust server to load and shuffle the data. The server is based on GRPC and can be installed as follows:
-
-```bash
-cd data_server
-cargo build --release
-```
-
-After the compilation is complete, you can start the server using the following command:
-
-```bash
-export RUST_LOG=info # Optional, for debugging
-data_server/target/release/data_server \
-    --files "data/quantized-dataset-ft.protos" 
-```
-
-!!! note
-    You can specify multiple `--files` parameters to load multiple datasets.
-
-### 5. Finally, start the fine-tuning
+### 4. Finally, start the fine-tuning
 
 Similarly, make sure you have downloaded the `LLAMA` weights. If not, run the following command:
 
 ```bash
-huggingface-cli download fishaudio/speech-lm-v1 text2semantic-400m-v0.2-4k.pth --local-dir checkpoints
+huggingface-cli download fishaudio/fish-speech-1 text2semantic-large-v1-4k.pth --local-dir checkpoints
 ```
 
 Users in mainland China can download via a mirror.
 
 ```bash
-HF_ENDPOINT=https://hf-mirror.com huggingface-cli download fishaudio/speech-lm-v1 text2semantic-400m-v0.2-4k.pth --local-dir checkpoints
+HF_ENDPOINT=https://hf-mirror.com huggingface-cli download fishaudio/fish-speech-1 text2semantic-large-v1-4k.pth --local-dir checkpoints
 ```
 
 Finally, you can run the following command to start the fine-tuning:
 
 ```bash
-python fish_speech/train.py --config-name text2semantic_finetune
+python fish_speech/train.py --config-name text2semantic_finetune \
+    model@model.model=dual_ar_2_codebook_large
 ```
 
 !!! note
-    If you want to use LoRA, please use `--config-name text2semantic_finetune_lora` to start fine-tuning.
+    If you want to use LoRA, please use `--config-name text2semantic_finetune_lora` to start fine-tuning (still under development).
 
 !!! note
     You can modify training parameters such as `batch_size` and `gradient_accumulation_steps` in `fish_speech/configs/text2semantic_finetune.yaml` to fit your GPU memory.
 
+!!! note
+    For Windows users, you can use `trainer.strategy.process_group_backend=gloo` to avoid `nccl` issues.
+
 After training is complete, you can refer to the [inference](inference.md) section and pass `--speaker SPK1` to test your model.
 
 !!! info

+ 17 - 4
docs/zh/index.md

@@ -1,5 +1,17 @@
 # Introduction
 
+<div>
+<a target="_blank" href="https://discord.gg/Es5qTB9BcN">
+<img alt="Discord" src="https://img.shields.io/discord/1214047546020728892?color=%23738ADB&label=Discord&logo=discord&logoColor=white&style=flat-square"/>
+</a>
+<a target="_blank" href="http://qm.qq.com/cgi-bin/qm/qr?_wv=1027&k=jCKlUP7QgSm9kh95UlBoYv6s1I-Apl1M&authKey=xI5ttVAp3do68IpEYEalwXSYZFdfxZSkah%2BctF5FIMyN2NqAa003vFtLqJyAVRfF&noverify=0&group_code=593946093">
+<img alt="QQ" src="https://img.shields.io/badge/QQ Group-%2312B7F5?logo=tencent-qq&logoColor=white&style=flat-square"/>
+</a>
+<a target="_blank" href="https://hub.docker.com/r/lengyue233/fish-speech">
+<img alt="Docker" src="https://img.shields.io/docker/pulls/lengyue233/fish-speech?style=flat-square&logo=docker"/>
+</a>
+</div>
+
 !!! warning
     We assume no responsibility for any illegal use of the codebase. Please refer to the local laws regarding DMCA (Digital Millennium Copyright Act) and other relevant laws in your area.
 
@@ -10,10 +22,10 @@
 </p>
 
 ## Requirements
-- GPU Memory: 2GB (for inference), 16GB (for fine-tuning)
-- System: Linux (full functionality), Windows (inference only, no support for `torch.compile`)
+- GPU Memory: 4GB (for inference), 16GB (for fine-tuning)
+- System: Linux, Windows

-Therefore, we strongly recommend Windows users to use WSL2 or docker to run the codebase.
+We recommend that Windows users use WSL2 or Docker to run the codebase, or use the integrated environment developed by the community.
 
 ## Setup
 ```bash
@@ -21,7 +33,7 @@
 conda create -n fish-speech python=3.10
 conda activate fish-speech
 
-# Install pytorch version
+# Install pytorch
 pip3 install torch torchvision torchaudio
 
 # Install fish-speech
@@ -45,3 +57,4 @@ pip3 install -e .
 - [MQTTS](https://github.com/b04901014/MQTTS)
 - [GPT Fast](https://github.com/pytorch-labs/gpt-fast)
 - [Transformers](https://github.com/huggingface/transformers)
+- [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS)

+ 21 - 23
docs/zh/inference.md

@@ -15,13 +15,13 @@
 Download the required `vqgan` and `text2semantic` models from our huggingface repository.
     
 ```bash
-huggingface-cli download fishaudio/speech-lm-v1 vqgan-v1.pth --local-dir checkpoints
-huggingface-cli download fishaudio/speech-lm-v1 text2semantic-400m-v0.2-4k.pth --local-dir checkpoints
+huggingface-cli download fishaudio/fish-speech-1 vq-gan-group-fsq-2x1024.pth --local-dir checkpoints
+huggingface-cli download fishaudio/fish-speech-1 text2semantic-large-v1-4k.pth --local-dir checkpoints
 ```
 Users in mainland China can download via a mirror.
 ```bash
-HF_ENDPOINT=https://hf-mirror.com huggingface-cli download fishaudio/speech-lm-v1 vqgan-v1.pth --local-dir checkpoints
-HF_ENDPOINT=https://hf-mirror.com huggingface-cli download fishaudio/speech-lm-v1 text2semantic-400m-v0.2-4k.pth --local-dir checkpoints
+HF_ENDPOINT=https://hf-mirror.com huggingface-cli download fishaudio/fish-speech-1 vq-gan-group-fsq-2x1024.pth --local-dir checkpoints
+HF_ENDPOINT=https://hf-mirror.com huggingface-cli download fishaudio/fish-speech-1 text2semantic-large-v1-4k.pth --local-dir checkpoints
 ```
 
 ### 1. Generate prompt from voice:
@@ -32,7 +32,7 @@ HF_ENDPOINT=https://hf-mirror.com huggingface-cli download fishaudio/speech-lm-v
 ```bash
 python tools/vqgan/inference.py \
     -i "paimon.wav" \
-    --checkpoint-path "checkpoints/vqgan-v1.pth"
+    --checkpoint-path "checkpoints/vq-gan-group-fsq-2x1024.pth"
 ```
 You should get a `fake.npy` file.
 
@@ -42,7 +42,8 @@ python tools/llama/generate.py \
     --text "The text you want to convert" \
     --prompt-text "Your reference text" \
     --prompt-tokens "fake.npy" \
-    --checkpoint-path "checkpoints/text2semantic-400m-v0.2-4k.pth" \
+    --config-name dual_ar_2_codebook_large \
+    --checkpoint-path "checkpoints/text2semantic-large-v1-4k.pth" \
     --num-samples 2 \
     --compile
 ```
@@ -64,7 +65,7 @@ python tools/llama/generate.py \
 ```bash
 python tools/vqgan/inference.py \
     -i "codes_0.npy" \
-    --checkpoint-path "checkpoints/vqgan-v1.pth"
+    --checkpoint-path "checkpoints/vq-gan-group-fsq-2x1024.pth"
 ```
 
 ## HTTP API Inference
@@ -72,30 +73,27 @@ python tools/vqgan/inference.py \
 Run the following command to start the HTTP server:
 
 ```bash
-python -m tools.api --listen 0.0.0.0:8000
+python -m tools.api \
+    --listen 0.0.0.0:8000 \
+    --llama-checkpoint-path "checkpoints/text2semantic-large-v1-4k.pth" \
+    --llama-config-name dual_ar_2_codebook_large \
+    --vqgan-checkpoint-path "checkpoints/vq-gan-group-fsq-2x1024.pth"
+
 # Users in mainland China are recommended to run the following command to start the HTTP server:
-HF_ENDPOINT=https://hf-mirror.com python -m tools.api --listen 0.0.0.0:8000
+HF_ENDPOINT=https://hf-mirror.com python -m ...
 ```
 
-After that, you can view and test the API at `http://127.0.0.1:8000/docs`.
-Generally, you need to first call `PUT /v1/models/default` to load the model, then call `POST /v1/models/default/invoke` for inference.
-For specific parameters, please refer to the API documentation.
+After that, you can view and test the API at `http://127.0.0.1:8000/`.
 
 ## WebUI Inference
 
-Before running the WebUI, you need to start the HTTP service first, as described above.
-
-Then you can start the WebUI using the following command:
-
-```bash
-python fish_speech/webui/app.py
-```
-
-Or start the WebUI with parameters:
+You can start the WebUI using the following command:
 
 ```bash
-# Start with temporary environment variables:
-GRADIO_SERVER_NAME=127.0.0.1 GRADIO_SERVER_PORT=7860 python fish_speech/webui/app.py
+python -m tools.webui \
+    --llama-checkpoint-path "checkpoints/text2semantic-large-v1-4k.pth" \
+    --llama-config-name dual_ar_2_codebook_large \
+    --vqgan-checkpoint-path "checkpoints/vq-gan-group-fsq-2x1024.pth"
 ```
 
 Have fun, everyone!

+ 1 - 1
docs/zh/samples.md

@@ -1,7 +1,7 @@
 # Samples
 
 !!! note
-    Due to insufficient Japanese and English training data, we first phonemicize the text and then use it for generation.
+    Samples on this page were generated with version 0.4 and have not been updated.
 
 ## Chinese Sentence 1
 ```

+ 1 - 1
fish_speech/configs/text2semantic_finetune.yaml

@@ -5,7 +5,7 @@ defaults:
 
 project: text2semantic_finetune_dual_ar
 max_length: 2048
-ckpt_path: checkpoints/text2semantic-medium-v1-2k.pth
+ckpt_path: checkpoints/text2semantic-large-v1-4k.pth
 resume_weights_only: true
 
 # Lightning Trainer

+ 2 - 2
fish_speech/configs/text2semantic_sft.yaml

@@ -30,7 +30,7 @@ tokenizer:
 train_dataset:
   _target_: fish_speech.datasets.text.AutoAugTextDataset
   proto_files:
-    - data/protos/sft/train
+    - data/protos/sft
   tokenizer: ${tokenizer}
   max_length: ${max_length}
   num_codebooks: ${model.model.config.num_codebooks}
@@ -40,7 +40,7 @@ train_dataset:
 val_dataset:
   _target_: fish_speech.datasets.text.AutoAugTextDataset
   proto_files:
-    - data/protos/sft/test
+    - data/protos/sft
   tokenizer: ${tokenizer}
   max_length: ${max_length}
   num_codebooks: ${model.model.config.num_codebooks}

+ 5 - 6
fish_speech/configs/vqgan_finetune.yaml

@@ -14,8 +14,6 @@ trainer:
   max_steps: 100_000
   val_check_interval: 5000
   strategy:
-    _target_: lightning.pytorch.strategies.DDPStrategy
-    process_group_backend: nccl  # This should be override when training on windows
     find_unused_parameters: true
 
 sample_rate: 44100
@@ -23,19 +21,18 @@ hop_length: 512
 num_mels: 128
 n_fft: 2048
 win_length: 2048
-freeze_encoder: true
 
 # Dataset Configuration
 train_dataset:
   _target_: fish_speech.datasets.vqgan.VQGANDataset
-  filelist: data/filelist.train.txt
+  filelist: data/vq_train_filelist.txt
   sample_rate: ${sample_rate}
   hop_length: ${hop_length}
   slice_frames: 512
 
 val_dataset:
   _target_: fish_speech.datasets.vqgan.VQGANDataset
-  filelist: data/filelist.val.txt
+  filelist: data/vq_val_filelist.txt
   sample_rate: ${sample_rate}
   hop_length: ${hop_length}
 
@@ -55,7 +52,9 @@ model:
   weight_adv: 0.2
   weight_vq: 1.0
   weight_mel: 1.0
-  freeze_encoder: false
+
+  # Important: set freeze_encoder to true to train only the decoder
+  freeze_encoder: true
 
   encoder:
     _target_: fish_speech.models.vqgan.modules.wavenet.WaveNet
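Editor's note: the `freeze_encoder: true` change means gradients flow only through the decoder during fine-tuning. In PyTorch terms this usually amounts to something like the following sketch (an illustrative toy model, not the repo's actual implementation):

```python
from torch import nn

class TinyVQGAN(nn.Module):
    """Toy stand-in for an encoder/decoder model (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 4)
        self.decoder = nn.Linear(4, 8)

def freeze_encoder(model: nn.Module) -> None:
    # With requires_grad off, the optimizer never updates the encoder's
    # parameters, so only the decoder is trained.
    for p in model.encoder.parameters():
        p.requires_grad_(False)

model = TinyVQGAN()
freeze_encoder(model)
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
```

Freezing the encoder keeps the learned audio representation stable while the decoder adapts to the target voice, which also lowers memory use since no encoder gradients are stored.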

+ 3 - 3
tools/api.py

@@ -144,10 +144,10 @@ def api_invoke_model(
     Invoke model and generate audio
     """
 
-    if args.max_gradio_length > 0 and len(req.text) > args.max_gradio_length:
+    if args.max_text_length > 0 and len(req.text) > args.max_text_length:
         raise HTTPException(
             HTTPStatus.BAD_REQUEST,
-            content=f"Text is too long, max length is {args.max_gradio_length}",
+            content=f"Text is too long, max length is {args.max_text_length}",
         )
 
     try:
@@ -208,7 +208,7 @@ def parse_args():
     parser.add_argument("--half", action="store_true")
     parser.add_argument("--max-length", type=int, default=2048)
     parser.add_argument("--compile", action="store_true")
-    parser.add_argument("--max-gradio-length", type=int, default=0)
+    parser.add_argument("--max-text-length", type=int, default=0)
     parser.add_argument("--listen", type=str, default="127.0.0.1:8000")
 
     return parser.parse_args()
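Editor's note: the rename from `--max-gradio-length` to `--max-text-length` makes the flag's intent explicit: it caps request text length at the API layer, with `0` disabling the check. A reduced sketch of that pattern using only the standard library (the FastAPI request/response plumbing is omitted):

```python
import argparse
from typing import Optional

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # 0 keeps the length check disabled, matching the default in tools/api.py
    parser.add_argument("--max-text-length", type=int, default=0)
    return parser.parse_args(argv)

def check_text_length(text: str, max_text_length: int) -> Optional[str]:
    """Return an error message if `text` exceeds the limit, else None."""
    if max_text_length > 0 and len(text) > max_text_length:
        return f"Text is too long, max length is {max_text_length}"
    return None
```

In the real server the non-`None` message becomes an HTTP 400 response; rejecting oversized text up front avoids wasting GPU time on requests that would run past `--max-length` anyway.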

+ 3 - 3
tools/vqgan/extract_vq.py

@@ -42,7 +42,7 @@ logger.add(sys.stderr, format=logger_format)
 @lru_cache(maxsize=1)
 def get_model(
     config_name: str = "vqgan_pretrain",
-    checkpoint_path: str = "checkpoints/vqgan/step_000380000.ckpt",
+    checkpoint_path: str = "checkpoints/vq-gan-group-fsq-2x1024.pth",
 ):
     with initialize(version_base="1.3", config_path="../../fish_speech/configs"):
         cfg = compose(config_name=config_name)
@@ -123,7 +123,7 @@ def process_batch(files: list[Path], model) -> float:
 @click.option("--config-name", default="vqgan_pretrain")
 @click.option(
     "--checkpoint-path",
-    default="checkpoints/vq-gan-group-fsq-8x1024-wn-20x768-30kh.pth",
+    default="checkpoints/vq-gan-group-fsq-2x1024.pth",
 )
 @click.option("--batch-size", default=64)
 @click.option("--filelist", default=None, type=Path)
@@ -174,7 +174,7 @@ def main(
         files = list_files(folder, AUDIO_EXTENSIONS, recursive=True, sort=False)
 
     print(f"Found {len(files)} files")
-    # files = [Path(f) for f in files if not Path(f).with_suffix(".npy").exists()]
+    files = [Path(f) for f in files if not Path(f).with_suffix(".npy").exists()]
 
     total_files = len(files)
     files = files[RANK::WORLD_SIZE]
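Editor's note: the re-enabled `.npy` filter together with the `files[RANK::WORLD_SIZE]` slice gives extraction a simple resume-and-shard behavior: already-processed files are skipped, and the remainder is striped across workers. A standalone sketch of the two steps (with `rank`/`world_size` as hypothetical stand-ins for the script's `RANK`/`WORLD_SIZE`):

```python
from pathlib import Path

def shard_pending_files(files, rank: int, world_size: int) -> list:
    """Skip files whose `.npy` output already exists, then take this worker's
    stripe: every `world_size`-th remaining file, starting at index `rank`."""
    pending = [Path(f) for f in files if not Path(f).with_suffix(".npy").exists()]
    return pending[rank::world_size]
```

Because the skip check runs before sharding, an interrupted multi-GPU extraction can simply be restarted: each rank rebuilds its stripe from whatever work is still pending.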