
Fix data server ip & add English document

Lengyue 2 years ago
parent
commit
bcc28cd40a
7 changed files with 369 additions and 5 deletions
  1. README.md (+2 −2)
  2. data_server/src/main.rs (+1 −1)
  3. docs/en/finetune.md (+119 −0)
  4. docs/en/index.md (+45 −2)
  5. docs/en/inference.md (+56 −0)
  6. docs/en/samples.md (+145 −0)
  7. docs/zh/index.md (+1 −0)

+ 2 - 2
README.md

@@ -9,8 +9,8 @@ We do not hold any responsibility for any illegal usage of the codebase. Please
 我们不对代码库的任何非法使用承担任何责任. 请参阅您当地关于 DMCA (数字千年法案) 和其他相关法律的法律.
 
 ## Documents / 文档
-- [English](https://speech.fish.audio/en/)
-- [中文](https://speech.fish.audio/zh/)
+- [English](https://speech.fish.audio/zh/latest/en/)
+- [中文](https://speech.fish.audio/zh/latest/zh/)
 
 
 ## Credits / 鸣谢

+ 1 - 1
data_server/src/main.rs

@@ -133,7 +133,7 @@ async fn main() -> Result<(), Box<dyn std::error::Error>> {
     // Parse command-line arguments
     let args = Args::parse();
 
-    let addr = "[::1]:50051".parse()?;
+    let addr = "127.0.0.1:50051".parse()?;
     let data_service = MyDataService::new(args.files)?;
 
     info!("Starting server at {}", addr);
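The one-line change above swaps the IPv6 loopback literal for the IPv4 one. A minimal Python check (an editorial sketch, not part of the commit) illustrates why the two literals are not interchangeable:

```python
import socket

# "[::1]" is the IPv6 loopback address and "127.0.0.1" the IPv4 one;
# a server bound only to the former refuses plain IPv4 clients.
# inet_pton parses each literal under its own address family:
v6 = socket.inet_pton(socket.AF_INET6, "::1")
v4 = socket.inet_pton(socket.AF_INET, "127.0.0.1")
print(len(v6), len(v4))  # 16-byte IPv6 address vs. 4-byte IPv4 address
```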

+ 119 - 0
docs/en/finetune.md

@@ -0,0 +1,119 @@
+# Fine-tuning
+
+If you are reading this page, you are probably not satisfied with the performance of the few-shot pre-trained model and want to fine-tune it to improve its performance on your dataset.
+
+`Fish Speech` consists of two modules: `VQGAN` and `LLAMA`. Currently, we only support fine-tuning the `LLAMA` model.
+
+## Fine-tuning LLAMA
+### 1. Prepare the dataset
+
+```
+.
+├── SPK1
+│   ├── 21.15-26.44.lab
+│   ├── 21.15-26.44.mp3
+│   ├── 27.51-29.98.lab
+│   ├── 27.51-29.98.mp3
+│   ├── 30.1-32.71.lab
+│   └── 30.1-32.71.mp3
+└── SPK2
+    ├── 38.79-40.85.lab
+    └── 38.79-40.85.mp3
+```
+
+You need to convert your dataset into the above format and place it under `data/demo`. The audio file can have the extensions `.mp3`, `.wav`, or `.flac`, and the annotation file can have the extensions `.lab` or `.txt`.
+
+!!! note
+    You can modify the dataset path and mix datasets by modifying `fish_speech/configs/data/finetune.yaml`.
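To sanity-check a dataset in this layout before extraction, a small helper like the following can be used (an editorial sketch, not part of the repo; the function name is illustrative):

```python
from pathlib import Path

AUDIO_EXTS = {".mp3", ".wav", ".flac"}
LABEL_EXTS = {".lab", ".txt"}

def unlabeled_clips(root):
    """Return audio files under `root` that lack a matching annotation file.

    Assumes the SPK*/clip layout shown above, where the annotation shares
    the audio file's stem (e.g. 21.15-26.44.mp3 + 21.15-26.44.lab).
    """
    missing = []
    for audio in sorted(Path(root).rglob("*")):
        if audio.suffix.lower() not in AUDIO_EXTS:
            continue
        if not any(audio.with_suffix(ext).exists() for ext in LABEL_EXTS):
            missing.append(str(audio))
    return missing
```

For a correctly prepared dataset, `unlabeled_clips("data/demo")` should return an empty list.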
+
+### 2. Batch-wise extraction of semantic tokens
+
+Make sure you have downloaded the VQGAN weights. If not, run the following command:
+
+```bash
+huggingface-cli download fishaudio/speech-lm-v1 vqgan-v1.pth --local-dir checkpoints
+```
+
+You can then run the following command to extract semantic tokens:
+
+```bash
+python tools/vqgan/extract_vq.py data/demo \
+    --num-workers 1 --batch-size 16 \
+    --config-name "vqgan_pretrain" \
+    --checkpoint-path "checkpoints/vqgan-v1.pth"
+```
+
+!!! note
+    You can adjust `--num-workers` and `--batch-size` to increase extraction speed, but please make sure not to exceed your GPU memory limit.
+
+This command will create `.npy` files in the `data/demo` directory, as shown below:
+
+```
+.
+├── SPK1
+│   ├── 21.15-26.44.lab
+│   ├── 21.15-26.44.mp3
+│   ├── 21.15-26.44.npy
+│   ├── 27.51-29.98.lab
+│   ├── 27.51-29.98.mp3
+│   ├── 27.51-29.98.npy
+│   ├── 30.1-32.71.lab
+│   ├── 30.1-32.71.mp3
+│   └── 30.1-32.71.npy
+└── SPK2
+    ├── 38.79-40.85.lab
+    ├── 38.79-40.85.mp3
+    └── 38.79-40.85.npy
+```
+```
+
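Each `.npy` file holds the semantic token ids for one clip. A quick way to inspect one (an editorial sketch; the exact shape and dtype depend on the VQGAN config, so a dummy file stands in here):

```python
import numpy as np

# Simulate one extracted token file; the real files are written by
# tools/vqgan/extract_vq.py next to each clip. int64 ids are an
# assumption for illustration only.
np.save("demo_tokens.npy", np.arange(12, dtype=np.int64))

tokens = np.load("demo_tokens.npy")
print(tokens.dtype, tokens.size)  # these token ids feed the LLAMA stage
```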
+### 3. Pack the dataset into protobuf
+
+```bash
+python tools/llama/build_dataset.py \
+    --config "fish_speech/configs/data/finetune.yaml" \
+    --output "data/quantized-dataset-ft.protos"
+```
+
+After the command finishes executing, you should see the `quantized-dataset-ft.protos` file in the `data` directory.
+
+### 4. Start the Rust data server
+
+Loading and shuffling the dataset is very slow and memory-consuming. Therefore, we use a Rust server to load and shuffle the data. The server is based on gRPC and can be installed as follows:
+
+```bash
+cd data_server
+cargo build --release
+```
+
+After the compilation is complete, you can start the server using the following command:
+
+```bash
+export RUST_LOG=info # Optional, for debugging
+data_server/target/release/data_server \
+    --files "data/quantized-dataset-ft.protos" 
+```
+
+!!! note
+    You can specify multiple `--files` parameters to load multiple datasets.
+
+### 5. Finally, start the fine-tuning
+
+Similarly, make sure you have downloaded the `LLAMA` weights. If not, run the following command:
+
+```bash
+huggingface-cli download fishaudio/speech-lm-v1 text2semantic-400m-v0.2-4k.pth --local-dir checkpoints
+```
+
+Finally, you can start the fine-tuning by running the following command:
+```bash
+python fish_speech/train.py --config-name text2semantic_finetune_spk
+```
+
+!!! note
+    You can modify the training parameters such as `batch_size`, `gradient_accumulation_steps`, etc. to fit your GPU memory by modifying `fish_speech/configs/text2semantic_finetune_spk.yaml`.
+
+After training is complete, you can refer to the inference section to generate speech.
+
+!!! info
+    By default, the model will only learn the speaker's speech patterns and not the timbre. You still need to use prompts to ensure timbre stability.
+    If you want to learn the timbre, you can increase the number of training steps, but this may lead to overfitting.

+ 45 - 2
docs/en/index.md

@@ -1,3 +1,46 @@
-# Welcome to Fish Speech
+# Introduction
+
+!!! warning
+    We assume no responsibility for any illegal use of the codebase. Please refer to the local laws regarding DMCA (Digital Millennium Copyright Act) and other relevant laws in your area.
+
+This codebase is released under the `BSD-3-Clause` license, and all models are released under the `CC-BY-NC-SA-4.0` license.
+
+<p align="center">
+<img src="../assets/figs/diagram.png" width="75%">
+</p>
+
+## Requirements
+- GPU Memory: 2GB (for inference), 16GB (for fine-tuning)
+- System: Linux (full functionality), Windows (inference only, no support for `flash-attn`, no support for `torch.compile`)
+
+Therefore, we strongly recommend that Windows users run the codebase via WSL2 or Docker.
+
+## Setup
+```bash
+# Create a python 3.10 virtual environment, you can also use virtualenv
+conda create -n fish-speech python=3.10
+conda activate fish-speech
+
+# Install pytorch
+pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
+
+# Install flash-attn (for Linux)
+pip3 install ninja && MAX_JOBS=4 pip3 install flash-attn --no-build-isolation
+
+# Install fish-speech
+pip3 install -e .
+```
+
+## Changelog
+
+- 2023/12/18: Updated fine-tuning documentation and related examples.
+- 2023/12/17: Updated `text2semantic` model, supporting phoneme-free mode.
+- 2023/12/13: Beta version released, includes VQGAN model and a language model based on LLAMA (phoneme support only).
+
+## Acknowledgements
+- [VITS2 (daniilrobnikov)](https://github.com/daniilrobnikov/vits2)
+- [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2)
+- [GPT VITS](https://github.com/innnky/gpt-vits)
+- [MQTTS](https://github.com/b04901014/MQTTS)
+- [GPT Fast](https://github.com/pytorch-labs/gpt-fast)
 
-English Document is under construction.

+ 56 - 0
docs/en/inference.md

@@ -0,0 +1,56 @@
+# Inference
+
+Inference is planned to support both command-line and WebUI methods, but currently only command-line inference is available.
+
+!!! note
+    Overall, inference consists of several steps:
+
+    1. Encode a given 5-10 second reference voice clip using VQGAN.
+    2. Feed the encoded semantic tokens and the corresponding text into the language model as an example.
+    3. Given a new piece of text, let the model generate the corresponding semantic tokens.
+    4. Feed the generated semantic tokens into VQGAN to decode and generate the corresponding voice.
+
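The documented commands can be chained in a small driver script. The following is a sketch only (not part of the repo): it merely assembles the argv lists for the three commands shown below, using the doc's own placeholder paths.

```python
import shlex

# Commands mirroring the three documented steps; "paimon.wav",
# "fake.npy" and the checkpoint paths are the doc's placeholders.
steps = [
    "python tools/vqgan/inference.py -i paimon.wav"
    " --checkpoint-path checkpoints/vqgan-v1.pth",
    "python tools/llama/generate.py --text 'Hello' --prompt-tokens fake.npy"
    " --checkpoint-path checkpoints/text2semantic-400m-v0.2-4k.pth",
    "python tools/vqgan/inference.py -i codes_0.npy"
    " --checkpoint-path checkpoints/vqgan-v1.pth",
]
argvs = [shlex.split(cmd) for cmd in steps]  # each could go to subprocess.run
print(len(argvs), argvs[0][1])
```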
+## Command Line Inference
+
+Download the required `vqgan` and `text2semantic` models from our Hugging Face repository.
+
+```bash
+huggingface-cli download fishaudio/speech-lm-v1 vqgan-v1.pth --local-dir checkpoints
+huggingface-cli download fishaudio/speech-lm-v1 text2semantic-400m-v0.2-4k.pth --local-dir checkpoints
+```
+
+### 1. Generate prompt from voice:
+
+!!! note
+    If you plan to let the model randomly choose a voice timbre, you can skip this step.
+
+```bash
+python tools/vqgan/inference.py \
+    -i "paimon.wav" \
+    --checkpoint-path "checkpoints/vqgan-v1.pth"
+```
+You should get a `fake.npy` file.
+
+### 2. Generate semantic tokens from text:
+```bash
+python tools/llama/generate.py \
+    --text "The text you want to convert" \
+    --prompt-text "Your reference text" \
+    --prompt-tokens "fake.npy" \
+    --checkpoint-path "checkpoints/text2semantic-400m-v0.2-4k.pth" \
+    --num-samples 2 \
+    --compile
+```
+
+This command will create a `codes_N` file in the working directory, where N is an integer starting from 0.
+
+!!! note
+    You may want to use `--compile` to fuse CUDA kernels for faster inference (~30 tokens/second -> ~500 tokens/second).
+    Conversely, if you do not plan to use acceleration, you can omit the `--compile` parameter.
+
+### 3. Generate speech from semantic tokens:
+```bash
+python tools/vqgan/inference.py \
+    -i "codes_0.npy" \
+    --checkpoint-path "checkpoints/vqgan-v1.pth"
+```

+ 145 - 0
docs/en/samples.md

@@ -0,0 +1,145 @@
+# Examples
+
+!!! note
+    Due to insufficient Japanese and English training data, we first phonemicize the text and then use it for generation.
+
+## Chinese Sentence 1
+```
+人间灯火倒映湖中,她的渴望让静水泛起涟漪。若代价只是孤独,那就让这份愿望肆意流淌。
+流入她所注视的世间,也流入她如湖水般澄澈的目光。
+```
+
+<table>
+    <thead>
+    <tr>
+        <th>Speaker</th>
+        <th>Input Audio</th>
+        <th>Synthesized Audio</th>
+    </tr>
+    </thead>
+    <tbody>
+    <tr>
+        <td>Nahida (Genshin Impact)</td>
+        <td><audio controls preload="auto" src="../../assets/audios/0_input.wav" /></td>
+        <td><audio controls preload="auto" src="../../assets/audios/0_output.wav" /></td>
+    </tr>
+    <tr>
+        <td>Zhongli (Genshin Impact)</td>
+        <td><audio controls preload="auto" src="../../assets/audios/1_input.wav" /></td>
+        <td><audio controls preload="auto" src="../../assets/audios/1_output.wav" /></td>
+    </tr>
+    <tr>
+        <td>Furina (Genshin Impact)</td>
+        <td><audio controls preload="auto" src="../../assets/audios/2_input.wav" /></td>
+        <td><audio controls preload="auto" src="../../assets/audios/2_output.wav" /></td>
+    </tr>
+    <tr>
+        <td>Random Speaker 1</td>
+        <td> - </td>
+        <td><audio controls preload="auto" src="../../assets/audios/3_output.wav" /></td>
+    </tr>
+    <tr>
+        <td>Random Speaker 2</td>
+        <td> - </td>
+        <td><audio controls preload="auto" src="../../assets/audios/4_output.wav" /></td>
+    </tr>
+    </tbody>
+</table>
+
+
+## Chinese Sentence 2 (Long Sentence)
+```
+你们这个是什么群啊,你们这是害人不浅啊你们这个群!谁是群主,出来!真的太过分了。你们搞这个群干什么?
+我儿子每一科的成绩都不过那个平均分呐,他现在初二,你叫我儿子怎么办啊?他现在还不到高中啊?
+你们害死我儿子了!快点出来你这个群主!再这样我去报警了啊!我跟你们说你们这一帮人啊,一天到晚啊,
+搞这些什么游戏啊,动漫啊,会害死你们的,你们没有前途我跟你说。你们这九百多个人,好好学习不好吗?
+一天到晚在上网。有什么意思啊?麻烦你重视一下你们的生活的目标啊?有一点学习目标行不行?一天到晚上网是不是人啊?
+```
+
+<table>
+    <thead>
+    <tr>
+        <th>Speaker</th>
+        <th>Input Audio</th>
+        <th>Synthesized Audio</th>
+    </tr>
+    </thead>
+    <tbody>
+    <tr>
+        <td>Nahida (Genshin Impact)</td>
+        <td><audio controls preload="auto" src="../../assets/audios/0_input.wav" /></td>
+        <td><audio controls preload="auto" src="../../assets/audios/5_output.wav" /></td>
+    </tr>
+    <tr>
        <td>Seiki (Honkai: Star Rail)</td>
+        <td><audio controls preload="auto" src="../../assets/audios/6_input.wav" /></td>
+        <td><audio controls preload="auto" src="../../assets/audios/6_output.wav" /></td>
+    </tr>
+    </tbody>
+</table>
+
+## English Sentence
+
+```
+In the realm of advanced technology, the evolution of artificial intelligence stands as a 
+monumental achievement. This dynamic field, constantly pushing the boundaries of what 
+machines can do, has seen rapid growth and innovation. From deciphering complex data 
+patterns to driving cars autonomously, AI's applications are vast and diverse.
+```
+
+<table>
+    <thead>
+    <tr>
+        <th>Speaker</th>
+        <th>Input Audio</th>
+        <th>Synthesized Audio</th>
+    </tr>
+    </thead>
+    <tbody>
+    <tr>
+        <td>Speaker 200 (LibriTTS)</td>
+        <td><audio controls preload="auto" src="../../assets/audios/7_input.wav" /></td>
+        <td><audio controls preload="auto" src="../../assets/audios/7_output.wav" /></td>
+    </tr>
+    <tr>
+        <td>Random Speaker 1</td>
+        <td> - </td>
+        <td><audio controls preload="auto" src="../../assets/audios/8_output.wav" /></td>
+    </tr>
+    <tr>
+        <td>Random Speaker 2</td>
+        <td> - </td>
+        <td><audio controls preload="auto" src="../../assets/audios/9_output.wav" /></td>
+    </tr>
+    </tbody>
+</table>
+
+## Japanese Sentence
+
+```
+先進技術の領域において、人工知能の進化は画期的な成果として立っています。常に機械ができることの限界を
+押し広げているこのダイナミックな分野は、急速な成長と革新を見せています。複雑なデータパターンの解読か
+ら自動運転車の操縦まで、AIの応用は広範囲に及びます。
+```
+
+<table>
+    <thead>
+    <tr>
+        <th>Speaker</th>
+        <th>Input Audio</th>
+        <th>Synthesized Audio</th>
+    </tr>
+    </thead>
+    <tbody>
+    <tr>
+        <td>Random Speaker 1</td>
+        <td> - </td>
+        <td><audio controls preload="auto" src="../../assets/audios/10_output.wav" /></td>
+    </tr>
+    <tr>
+        <td>Random Speaker 2</td>
+        <td> - </td>
+        <td><audio controls preload="auto" src="../../assets/audios/11_output.wav" /></td>
+    </tr>
+    </tbody>
+</table>

+ 1 - 0
docs/zh/index.md

@@ -33,6 +33,7 @@ pip3 install -e .
 
 ## 更新日志
 
+- 2023/12/18: 更新了微调文档和相关例子.
 - 2023/12/17: 更新了 `text2semantic` 模型, 支持无音素模式.
 - 2023/12/13: 测试版发布, 包含 VQGAN 模型和一个基于 LLAMA 的语言模型 (只支持音素).