
Fix data server ip & add English document

Lengyue 2 years ago
parent
commit
bcc28cd40a
7 changed files with 369 additions and 5 deletions
  1. README.md (+2 −2)
  2. data_server/src/main.rs (+1 −1)
  3. docs/en/finetune.md (+119 −0)
  4. docs/en/index.md (+45 −2)
  5. docs/en/inference.md (+56 −0)
  6. docs/en/samples.md (+145 −0)
  7. docs/zh/index.md (+1 −0)

+ 2 - 2
README.md

@@ -9,8 +9,8 @@ We do not hold any responsibility for any illegal usage of the codebase. Please
 我们不对代码库的任何非法使用承担任何责任. 请参阅您当地关于 DMCA (数字千年法案) 和其他相关法律的法律.
 
 ## Documents / 文档
-- [English](https://speech.fish.audio/en/)
-- [中文](https://speech.fish.audio/zh/)
+- [English](https://speech.fish.audio/zh/latest/en/)
+- [中文](https://speech.fish.audio/zh/latest/zh/)
 
 
 ## Credits / 鸣谢

+ 1 - 1
data_server/src/main.rs

@@ -133,7 +133,7 @@ async fn main() -> Result<(), Box<dyn std::error::Error>> {
     // Parse command-line arguments
     let args = Args::parse();
 
-    let addr = "[::1]:50051".parse()?;
+    let addr = "127.0.0.1:50051".parse()?;
     let data_service = MyDataService::new(args.files)?;
 
     info!("Starting server at {}", addr);
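The one-line change above swaps the IPv6 loopback literal for the IPv4 one. A minimal Python check (an editorial sketch, not part of the commit) illustrates why the two literals are not interchangeable:

```python
import socket

# "[::1]" is the IPv6 loopback address and "127.0.0.1" the IPv4 one;
# a server bound only to the former refuses plain IPv4 clients.
# inet_pton parses each literal under its own address family:
v6 = socket.inet_pton(socket.AF_INET6, "::1")
v4 = socket.inet_pton(socket.AF_INET, "127.0.0.1")
print(len(v6), len(v4))  # 16-byte IPv6 address vs. 4-byte IPv4 address
```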

+ 119 - 0
docs/en/finetune.md

@@ -0,0 +1,119 @@
+# Fine-tuning
+
+If you are reading this page, you are probably not satisfied with the performance of the few-shot pre-trained model and want to fine-tune it to improve its performance on your dataset.
+
+`Fish Speech` consists of two modules: `VQGAN` and `LLAMA`. Currently, we only support fine-tuning the `LLAMA` model.
+
+## Fine-tuning LLAMA
+### 1. Prepare the dataset
+
+```
+.
+├── SPK1
+│   ├── 21.15-26.44.lab
+│   ├── 21.15-26.44.mp3
+│   ├── 27.51-29.98.lab
+│   ├── 27.51-29.98.mp3
+│   ├── 30.1-32.71.lab
+│   └── 30.1-32.71.mp3
+└── SPK2
+    ├── 38.79-40.85.lab
+    └── 38.79-40.85.mp3
+```
+
+You need to convert your dataset into the above format and place it under `data/demo`. The audio file can have the extensions `.mp3`, `.wav`, or `.flac`, and the annotation file can have the extensions `.lab` or `.txt`.
+
+!!! note
+    You can modify the dataset path and mix datasets by modifying `fish_speech/configs/data/finetune.yaml`.
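To sanity-check a dataset in this layout before extraction, a small helper like the following can be used (an editorial sketch, not part of the repo; the function name is illustrative):

```python
from pathlib import Path

AUDIO_EXTS = {".mp3", ".wav", ".flac"}
LABEL_EXTS = {".lab", ".txt"}

def unlabeled_clips(root):
    """Return audio files under `root` that lack a matching annotation file.

    Assumes the SPK*/clip layout shown above, where the annotation shares
    the audio file's stem (e.g. 21.15-26.44.mp3 + 21.15-26.44.lab).
    """
    missing = []
    for audio in sorted(Path(root).rglob("*")):
        if audio.suffix.lower() not in AUDIO_EXTS:
            continue
        if not any(audio.with_suffix(ext).exists() for ext in LABEL_EXTS):
            missing.append(str(audio))
    return missing
```

For a correctly prepared dataset, `unlabeled_clips("data/demo")` should return an empty list.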
+
+### 2. Batch-wise extraction of semantic tokens
+
+Make sure you have downloaded the VQGAN weights. If not, run the following command:
+
+```bash
+huggingface-cli download fishaudio/speech-lm-v1 vqgan-v1.pth --local-dir checkpoints
+```
+
+You can then run the following command to extract semantic tokens:
+
+```bash
+python tools/vqgan/extract_vq.py data/demo \
+    --num-workers 1 --batch-size 16 \
+    --config-name "vqgan_pretrain" \
+    --checkpoint-path "checkpoints/vqgan-v1.pth"
+```
+
+!!! note
+    You can adjust `--num-workers` and `--batch-size` to increase extraction speed, but please make sure not to exceed your GPU memory limit.
+
+This command will create `.npy` files in the `data/demo` directory, as shown below:
+
+```
+.
+├── SPK1
+│   ├── 21.15-26.44.lab
+│   ├── 21.15-26.44.mp3
+│   ├── 21.15-26.44.npy
+│   ├── 27.51-29.98.lab
+│   ├── 27.51-29.98.mp3
+│   ├── 27.51-29.98.npy
+│   ├── 30.1-32.71.lab
+│   ├── 30.1-32.71.mp3
+│   └── 30.1-32.71.npy
+└── SPK2
+    ├── 38.79-40.85.lab
+    ├── 38.79-40.85.mp3
+    └── 38.79-40.85.npy
+```
+```
+
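Each `.npy` file holds the semantic token ids for one clip. A quick way to inspect one (an editorial sketch; the exact shape and dtype depend on the VQGAN config, so a dummy file stands in here):

```python
import numpy as np

# Simulate one extracted token file; the real files are written by
# tools/vqgan/extract_vq.py next to each clip. int64 ids are an
# assumption for illustration only.
np.save("demo_tokens.npy", np.arange(12, dtype=np.int64))

tokens = np.load("demo_tokens.npy")
print(tokens.dtype, tokens.size)  # these token ids feed the LLAMA stage
```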
+### 3. Pack the dataset into protobuf
+
+```bash
+python tools/llama/build_dataset.py \
+    --config "fish_speech/configs/data/finetune.yaml" \
+    --output "data/quantized-dataset-ft.protos"
+```
+
+After the command finishes executing, you should see the `quantized-dataset-ft.protos` file in the `data` directory.
+
+### 4. Start the Rust data server
+
+Loading and shuffling the dataset is very slow and memory-consuming. Therefore, we use a Rust server to load and shuffle the data. The server is based on gRPC and can be installed as follows:
+
+```bash
+cd data_server
+cargo build --release
+```
+
+After the compilation is complete, you can start the server using the following command:
+
+```bash
+export RUST_LOG=info # Optional, for debugging
+data_server/target/release/data_server \
+    --files "data/quantized-dataset-ft.protos" 
+```
+
+!!! note
+    You can specify multiple `--files` parameters to load multiple datasets.
+
+### 5. Finally, start the fine-tuning
+
+Similarly, make sure you have downloaded the `LLAMA` weights. If not, run the following command:
+
+```bash
+huggingface-cli download fishaudio/speech-lm-v1 text2semantic-400m-v0.2-4k.pth --local-dir checkpoints
+```
+
+Finally, you can start the fine-tuning by running the following command:
+```bash
+python fish_speech/train.py --config-name text2semantic_finetune_spk
+```
+
+!!! note
+    You can modify the training parameters such as `batch_size`, `gradient_accumulation_steps`, etc. to fit your GPU memory by modifying `fish_speech/configs/text2semantic_finetune_spk.yaml`.
+
+After training is complete, you can refer to the inference section to generate speech.
+
+!!! info
+    By default, the model will only learn the speaker's speech patterns and not the timbre. You still need to use prompts to ensure timbre stability.
+    If you want to learn the timbre, you can increase the number of training steps, but this may lead to overfitting.

+ 45 - 2
docs/en/index.md

@@ -1,3 +1,46 @@
-# Welcome to Fish Speech
+# Introduction
+
+!!! warning
+    We assume no responsibility for any illegal use of the codebase. Please refer to the local laws regarding DMCA (Digital Millennium Copyright Act) and other relevant laws in your area.
+
+This codebase is released under the `BSD-3-Clause` license, and all models are released under the `CC-BY-NC-SA-4.0` license.
+
+<p align="center">
+<img src="../assets/figs/diagram.png" width="75%">
+</p>
+
+## Requirements
+- GPU Memory: 2GB (for inference), 16GB (for fine-tuning)
+- System: Linux (full functionality), Windows (inference only, no support for `flash-attn`, no support for `torch.compile`)
+
+Therefore, we strongly recommend that Windows users run the codebase via WSL2 or Docker.
+
+## Setup
+```bash
+# Create a python 3.10 virtual environment, you can also use virtualenv
+conda create -n fish-speech python=3.10
+conda activate fish-speech
+
+# Install pytorch
+pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
+
+# Install flash-attn (for Linux)
+pip3 install ninja && MAX_JOBS=4 pip3 install flash-attn --no-build-isolation
+
+# Install fish-speech
+pip3 install -e .
+```
+
+## Changelog
+
+- 2023/12/18: Updated fine-tuning documentation and related examples.
+- 2023/12/17: Updated `text2semantic` model, supporting phoneme-free mode.
+- 2023/12/13: Beta version released, includes VQGAN model and a language model based on LLAMA (phoneme support only).
+
+## Acknowledgements
+- [VITS2 (daniilrobnikov)](https://github.com/daniilrobnikov/vits2)
+- [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2)
+- [GPT VITS](https://github.com/innnky/gpt-vits)
+- [MQTTS](https://github.com/b04901014/MQTTS)
+- [GPT Fast](https://github.com/pytorch-labs/gpt-fast)
 
-English Document is under construction.

+ 56 - 0
docs/en/inference.md

@@ -0,0 +1,56 @@
+# Inference
+
+Inference is planned to support both command-line and WebUI methods, but currently only command-line inference is available.
+
+!!! note
+    Overall, inference consists of several steps:
+
+    1. Encode a given 5-10 second reference voice clip using VQGAN.
+    2. Feed the encoded semantic tokens and the corresponding text into the language model as an example.
+    3. Given a new piece of text, let the model generate the corresponding semantic tokens.
+    4. Feed the generated semantic tokens into VQGAN to decode and generate the corresponding voice.
+
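The documented commands can be chained in a small driver script. The following is a sketch only (not part of the repo): it merely assembles the argv lists for the three commands shown below, using the doc's own placeholder paths.

```python
import shlex

# Commands mirroring the three documented steps; "paimon.wav",
# "fake.npy" and the checkpoint paths are the doc's placeholders.
steps = [
    "python tools/vqgan/inference.py -i paimon.wav"
    " --checkpoint-path checkpoints/vqgan-v1.pth",
    "python tools/llama/generate.py --text 'Hello' --prompt-tokens fake.npy"
    " --checkpoint-path checkpoints/text2semantic-400m-v0.2-4k.pth",
    "python tools/vqgan/inference.py -i codes_0.npy"
    " --checkpoint-path checkpoints/vqgan-v1.pth",
]
argvs = [shlex.split(cmd) for cmd in steps]  # each could go to subprocess.run
print(len(argvs), argvs[0][1])
```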
+## Command Line Inference
+
+Download the required `vqgan` and `text2semantic` models from our Hugging Face repository.
+
+```bash
+huggingface-cli download fishaudio/speech-lm-v1 vqgan-v1.pth --local-dir checkpoints
+huggingface-cli download fishaudio/speech-lm-v1 text2semantic-400m-v0.2-4k.pth --local-dir checkpoints
+```
+
+### 1. Generate prompt from voice:
+
+!!! note
+    If you plan to let the model randomly choose a voice timbre, you can skip this step.
+
+```bash
+python tools/vqgan/inference.py \
+    -i "paimon.wav" \
+    --checkpoint-path "checkpoints/vqgan-v1.pth"
+```
+You should get a `fake.npy` file.
+
+### 2. Generate semantic tokens from text:
+```bash
+python tools/llama/generate.py \
+    --text "The text you want to convert" \
+    --prompt-text "Your reference text" \
+    --prompt-tokens "fake.npy" \
+    --checkpoint-path "checkpoints/text2semantic-400m-v0.2-4k.pth" \
+    --num-samples 2 \
+    --compile
+```
+
+This command will create a `codes_N` file in the working directory, where N is an integer starting from 0.
+
+!!! note
+    You may want to use `--compile` to fuse CUDA kernels for faster inference (~30 tokens/second -> ~500 tokens/second).
+    Conversely, if you do not plan to use acceleration, you can omit the `--compile` parameter.
+
+### 3. Generate speech from semantic tokens:
+```bash
+python tools/vqgan/inference.py \
+    -i "codes_0.npy" \
+    --checkpoint-path "checkpoints/vqgan-v1.pth"
+```

+ 145 - 0
docs/en/samples.md

@@ -0,0 +1,145 @@
+# Examples
+
+!!! note
+    Due to insufficient Japanese and English training data, we first phonemicize the text and then use it for generation.
+
+## Chinese Sentence 1
+```
+人间灯火倒映湖中,她的渴望让静水泛起涟漪。若代价只是孤独,那就让这份愿望肆意流淌。
+流入她所注视的世间,也流入她如湖水般澄澈的目光。
+```
+
+<table>
+    <thead>
+    <tr>
+        <th>Speaker</th>
+        <th>Input Audio</th>
+        <th>Synthesized Audio</th>
+    </tr>
+    </thead>
+    <tbody>
+    <tr>
+        <td>Nahida (Genshin Impact)</td>
+        <td><audio controls preload="auto" src="../../assets/audios/0_input.wav" /></td>
+        <td><audio controls preload="auto" src="../../assets/audios/0_output.wav" /></td>
+    </tr>
+    <tr>
+        <td>Zhongli (Genshin Impact)</td>
+        <td><audio controls preload="auto" src="../../assets/audios/1_input.wav" /></td>
+        <td><audio controls preload="auto" src="../../assets/audios/1_output.wav" /></td>
+    </tr>
+    <tr>
+        <td>Furina (Genshin Impact)</td>
+        <td><audio controls preload="auto" src="../../assets/audios/2_input.wav" /></td>
+        <td><audio controls preload="auto" src="../../assets/audios/2_output.wav" /></td>
+    </tr>
+    <tr>
+        <td>Random Speaker 1</td>
+        <td> - </td>
+        <td><audio controls preload="auto" src="../../assets/audios/3_output.wav" /></td>
+    </tr>
+    <tr>
+        <td>Random Speaker 2</td>
+        <td> - </td>
+        <td><audio controls preload="auto" src="../../assets/audios/4_output.wav" /></td>
+    </tr>
+    </tbody>
+</table>
+
+
+## Chinese Sentence 2 (Long Sentence)
+```
+你们这个是什么群啊,你们这是害人不浅啊你们这个群!谁是群主,出来!真的太过分了。你们搞这个群干什么?
+我儿子每一科的成绩都不过那个平均分呐,他现在初二,你叫我儿子怎么办啊?他现在还不到高中啊?
+你们害死我儿子了!快点出来你这个群主!再这样我去报警了啊!我跟你们说你们这一帮人啊,一天到晚啊,
+搞这些什么游戏啊,动漫啊,会害死你们的,你们没有前途我跟你说。你们这九百多个人,好好学习不好吗?
+一天到晚在上网。有什么意思啊?麻烦你重视一下你们的生活的目标啊?有一点学习目标行不行?一天到晚上网是不是人啊?
+```
+
+<table>
+    <thead>
+    <tr>
+        <th>Speaker</th>
+        <th>Input Audio</th>
+        <th>Synthesized Audio</th>
+    </tr>
+    </thead>
+    <tbody>
+    <tr>
+        <td>Nahida (Genshin Impact)</td>
+        <td><audio controls preload="auto" src="../../assets/audios/0_input.wav" /></td>
+        <td><audio controls preload="auto" src="../../assets/audios/5_output.wav" /></td>
+    </tr>
+    <tr>
        <td>Seiki (Honkai: Star Rail)</td>
+        <td><audio controls preload="auto" src="../../assets/audios/6_input.wav" /></td>
+        <td><audio controls preload="auto" src="../../assets/audios/6_output.wav" /></td>
+    </tr>
+    </tbody>
+</table>
+
+## English Sentence
+
+```
+In the realm of advanced technology, the evolution of artificial intelligence stands as a 
+monumental achievement. This dynamic field, constantly pushing the boundaries of what 
+machines can do, has seen rapid growth and innovation. From deciphering complex data 
+patterns to driving cars autonomously, AI's applications are vast and diverse.
+```
+
+<table>
+    <thead>
+    <tr>
+        <th>Speaker</th>
+        <th>Input Audio</th>
+        <th>Synthesized Audio</th>
+    </tr>
+    </thead>
+    <tbody>
+    <tr>
+        <td>Speaker 200 (LibriTTS)</td>
+        <td><audio controls preload="auto" src="../../assets/audios/7_input.wav" /></td>
+        <td><audio controls preload="auto" src="../../assets/audios/7_output.wav" /></td>
+    </tr>
+    <tr>
+        <td>Random Speaker 1</td>
+        <td> - </td>
+        <td><audio controls preload="auto" src="../../assets/audios/8_output.wav" /></td>
+    </tr>
+    <tr>
+        <td>Random Speaker 2</td>
+        <td> - </td>
+        <td><audio controls preload="auto" src="../../assets/audios/9_output.wav" /></td>
+    </tr>
+    </tbody>
+</table>
+
+## Japanese Sentence
+
+```
+先進技術の領域において、人工知能の進化は画期的な成果として立っています。常に機械ができることの限界を
+押し広げているこのダイナミックな分野は、急速な成長と革新を見せています。複雑なデータパターンの解読か
+ら自動運転車の操縦まで、AIの応用は広範囲に及びます。
+```
+
+<table>
+    <thead>
+    <tr>
+        <th>Speaker</th>
+        <th>Input Audio</th>
+        <th>Synthesized Audio</th>
+    </tr>
+    </thead>
+    <tbody>
+    <tr>
+        <td>Random Speaker 1</td>
+        <td> - </td>
+        <td><audio controls preload="auto" src="../../assets/audios/10_output.wav" /></td>
+    </tr>
+    <tr>
+        <td>Random Speaker 2</td>
+        <td> - </td>
+        <td><audio controls preload="auto" src="../../assets/audios/11_output.wav" /></td>
+    </tr>
+    </tbody>
+</table>

+ 1 - 0
docs/zh/index.md

@@ -33,6 +33,7 @@ pip3 install -e .
 
 ## 更新日志
 
+- 2023/12/18: 更新了微调文档和相关例子.
 - 2023/12/17: 更新了 `text2semantic` 模型, 支持无音素模式.
 - 2023/12/13: 测试版发布, 包含 VQGAN 模型和一个基于 LLAMA 的语言模型 (只支持音素).