Lengyue, 2 years ago
Parent commit
Commit c0585bff0f
12 changed files with 142 additions and 36 deletions
  1. README.md (+5 −2)
  2. docs/en/finetune.md (+63 −3)
  3. docs/en/index.md (+1 −0)
  4. docs/en/inference.md (+25 −1)
  5. docs/en/samples.md (+1 −1)
  6. docs/index.md (+0 −4)
  7. docs/requirements.txt (+2 −0)
  8. docs/stylesheets/extra.css (+0 −3)
  9. docs/zh/finetune.md (+12 −10)
  10. docs/zh/index.md (+2 −1)
  11. docs/zh/inference.md (+1 −1)
  12. mkdocs.yml (+30 −10)

+ 5 - 2
README.md

@@ -6,12 +6,15 @@ This codebase is released under BSD-3-Clause License, and all models are release
 
 
 ## Disclaimer / 免责声明
 We do not hold any responsibility for any illegal usage of the codebase. Please refer to your local laws about DMCA and other related laws.  
-我们不对代码库的任何非法使用承担任何责任. 请参阅您当地关于 DMCA (数字千年法案) 和其他相关法律的法律.
+我们不对代码库的任何非法使用承担任何责任. 请参阅您当地关于 DMCA (数字千年法案) 和其他相关法律法规.
 
 ## Documents / 文档
 - [English](https://speech.fish.audio/en/)
-- [中文](https://speech.fish.audio/zh/)
+- [中文](https://speech.fish.audio/)
 
+## Samples / 例子
+- [English](https://speech.fish.audio/en/samples/)
+- [中文](https://speech.fish.audio/samples/)
 
 ## Credits / 鸣谢
 - [VITS2 (daniilrobnikov)](https://github.com/daniilrobnikov/vits2)

+ 63 - 3
docs/en/finetune.md

@@ -2,7 +2,63 @@
 
 
 Obviously, when you opened this page, you were not satisfied with the performance of the few-shot pre-trained model. You want to fine-tune a model to improve its performance on your dataset.
 
-`Fish Speech` consists of two modules: `VQGAN` and `LLAMA`. Currently, we only support fine-tuning the `LLAMA` model.
+`Fish Speech` consists of two modules: `VQGAN` and `LLAMA`.
+
+!!! info 
+    You should first conduct the following test to determine if you need to fine-tune `VQGAN`:
+    ```bash
+    python tools/vqgan/inference.py -i test.wav
+    ```
+    This test will generate a `fake.wav` file. If the timbre of this file differs from the speaker's original voice, or if the quality is not high, you need to fine-tune `VQGAN`.
+
+    Similarly, you can refer to [Inference](inference.md) to run `generate.py` and evaluate if the prosody meets your expectations. If it does not, then you need to fine-tune `LLAMA`.
+
+## Fine-tuning VQGAN
+### 1. Prepare the Dataset
+
+```
+.
+├── SPK1
+│   ├── 21.15-26.44.mp3
+│   ├── 27.51-29.98.mp3
+│   └── 30.1-32.71.mp3
+└── SPK2
+    └── 38.79-40.85.mp3
+```
+
+You need to format your dataset as shown above and place it under `data/demo`. Audio files can have `.mp3`, `.wav`, or `.flac` extensions.
+
+### 2. Split Training and Validation Sets
+
+```bash
+python tools/vqgan/create_train_split.py data/demo
+```
+
+This command will create `data/demo/vq_train_filelist.txt` and `data/demo/vq_val_filelist.txt` in the `data/demo` directory, to be used for training and validation respectively.
+
+!!! info
+    For the VITS format, you can specify a file list using `--filelist xxx.list`.
+    Please note that the audio files in `filelist` must also be located in the `data/demo` folder.
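The split step above can be sketched as follows. This is only a hedged illustration of what such a tool typically does; the 95/5 ratio and fixed seed are assumptions, not the documented defaults of `create_train_split.py`:

```python
import random

def split_filelist(files, val_ratio=0.05, seed=42):
    """Split audio paths into train/val lists (illustrative only)."""
    files = sorted(files)          # deterministic order before shuffling
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    rng.shuffle(files)
    n_val = max(1, int(len(files) * val_ratio))  # keep at least one val file
    return files[n_val:], files[:n_val]

# The real tool writes the two lists to vq_train_filelist.txt / vq_val_filelist.txt.
train, val = split_filelist([f"data/demo/SPK1/{i}.mp3" for i in range(20)])
```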
+
+### 3. Start Training
+
+```bash
+python fish_speech/train.py --config-name vqgan_finetune
+```
+
+!!! note
+    You can modify training parameters by editing `fish_speech/configs/vqgan_finetune.yaml`, but in most cases, this won't be necessary.
+
+### 4. Test the Audio
+    
+```bash
+python tools/vqgan/inference.py -i test.wav --checkpoint-path results/vqgan_finetune/checkpoints/step_000010000.ckpt
+```
+
+You can review `fake.wav` to assess the fine-tuning results.
+
+!!! note
+    You may also try other checkpoints. We suggest using the earliest checkpoint that meets your requirements, as they often perform better on out-of-distribution (OOD) data.
 
 
 ## Fine-tuning LLAMA
 ### 1. Prepare the dataset
@@ -26,7 +82,7 @@ You need to convert your dataset into the above format and place it under `data/
 !!! note
     You can modify the dataset path and mix datasets by modifying `fish_speech/configs/data/finetune.yaml`.
 
-### 2. Batch-wise extraction of semantic tokens
+### 2. Batch extraction of semantic tokens
 
 Make sure you have downloaded the VQGAN weights. If not, run the following command:
 
@@ -76,6 +132,10 @@ python tools/llama/build_dataset.py \
 
 
 After the command finishes executing, you should see the `quantized-dataset-ft.protos` file in the `data` directory.
 
+!!! info
+    For the VITS format, you can specify a file list using `--filelist xxx.list`.
+    Please note that the audio files referenced in `filelist` must also be located in the `data/demo` folder.
+
 ### 4. Start the Rust data server
 
 Loading and shuffling the dataset is very slow and memory-consuming. Therefore, we use a Rust server to load and shuffle the data. This server is based on GRPC and can be installed using the following method:
@@ -112,7 +172,7 @@ python fish_speech/train.py --config-name text2semantic_finetune_spk
 !!! note
     You can modify the training parameters such as `batch_size`, `gradient_accumulation_steps`, etc. to fit your GPU memory by modifying `fish_speech/configs/text2semantic_finetune_spk.yaml`.
 
-After training is complete, you can refer to the inference section to generate speech.
+After training is complete, you can refer to the [inference](inference.md) section, and use `--speaker SPK1` to generate speech.
 
 !!! info
     By default, the model will only learn the speaker's speech patterns and not the timbre. You still need to use prompts to ensure timbre stability.

+ 1 - 0
docs/en/index.md

@@ -33,6 +33,7 @@ pip3 install -e .
 
 
 ## Changelog
 
+- 2023/12/19: Updated webui and HTTP API.
 - 2023/12/18: Updated fine-tuning documentation and related examples.
 - 2023/12/17: Updated `text2semantic` model, supporting phoneme-free mode.
 - 2023/12/13: Beta version released, includes VQGAN model and a language model based on LLAMA (phoneme support only).

+ 25 - 1
docs/en/inference.md

@@ -1,6 +1,6 @@
 # Inference
 
-In the plan, inference is expected to support both command line and webui methods, but currently, only the command-line reasoning function has been completed.  
+Inference supports the command line, an HTTP API, and a web UI.
 
 !!! note
     Overall, inference consists of several parts:
@@ -57,3 +57,27 @@ python tools/vqgan/inference.py \
     -i "codes_0.npy" \
     --checkpoint-path "checkpoints/vqgan-v1.pth"
 ```
+
+## HTTP API Inference
+
+We provide an HTTP API for inference. You can start the server with the following command:
+
+```bash
+python -m zibai tools.api_server:app --listen 127.0.0.1:8000
+```
+
+After that, you can view and test the API at http://127.0.0.1:8000/docs.  
+
+Generally, you first need to call `PUT /v1/models/default` to load the model, and then use `POST /v1/models/default/invoke` for inference. For specific parameters, please refer to the API documentation.
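As a rough illustration of that two-step flow, the sketch below builds the two requests with the Python standard library. The endpoint paths come from the text above; the JSON payload field (`text`) is an assumption, so check the generated docs at `/docs` for the real schema.

```python
import json
import urllib.request

BASE = "http://127.0.0.1:8000"

def build_load_request(base=BASE):
    # Step 1: PUT /v1/models/default asks the server to load the model.
    return urllib.request.Request(f"{base}/v1/models/default", method="PUT")

def build_invoke_request(text, base=BASE):
    # Step 2: POST /v1/models/default/invoke runs inference.
    # The {"text": ...} payload is a guess -- see /docs for actual parameters.
    return urllib.request.Request(
        f"{base}/v1/models/default/invoke",
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# With the server running, send them with urllib.request.urlopen(...).
```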
+
+## WebUI Inference
+
+Before running the WebUI, you need to start the HTTP service as described above.
+
+Then you can start the WebUI using the following command:
+
+```bash
+python fish_speech/webui/app.py
+```
+
+Enjoy!

+ 1 - 1
docs/en/samples.md

@@ -1,4 +1,4 @@
-# Example
+# Samples
 
 
 !!! note
     Due to insufficient Japanese to English training data, we first phonemicize the text and then use it for generation.

+ 0 - 4
docs/index.md

@@ -1,4 +0,0 @@
----
-template: redirect.html
-location: ./zh/
----

+ 2 - 0
docs/requirements.txt

@@ -1 +1,3 @@
 mkdocs-material
+mkdocs-static-i18n[material]
+mkdocs[i18n]

+ 0 - 3
docs/stylesheets/extra.css

@@ -1,6 +1,3 @@
 .md-grid {
   max-width: 1440px; 
 }
-.md-tabs {
-  display: none;
-}

+ 12 - 10
docs/zh/finetune.md

@@ -11,7 +11,7 @@
     ```
     该测试会生成一个 `fake.wav` 文件, 如果该文件的音色和说话人的音色不同, 或者质量不高, 你需要微调 `VQGAN`.
 
-    相应的, 你可以参考 [推理](../inference/) 来运行 `generate.py`, 判断韵律是否满意, 如果不满意, 则需要微调 `LLAMA`.
+    相应的, 你可以参考 [推理](inference.md) 来运行 `generate.py`, 判断韵律是否满意, 如果不满意, 则需要微调 `LLAMA`.
 
 ## VQGAN 微调
 ### 1. 准备数据集
@@ -19,18 +19,14 @@
 ```
 .
 ├── SPK1
-│   ├── 21.15-26.44.lab
 │   ├── 21.15-26.44.mp3
-│   ├── 27.51-29.98.lab
 │   ├── 27.51-29.98.mp3
-│   ├── 30.1-32.71.lab
 │   └── 30.1-32.71.mp3
 └── SPK2
-    ├── 38.79-40.85.lab
     └── 38.79-40.85.mp3
 ```
 
 
-你需要将数据集转为以上格式, 并放到 `data/demo` 下, 音频后缀可以为 `.mp3`, `.wav` 或 `.flac`, 标注文件后缀可以为 `.lab` 或 `.txt`.
+你需要将数据集转为以上格式, 并放到 `data/demo` 下, 音频后缀可以为 `.mp3`, `.wav` 或 `.flac`.
 
 ### 2. 分割训练集和验证集
 
 
@@ -38,7 +34,6 @@
 python tools/vqgan/create_train_split.py data/demo
 ```
 
-
 该命令会在 `data/demo` 目录下创建 `data/demo/vq_train_filelist.txt` 和 `data/demo/vq_val_filelist.txt` 文件, 分别用于训练和验证.  
 
 !!!info
@@ -100,10 +95,13 @@ python tools/vqgan/inference.py -i test.wav --checkpoint-path results/vqgan_fine
 ```bash
 huggingface-cli download fishaudio/speech-lm-v1 vqgan-v1.pth --local-dir checkpoints
 ```
-对于中国大陆用户,可使用mirror下载。
+
+对于中国大陆用户, 可使用 mirror 下载.
+
 ```bash
 HF_ENDPOINT=https://hf-mirror.com huggingface-cli download fishaudio/speech-lm-v1 vqgan-v1.pth --local-dir checkpoints
 ```
+
 随后可运行以下命令来提取语义 token:
 
 ```bash
@@ -178,11 +176,15 @@ data_server/target/release/data_server \
 ```bash
 huggingface-cli download fishaudio/speech-lm-v1 text2semantic-400m-v0.2-4k.pth --local-dir checkpoints
 ```
-对于中国大陆用户,可使用mirror下载。
+
+对于中国大陆用户, 可使用 mirror 下载.
+
 ```bash
 HF_ENDPOINT=https://hf-mirror.com huggingface-cli download fishaudio/speech-lm-v1 text2semantic-400m-v0.2-4k.pth --local-dir checkpoints
 ```
+
 最后, 你可以运行以下命令来启动微调:
+
 ```bash
 python fish_speech/train.py --config-name text2semantic_finetune_spk
 ```
@@ -190,7 +192,7 @@ python fish_speech/train.py --config-name text2semantic_finetune_spk
 !!! note
     你可以通过修改 `fish_speech/configs/text2semantic_finetune_spk.yaml` 来修改训练参数如 `batch_size`, `gradient_accumulation_steps` 等, 来适应你的显存.
 
-训练结束后, 你可以参考推理部分, 并携带 `--speaker SPK1` 参数来测试你的模型.
+训练结束后, 你可以参考 [推理](inference.md) 部分, 并携带 `--speaker SPK1` 参数来测试你的模型.
 
 !!! info
     默认配置下, 基本只会学到说话人的发音方式, 而不包含音色, 你依然需要使用 prompt 来保证音色的稳定性.  

+ 2 - 1
docs/zh/index.md

@@ -1,7 +1,7 @@
 # 介绍
 
 !!! warning
-    我们不对代码库的任何非法使用承担任何责任. 请参阅您当地关于 DMCA (数字千年法案) 和其他相关法律的法律.
+    我们不对代码库的任何非法使用承担任何责任. 请参阅您当地关于 DMCA (数字千年法案) 和其他相关法律法规.
 
 此代码库根据 `BSD-3-Clause` 许可证发布, 所有模型根据 CC-BY-NC-SA-4.0 许可证发布.
 
@@ -33,6 +33,7 @@ pip3 install -e .
 
 
 ## 更新日志
 
+- 2023/12/19: 更新了 Webui 和 HTTP API.
 - 2023/12/18: 更新了微调文档和相关例子.
 - 2023/12/17: 更新了 `text2semantic` 模型, 支持无音素模式.
 - 2023/12/13: 测试版发布, 包含 VQGAN 模型和一个基于 LLAMA 的语言模型 (只支持音素).

+ 1 - 1
docs/zh/inference.md

@@ -1,6 +1,6 @@
 # 推理
 
-计划中, 推理支持命令行, http api, 以及 webui 三种方式.  
+推理支持命令行, http api, 以及 webui 三种方式.  
 
 !!! note
     总的来说, 推理分为几个部分:  

+ 30 - 10
mkdocs.yml

@@ -1,14 +1,23 @@
 site_name: Fish Speech
+site_description: Targeting SOTA TTS solutions.
+site_url: https://speech.fish.audio
+
+# Repository
+repo_name: fishaudio/fish-speech
 repo_url: https://github.com/fishaudio/fish-speech
+edit_uri: blob/main/docs
+
+# Copyright
+copyright: Copyright © 2023-2024 by Fish Audio
 
 theme:
   name: material
   language: en
   features:
-    - navigation.instant
-    - navigation.instant.prefetch
+    - content.action.edit
+    - content.action.view
     - navigation.tracking
-    - navigation.tabs
+    # - navigation.tabs
     - search
     - search.suggest
     - search.highlight
@@ -44,13 +53,24 @@ theme:
   
   
 extra:
   homepage: https://speech.fish.audio
-  alternate:
-    - name: English
-      link: /en/ 
-      lang: en
-    - name: 中文
-      link: /zh/
-      lang: zh
+
+# Plugins
+plugins:
+  - search:
+      separator: '[\s\-,:!=\[\]()"`/]+|\.(?!\d)|&[lg]t;|(?!\b)(?=[A-Z][a-z])'
+      lang:
+        - zh
+        - en
+  - i18n:
+      docs_structure: folder
+      languages:
+        - locale: en
+          name: English
+          build: true
+        - locale: zh
+          default: true
+          name: 简体中文
+          build: true
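With `docs_structure: folder`, the mkdocs-static-i18n plugin expects one subfolder per locale, which matches the `docs/en/…` and `docs/zh/…` paths touched by this commit. A sketch of the implied layout (file names are illustrative):

```
docs/
├── en/        # locale: en  → served under /en/
│   ├── index.md
│   ├── finetune.md
│   └── inference.md
└── zh/        # locale: zh (default) → served at the site root
    ├── index.md
    ├── finetune.md
    └── inference.md
```

Because `zh` is marked `default: true`, the Chinese pages are built at the site root, which is why the README's 中文 link drops the `/zh/` prefix.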
 
 
 markdown_extensions:
   - pymdownx.highlight: