@@ -2,7 +2,63 @@
Obviously, when you opened this page, you were not satisfied with the performance of the few-shot pre-trained model. You want to fine-tune a model to improve its performance on your dataset.
-`Fish Speech` consists of two modules: `VQGAN` and `LLAMA`. Currently, we only support fine-tuning the `LLAMA` model.
+`Fish Speech` consists of two modules: `VQGAN` and `LLAMA`.
+
+!!! info
+    You should first run the following test to determine whether you need to fine-tune `VQGAN`:
+
+    ```bash
+    python tools/vqgan/inference.py -i test.wav
+    ```
+
+    This test generates a `fake.wav` file. If the timbre of this file differs from the speaker's original voice, or if the quality is low, you need to fine-tune `VQGAN`.
+
+    Similarly, you can refer to [Inference](inference.md) to run `generate.py` and evaluate whether the prosody meets your expectations. If it does not, you need to fine-tune `LLAMA`.
+
+## Fine-tuning VQGAN
+
+### 1. Prepare the Dataset
+
+```
+.
+├── SPK1
+│   ├── 21.15-26.44.mp3
+│   ├── 27.51-29.98.mp3
+│   └── 30.1-32.71.mp3
+└── SPK2
+    └── 38.79-40.85.mp3
+```
+
+You need to format your dataset as shown above and place it under `data/demo`. Audio files can have `.mp3`, `.wav`, or `.flac` extensions.
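As a quick sanity check, the layout above can be created and listed with a few shell commands. The speaker folders and clip names below are just the illustrative ones from the tree; any `.mp3`, `.wav`, or `.flac` files of your own work the same way.

```bash
# Sketch: recreate the example layout from the tree above.
mkdir -p data/demo/SPK1 data/demo/SPK2
touch data/demo/SPK1/21.15-26.44.mp3 \
      data/demo/SPK1/27.51-29.98.mp3 \
      data/demo/SPK1/30.1-32.71.mp3 \
      data/demo/SPK2/38.79-40.85.mp3
# One folder per speaker, audio clips inside each folder.
find data/demo -type f | sort
```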
+
+### 2. Split Training and Validation Sets
+
+```bash
+python tools/vqgan/create_train_split.py data/demo
+```
+
+This command creates `data/demo/vq_train_filelist.txt` and `data/demo/vq_val_filelist.txt`, used for training and validation respectively.
+
+!!! info
+    For the VITS format, you can specify a file list using `--filelist xxx.list`.
+    Please note that the audio files in the file list must also be located in the `data/demo` folder.
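If you are starting from a VITS-style dataset, the `.list` file is plain text with one utterance per line. The exact column layout depends on how your list was exported, so treat the `|`-separated `audio_path|transcript` layout below, the `demo.list` name, and the transcripts as illustrative assumptions rather than a required schema:

```bash
# Hypothetical VITS-style file list: one "audio_path|transcript" per line
# (file name, column layout, and transcripts are assumptions for illustration).
cat > demo.list <<'EOF'
data/demo/SPK1/21.15-26.44.mp3|First example transcript.
data/demo/SPK1/27.51-29.98.mp3|Second example transcript.
EOF
# Every audio path should point inside data/demo, as required above.
cut -d '|' -f 1 demo.list
```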
+
+### 3. Start Training
+
+```bash
+python fish_speech/train.py --config-name vqgan_finetune
+```
+
+!!! note
+    You can modify training parameters by editing `fish_speech/configs/vqgan_finetune.yaml`, but in most cases this won't be necessary.
+
+### 4. Test the Audio
+
+```bash
+python tools/vqgan/inference.py -i test.wav --checkpoint-path results/vqgan_finetune/checkpoints/step_000010000.ckpt
+```
+
+You can review `fake.wav` to assess the fine-tuning results.
+
+!!! note
+    You may also try other checkpoints. We suggest using the earliest checkpoint that meets your requirements, as earlier checkpoints often perform better on out-of-distribution (OOD) data.

## Fine-tuning LLAMA

### 1. Prepare the dataset

@@ -26,7 +82,7 @@ You need to convert your dataset into the above format and place it under `data/

!!! note
    You can modify the dataset path and mix datasets by modifying `fish_speech/configs/data/finetune.yaml`.

-### 2. Batch-wise extraction of semantic tokens
+### 2. Batch extraction of semantic tokens

Make sure you have downloaded the VQGAN weights. If not, run the following command:

@@ -76,6 +132,10 @@ python tools/llama/build_dataset.py \
After the command finishes executing, you should see the `quantized-dataset-ft.protos` file in the `data` directory.
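A quick way to confirm the build step succeeded is to check for that file (the path is the one named above; the echoed messages are just illustrative):

```bash
# Verify the dataset build output exists before moving on.
if [ -f data/quantized-dataset-ft.protos ]; then
    echo "dataset ready"
else
    echo "missing: re-run tools/llama/build_dataset.py"
fi
```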
+!!! info
+    For the VITS format, you can specify a file list using `--filelist xxx.list`.
+    Please note that the audio files referenced in the file list must also be located in the `data/demo` folder.
+
### 4. Start the Rust data server
Loading and shuffling the dataset is slow and memory-intensive, so we use a Rust server to load and shuffle the data. The server is based on gRPC and can be installed as follows:
@@ -112,7 +172,7 @@ python fish_speech/train.py --config-name text2semantic_finetune_spk

!!! note
    You can modify training parameters such as `batch_size` and `gradient_accumulation_steps` to fit your GPU memory by editing `fish_speech/configs/text2semantic_finetune_spk.yaml`.
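When adjusting those two values, it helps to keep the effective batch size (`batch_size` × `gradient_accumulation_steps`) roughly constant: halving one while doubling the other reduces GPU memory use without changing the optimization much. The concrete numbers below are illustrative, not values from the config:

```bash
# Illustrative numbers only: halving batch_size while doubling
# gradient_accumulation_steps preserves the effective batch size.
batch_size=8
gradient_accumulation_steps=2
echo $((batch_size * gradient_accumulation_steps))   # effective batch size

batch_size=4
gradient_accumulation_steps=4
echo $((batch_size * gradient_accumulation_steps))   # same effective batch size, less memory
```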
-After training is complete, you can refer to the inference section to generate speech.
+After training is complete, you can refer to the [inference](inference.md) section and use `--speaker SPK1` to generate speech.
!!! info
    By default, the model will only learn the speaker's speech patterns and not the timbre. You still need to use prompts to ensure timbre stability.