
Optimize docs

Lengyue · 2 years ago
parent commit c0585bff0f
12 changed files with 142 additions and 36 deletions
  1. README.md (+5 -2)
  2. docs/en/finetune.md (+63 -3)
  3. docs/en/index.md (+1 -0)
  4. docs/en/inference.md (+25 -1)
  5. docs/en/samples.md (+1 -1)
  6. docs/index.md (+0 -4)
  7. docs/requirements.txt (+2 -0)
  8. docs/stylesheets/extra.css (+0 -3)
  9. docs/zh/finetune.md (+12 -10)
  10. docs/zh/index.md (+2 -1)
  11. docs/zh/inference.md (+1 -1)
  12. mkdocs.yml (+30 -10)

+ 5 - 2
README.md

@@ -6,12 +6,15 @@ This codebase is released under BSD-3-Clause License, and all models are release
 
 ## Disclaimer / 免责声明
 We do not hold any responsibility for any illegal usage of the codebase. Please refer to your local laws about DMCA and other related laws.  
-我们不对代码库的任何非法使用承担任何责任. 请参阅您当地关于 DMCA (数字千年法案) 和其他相关法律的法律.
+我们不对代码库的任何非法使用承担任何责任. 请参阅您当地关于 DMCA (数字千年法案) 和其他相关法律法规.
 
 ## Documents / 文档
 - [English](https://speech.fish.audio/en/)
-- [中文](https://speech.fish.audio/zh/)
+- [中文](https://speech.fish.audio/)
 
+## Samples / 例子
+- [English](https://speech.fish.audio/en/samples/)
+- [中文](https://speech.fish.audio/samples/)
 
 ## Credits / 鸣谢
 - [VITS2 (daniilrobnikov)](https://github.com/daniilrobnikov/vits2)

+ 63 - 3
docs/en/finetune.md

@@ -2,7 +2,63 @@
 
 Obviously, when you opened this page, you were not satisfied with the performance of the few-shot pre-trained model. You want to fine-tune a model to improve its performance on your dataset.
 
-`Fish Speech` consists of two modules: `VQGAN` and `LLAMA`. Currently, we only support fine-tuning the `LLAMA` model.
+`Fish Speech` consists of two modules: `VQGAN` and `LLAMA`.
+
+!!! info 
+    You should first conduct the following test to determine if you need to fine-tune `VQGAN`:
+    ```bash
+    python tools/vqgan/inference.py -i test.wav
+    ```
+    This test will generate a `fake.wav` file. If the timbre of this file differs from the speaker's original voice, or if the quality is not high, you need to fine-tune `VQGAN`.
+
+    Similarly, you can refer to [Inference](inference.md) to run `generate.py` and evaluate if the prosody meets your expectations. If it does not, then you need to fine-tune `LLAMA`.
+
+## Fine-tuning VQGAN
+### 1. Prepare the Dataset
+
+```
+.
+├── SPK1
+│   ├── 21.15-26.44.mp3
+│   ├── 27.51-29.98.mp3
+│   └── 30.1-32.71.mp3
+└── SPK2
+    └── 38.79-40.85.mp3
+```
+
+You need to format your dataset as shown above and place it under `data/demo`. Audio files can have `.mp3`, `.wav`, or `.flac` extensions.
+
+### 2. Split Training and Validation Sets
+
+```bash
+python tools/vqgan/create_train_split.py data/demo
+```
+
+This command will create `data/demo/vq_train_filelist.txt` and `data/demo/vq_val_filelist.txt` in the `data/demo` directory, to be used for training and validation respectively.
+
+!!! info
+    For the VITS format, you can specify a file list using `--filelist xxx.list`.
+    Please note that the audio files in `filelist` must also be located in the `data/demo` folder.
+
+### 3. Start Training
+
+```bash
+python fish_speech/train.py --config-name vqgan_finetune
+```
+
+!!! note
+    You can modify training parameters by editing `fish_speech/configs/vqgan_finetune.yaml`, but in most cases, this won't be necessary.
+
+### 4. Test the Audio
+    
+```bash
+python tools/vqgan/inference.py -i test.wav --checkpoint-path results/vqgan_finetune/checkpoints/step_000010000.ckpt
+```
+
+You can review `fake.wav` to assess the fine-tuning results.
+
+!!! note
+    You may also try other checkpoints. We suggest using the earliest checkpoint that meets your requirements, as they often perform better on out-of-distribution (OOD) data.
 
 ## Fine-tuning LLAMA
 ### 1. Prepare the dataset
@@ -26,7 +82,7 @@ You need to convert your dataset into the above format and place it under `data/
 !!! note
     You can modify the dataset path and mix datasets by modifying `fish_speech/configs/data/finetune.yaml`.
 
-### 2. Batch-wise extraction of semantic tokens
+### 2. Batch extraction of semantic tokens
 
 Make sure you have downloaded the VQGAN weights. If not, run the following command:
 
@@ -76,6 +132,10 @@ python tools/llama/build_dataset.py \
 
 After the command finishes executing, you should see the `quantized-dataset-ft.protos` file in the `data` directory.
 
+!!! info
+    For the VITS format, you can specify a file list using `--filelist xxx.list`.
+    Please note that the audio files referenced in `filelist` must also be located in the `data/demo` folder.
+
 ### 4. Start the Rust data server
 
 Loading and shuffling the dataset is very slow and memory-consuming. Therefore, we use a Rust server to load and shuffle the data. This server is based on GRPC and can be installed using the following method:
@@ -112,7 +172,7 @@ python fish_speech/train.py --config-name text2semantic_finetune_spk
 !!! note
     You can modify the training parameters such as `batch_size`, `gradient_accumulation_steps`, etc. to fit your GPU memory by modifying `fish_speech/configs/text2semantic_finetune_spk.yaml`.
 
-After training is complete, you can refer to the inference section to generate speech.
+After training is complete, you can refer to the [inference](inference.md) section, and use `--speaker SPK1` to generate speech.
 
 !!! info
     By default, the model will only learn the speaker's speech patterns and not the timbre. You still need to use prompts to ensure timbre stability.
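The train/validation split created by `tools/vqgan/create_train_split.py` (step 2 of the VQGAN guide above) can be sketched in a few lines of Python. This is an illustrative approximation of what the script produces, not the script itself; its actual split ratio, shuffling, and CLI flags may differ:

```python
import random
from pathlib import Path

def create_train_split(root, val_ratio=0.1, seed=42):
    """Collect audio files under `root` and write train/val filelists.

    Mirrors the documented output of tools/vqgan/create_train_split.py:
    vq_train_filelist.txt and vq_val_filelist.txt, one audio path per line.
    The ratio and seed here are assumptions for illustration.
    """
    root = Path(root)
    # The docs list .mp3, .wav, and .flac as supported extensions.
    files = sorted(
        p for p in root.rglob("*")
        if p.suffix in {".mp3", ".wav", ".flac"}
    )
    random.Random(seed).shuffle(files)
    n_val = max(1, int(len(files) * val_ratio))
    val, train = files[:n_val], files[n_val:]
    (root / "vq_train_filelist.txt").write_text("\n".join(map(str, train)))
    (root / "vq_val_filelist.txt").write_text("\n".join(map(str, val)))
    return len(train), len(val)
```

Pointing this at a `data/demo` tree like the one shown in step 1 yields the two filelists that training and validation read from.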

+ 1 - 0
docs/en/index.md

@@ -33,6 +33,7 @@ pip3 install -e .
 
 ## Changelog
 
+- 2023/12/19: Updated webui and HTTP API.
 - 2023/12/18: Updated fine-tuning documentation and related examples.
 - 2023/12/17: Updated `text2semantic` model, supporting phoneme-free mode.
 - 2023/12/13: Beta version released, includes VQGAN model and a language model based on LLAMA (phoneme support only).

+ 25 - 1
docs/en/inference.md

@@ -1,6 +1,6 @@
 # Inference
 
-In the plan, inference is expected to support both command line and webui methods, but currently, only the command-line reasoning function has been completed.  
+Inference supports the command line, an HTTP API, and a WebUI.
 
 !!! note
     Overall, inference consists of several parts:
@@ -57,3 +57,27 @@ python tools/vqgan/inference.py \
     -i "codes_0.npy" \
     --checkpoint-path "checkpoints/vqgan-v1.pth"
 ```
+
+## HTTP API Inference
+
+We provide an HTTP API for inference. You can start the server with the following command:
+
+```bash
+python -m zibai tools.api_server:app --listen 127.0.0.1:8000
+```
+
+After that, you can view and test the API at http://127.0.0.1:8000/docs.  
+
+Generally, you first need to call `PUT /v1/models/default` to load the model, and then use `POST /v1/models/default/invoke` for inference. For specific parameters, please refer to the API documentation.
+
+## WebUI Inference
+
+Before running the WebUI, you need to start the HTTP service as described above.
+
+Then you can start the WebUI using the following command:
+
+```bash
+python fish_speech/webui/app.py
+```
+
+Enjoy!
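With the server from the `zibai` command above running, the two documented calls (`PUT /v1/models/default`, then `POST /v1/models/default/invoke`) can be scripted from the standard library. This is a hedged sketch: the endpoints come from the docs, but the `"text"` payload field is an assumption; the real request schema is on the server's `/docs` page.

```python
import json
import urllib.request

API_BASE = "http://127.0.0.1:8000"  # address passed to --listen above

def build_request(method: str, path: str, payload=None) -> urllib.request.Request:
    """Build a JSON request for the inference server.

    Payload field names are hypothetical; consult the interactive docs
    at http://127.0.0.1:8000/docs for the real schema.
    """
    data = json.dumps(payload).encode() if payload is not None else None
    return urllib.request.Request(
        API_BASE + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    # 1. Load the default model.
    urllib.request.urlopen(build_request("PUT", "/v1/models/default"))
    # 2. Run inference ("text" is an assumed field name).
    with urllib.request.urlopen(
        build_request("POST", "/v1/models/default/invoke", {"text": "Hello"})
    ) as resp:
        print(resp.read())
```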

+ 1 - 1
docs/en/samples.md

@@ -1,4 +1,4 @@
-# Example
+# Samples
 
 !!! note
     Due to insufficient Japanese to English training data, we first phonemicize the text and then use it for generation.

+ 0 - 4
docs/index.md

@@ -1,4 +0,0 @@
----
-template: redirect.html
-location: ./zh/
----

+ 2 - 0
docs/requirements.txt

@@ -1 +1,3 @@
 mkdocs-material
+mkdocs-static-i18n[material]
+mkdocs[i18n]

+ 0 - 3
docs/stylesheets/extra.css

@@ -1,6 +1,3 @@
 .md-grid {
   max-width: 1440px; 
 }
-.md-tabs {
-  display: none;
-}

+ 12 - 10
docs/zh/finetune.md

@@ -11,7 +11,7 @@
     ```
     该测试会生成一个 `fake.wav` 文件, 如果该文件的音色和说话人的音色不同, 或者质量不高, 你需要微调 `VQGAN`.
 
-    相应的, 你可以参考 [推理](../inference/) 来运行 `generate.py`, 判断韵律是否满意, 如果不满意, 则需要微调 `LLAMA`.
+    相应的, 你可以参考 [推理](inference.md) 来运行 `generate.py`, 判断韵律是否满意, 如果不满意, 则需要微调 `LLAMA`.
 
 ## VQGAN 微调
 ### 1. 准备数据集
@@ -19,18 +19,14 @@
 ```
 .
 ├── SPK1
-│   ├── 21.15-26.44.lab
 │   ├── 21.15-26.44.mp3
-│   ├── 27.51-29.98.lab
 │   ├── 27.51-29.98.mp3
-│   ├── 30.1-32.71.lab
 │   └── 30.1-32.71.mp3
 └── SPK2
-    ├── 38.79-40.85.lab
     └── 38.79-40.85.mp3
 ```
 
-你需要将数据集转为以上格式, 并放到 `data/demo` 下, 音频后缀可以为 `.mp3`, `.wav` 或 `.flac`, 标注文件后缀可以为 `.lab` 或 `.txt`.
+你需要将数据集转为以上格式, 并放到 `data/demo` 下, 音频后缀可以为 `.mp3`, `.wav` 或 `.flac`.
 
 ### 2. 分割训练集和验证集
 
@@ -38,7 +34,6 @@
 python tools/vqgan/create_train_split.py data/demo
 ```
 
-
 该命令会在 `data/demo` 目录下创建 `data/demo/vq_train_filelist.txt` 和 `data/demo/vq_val_filelist.txt` 文件, 分别用于训练和验证.  
 
 !!! info
@@ -100,10 +95,13 @@ python tools/vqgan/inference.py -i test.wav --checkpoint-path results/vqgan_fine
 ```bash
 huggingface-cli download fishaudio/speech-lm-v1 vqgan-v1.pth --local-dir checkpoints
 ```
-对于中国大陆用户,可使用mirror下载。
+
+对于中国大陆用户, 可使用 mirror 下载.
+
 ```bash
 HF_ENDPOINT=https://hf-mirror.com huggingface-cli download fishaudio/speech-lm-v1 vqgan-v1.pth --local-dir checkpoints
 ```
+
 随后可运行以下命令来提取语义 token:
 
 ```bash
@@ -178,11 +176,15 @@ data_server/target/release/data_server \
 ```bash
 huggingface-cli download fishaudio/speech-lm-v1 text2semantic-400m-v0.2-4k.pth --local-dir checkpoints
 ```
-对于中国大陆用户,可使用mirror下载。
+
+对于中国大陆用户, 可使用 mirror 下载.
+
 ```bash
 HF_ENDPOINT=https://hf-mirror.com huggingface-cli download fishaudio/speech-lm-v1 text2semantic-400m-v0.2-4k.pth --local-dir checkpoints
 ```
+
 最后, 你可以运行以下命令来启动微调:
+
 ```bash
 python fish_speech/train.py --config-name text2semantic_finetune_spk
 ```
@@ -190,7 +192,7 @@ python fish_speech/train.py --config-name text2semantic_finetune_spk
 !!! note
     你可以通过修改 `fish_speech/configs/text2semantic_finetune_spk.yaml` 来修改训练参数如 `batch_size`, `gradient_accumulation_steps` 等, 来适应你的显存.
 
-训练结束后, 你可以参考推理部分, 并携带 `--speaker SPK1` 参数来测试你的模型.
+训练结束后, 你可以参考 [推理](inference.md) 部分, 并携带 `--speaker SPK1` 参数来测试你的模型.
 
 !!! info
     默认配置下, 基本只会学到说话人的发音方式, 而不包含音色, 你依然需要使用 prompt 来保证音色的稳定性.  

+ 2 - 1
docs/zh/index.md

@@ -1,7 +1,7 @@
 # 介绍
 
 !!! warning
-    我们不对代码库的任何非法使用承担任何责任. 请参阅您当地关于 DMCA (数字千年法案) 和其他相关法律的法律.
+    我们不对代码库的任何非法使用承担任何责任. 请参阅您当地关于 DMCA (数字千年法案) 和其他相关法律法规.
 
 此代码库根据 `BSD-3-Clause` 许可证发布, 所有模型根据 CC-BY-NC-SA-4.0 许可证发布.
 
@@ -33,6 +33,7 @@ pip3 install -e .
 
 ## 更新日志
 
+- 2023/12/19: 更新了 Webui 和 HTTP API.
 - 2023/12/18: 更新了微调文档和相关例子.
 - 2023/12/17: 更新了 `text2semantic` 模型, 支持无音素模式.
 - 2023/12/13: 测试版发布, 包含 VQGAN 模型和一个基于 LLAMA 的语言模型 (只支持音素).

+ 1 - 1
docs/zh/inference.md

@@ -1,6 +1,6 @@
 # 推理
 
-计划中, 推理支持命令行, http api, 以及 webui 三种方式.  
+推理支持命令行, http api, 以及 webui 三种方式.  
 
 !!! note
     总的来说, 推理分为几个部分:  

+ 30 - 10
mkdocs.yml

@@ -1,14 +1,23 @@
 site_name: Fish Speech
+site_description: Targeting SOTA TTS solutions.
+site_url: https://speech.fish.audio
+
+# Repository
+repo_name: fishaudio/fish-speech
 repo_url: https://github.com/fishaudio/fish-speech
+edit_uri: blob/main/docs
+
+# Copyright
+copyright: Copyright © 2023-2024 by Fish Audio
 
 theme:
   name: material
   language: en
   features:
-    - navigation.instant
-    - navigation.instant.prefetch
+    - content.action.edit
+    - content.action.view
     - navigation.tracking
-    - navigation.tabs
+    # - navigation.tabs
     - search
     - search.suggest
     - search.highlight
@@ -44,13 +53,24 @@ theme:
   
 extra:
   homepage: https://speech.fish.audio
-  alternate:
-    - name: English
-      link: /en/ 
-      lang: en
-    - name: 中文
-      link: /zh/
-      lang: zh
+
+# Plugins
+plugins:
+  - search:
+      separator: '[\s\-,:!=\[\]()"`/]+|\.(?!\d)|&[lg]t;|(?!\b)(?=[A-Z][a-z])'
+      lang:
+        - zh
+        - en
+  - i18n:
+      docs_structure: folder
+      languages:
+        - locale: en
+          name: English
+          build: true
+        - locale: zh
+          default: true
+          name: 简体中文
+          build: true
 
 markdown_extensions:
   - pymdownx.highlight:
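The search `separator` regex added to `mkdocs.yml` above can be sanity-checked outside MkDocs. Below is a sketch using Python's `re` module; note that Material's search worker tokenizes in JavaScript, so edge-case behavior may differ slightly:

```python
import re

# Same separator as in mkdocs.yml: splits on whitespace/punctuation,
# on dots not followed by a digit (so "v1.2" survives), on the HTML
# entities &lt;/&gt;, and at camelCase boundaries inside a word.
SEPARATOR = r'[\s\-,:!=\[\]()"`/]+|\.(?!\d)|&[lg]t;|(?!\b)(?=[A-Z][a-z])'

def tokenize(text: str) -> list[str]:
    # re.split keeps empty strings around zero-width matches; drop them.
    return [t for t in re.split(SEPARATOR, text) if t]

print(tokenize("fish-speech v1.2 FineTune"))
print(tokenize("docs/en/finetune.md"))
```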