@@ -2,7 +2,63 @@
Obviously, when you opened this page, you were not satisfied with the performance of the few-shot pre-trained model. You want to fine-tune a model to improve its performance on your dataset.
-`Fish Speech` consists of two modules: `VQGAN` and `LLAMA`. Currently, we only support fine-tuning the `LLAMA` model.
+`Fish Speech` consists of two modules: `VQGAN` and `LLAMA`.
+
+!!! info
+    You should first run the following test to determine whether you need to fine-tune `VQGAN`:
+
+    ```bash
+    python tools/vqgan/inference.py -i test.wav
+    ```
+
+    This test generates a `fake.wav` file. If the timbre of this file differs from the speaker's original voice, or if the quality is low, you need to fine-tune `VQGAN`.
+
+    Similarly, you can refer to [Inference](inference.md) to run `generate.py` and evaluate whether the prosody meets your expectations. If it does not, you need to fine-tune `LLAMA`.
+
+## Fine-tuning VQGAN
+
+### 1. Prepare the Dataset
+
+```
+.
+├── SPK1
+│   ├── 21.15-26.44.mp3
+│   ├── 27.51-29.98.mp3
+│   └── 30.1-32.71.mp3
+└── SPK2
+    └── 38.79-40.85.mp3
+```
+
+You need to format your dataset as shown above and place it under `data/demo`. Audio files can have `.mp3`, `.wav`, or `.flac` extensions.
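As a quick sanity check, the layout above can be created and listed with a few shell commands. The speaker folders and clip names below are just the illustrative ones from the tree; any `.mp3`, `.wav`, or `.flac` files of your own work the same way.

```bash
# Sketch: recreate the example layout from the tree above.
mkdir -p data/demo/SPK1 data/demo/SPK2
touch data/demo/SPK1/21.15-26.44.mp3 \
      data/demo/SPK1/27.51-29.98.mp3 \
      data/demo/SPK1/30.1-32.71.mp3 \
      data/demo/SPK2/38.79-40.85.mp3
# One folder per speaker, audio clips inside each folder.
find data/demo -type f | sort
```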
+
+### 2. Split Training and Validation Sets
+
+```bash
+python tools/vqgan/create_train_split.py data/demo
+```
+
+This command creates `data/demo/vq_train_filelist.txt` and `data/demo/vq_val_filelist.txt`, used for training and validation respectively.
+
+!!! info
+    For the VITS format, you can specify a file list using `--filelist xxx.list`.
+    Please note that the audio files in the file list must also be located in the `data/demo` folder.
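If you are starting from a VITS-style dataset, the `.list` file is plain text with one utterance per line. The exact column layout depends on how your list was exported, so treat the `|`-separated `audio_path|transcript` layout below, the `demo.list` name, and the transcripts as illustrative assumptions rather than a required schema:

```bash
# Hypothetical VITS-style file list: one "audio_path|transcript" per line
# (file name, column layout, and transcripts are assumptions for illustration).
cat > demo.list <<'EOF'
data/demo/SPK1/21.15-26.44.mp3|First example transcript.
data/demo/SPK1/27.51-29.98.mp3|Second example transcript.
EOF
# Every audio path should point inside data/demo, as required above.
cut -d '|' -f 1 demo.list
```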
+
+### 3. Start Training
+
+```bash
+python fish_speech/train.py --config-name vqgan_finetune
+```
+
+!!! note
+    You can modify training parameters by editing `fish_speech/configs/vqgan_finetune.yaml`, but in most cases this won't be necessary.
+
+### 4. Test the Audio
+
+```bash
+python tools/vqgan/inference.py -i test.wav --checkpoint-path results/vqgan_finetune/checkpoints/step_000010000.ckpt
+```
+
+You can review `fake.wav` to assess the fine-tuning results.
+
+!!! note
+    You may also try other checkpoints. We suggest using the earliest checkpoint that meets your requirements, as earlier checkpoints often perform better on out-of-distribution (OOD) data.

## Fine-tuning LLAMA

### 1. Prepare the dataset

@@ -26,7 +82,7 @@ You need to convert your dataset into the above format and place it under `data/

!!! note
    You can modify the dataset path and mix datasets by modifying `fish_speech/configs/data/finetune.yaml`.

-### 2. Batch-wise extraction of semantic tokens
+### 2. Batch extraction of semantic tokens

Make sure you have downloaded the VQGAN weights. If not, run the following command:

@@ -76,6 +132,10 @@ python tools/llama/build_dataset.py \
After the command finishes executing, you should see the `quantized-dataset-ft.protos` file in the `data` directory.
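A quick way to confirm the build step succeeded is to check for that file (the path is the one named above; the echoed messages are just illustrative):

```bash
# Verify the dataset build output exists before moving on.
if [ -f data/quantized-dataset-ft.protos ]; then
    echo "dataset ready"
else
    echo "missing: re-run tools/llama/build_dataset.py"
fi
```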
+!!! info
+    For the VITS format, you can specify a file list using `--filelist xxx.list`.
+    Please note that the audio files referenced in the file list must also be located in the `data/demo` folder.
+
### 4. Start the Rust data server
Loading and shuffling the dataset is slow and memory-intensive, so we use a Rust server to load and shuffle the data. The server is based on gRPC and can be installed as follows:
@@ -112,7 +172,7 @@ python fish_speech/train.py --config-name text2semantic_finetune_spk

!!! note
    You can modify training parameters such as `batch_size` and `gradient_accumulation_steps` to fit your GPU memory by editing `fish_speech/configs/text2semantic_finetune_spk.yaml`.
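When adjusting those two values, it helps to keep the effective batch size (`batch_size` × `gradient_accumulation_steps`) roughly constant: halving one while doubling the other reduces GPU memory use without changing the optimization much. The concrete numbers below are illustrative, not values from the config:

```bash
# Illustrative numbers only: halving batch_size while doubling
# gradient_accumulation_steps preserves the effective batch size.
batch_size=8
gradient_accumulation_steps=2
echo $((batch_size * gradient_accumulation_steps))   # effective batch size

batch_size=4
gradient_accumulation_steps=4
echo $((batch_size * gradient_accumulation_steps))   # same effective batch size, less memory
```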
-After training is complete, you can refer to the inference section to generate speech.
+After training is complete, you can refer to the [inference](inference.md) section and use `--speaker SPK1` to generate speech.
!!! info
    By default, the model will only learn the speaker's speech patterns and not the timbre. You still need to use prompts to ensure timbre stability.