Obviously, when you opened this page, you were not satisfied with the performance of the few-shot pre-trained model. You want to fine-tune a model to improve its performance on your dataset.

In the current version, you only need to fine-tune the `LLAMA` part.

## Fine-tuning LLAMA

### 1. Prepare the dataset

Make sure you have downloaded the VQGAN weights. If not, run the following command:

```bash
huggingface-cli download fishaudio/fish-speech-1.2 firefly-gan-vq-fsq-4x1024-42hz-generator.pth --local-dir checkpoints
```

You can then run the following command to extract semantic tokens:

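Conceptually, this step encodes each audio clip into an array of integer codebook indices and saves it as a `.npy` file next to the audio. The sketch below is illustrative only: `extract_semantic_tokens` and its `encode` callback are hypothetical stand-ins for the repository's extraction tool and VQGAN encoder.

```python
from pathlib import Path

import numpy as np


def extract_semantic_tokens(wav_dir, encode):
    """Encode every .wav under wav_dir and save its token ids as a sibling .npy file."""
    written = []
    for wav in sorted(Path(wav_dir).rglob("*.wav")):
        tokens = np.asarray(encode(wav), dtype=np.int64)  # codebook indices per frame
        out = wav.with_suffix(".npy")
        np.save(out, tokens)
        written.append(out)
    return written
```

In practice the repository's extraction script performs this over the entire `data` directory; the `.npy` token files are what the LLAMA stage trains on.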
By default, the model will only learn the speaker's speech patterns and not the timbre. You still need to use prompts to ensure timbre stability.

If you want to learn the timbre, you can increase the number of training steps, but this may lead to overfitting.

#### Fine-tuning with LoRA (recommended)
!!! note
    LoRA can reduce the risk of overfitting in models, but it may also lead to underfitting on large datasets.
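For intuition, here is a minimal numpy sketch of the LoRA idea (illustrative only, not Fish Speech's implementation): the pretrained weight stays frozen, and only a small low-rank pair of factors is trained on top of it, which is why far fewer parameters can overfit.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 128, 8

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, r))                # B starts at zero -> adapter initially a no-op


def lora_forward(x):
    # base path + low-rank adapter path: W x + B (A x)
    return W @ x + B @ (A @ x)


x = rng.normal(size=d_in)
assert np.allclose(lora_forward(x), W @ x)  # with B = 0, output matches the base model
```

Here only `r * (d_in + d_out) = 1536` parameters are trainable instead of the full `d_out * d_in = 8192`, which is the mechanism behind the note above: less capacity to overfit, but possibly too little for a large dataset.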
```bash
python tools/llama/merge_lora.py \
    --output checkpoints/merged.ckpt
```
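Merging folds the trained low-rank update back into the base weight, so inference needs no adapter code and runs at the original cost. A numpy sketch of the arithmetic (illustrative only, not the script's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 128, 8

W = rng.normal(size=(d_out, d_in))  # frozen base weight
A = rng.normal(size=(r, d_in))      # trained low-rank factors
B = rng.normal(size=(d_out, r))

W_merged = W + B @ A                # merge: a single dense matrix again

x = rng.normal(size=d_in)
adapter_out = W @ x + B @ (A @ x)   # base + adapter, as during training
assert np.allclose(W_merged @ x, adapter_out)  # merged weight gives the same output
```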
!!! note
    You may also try other checkpoints. We suggest using the earliest checkpoint that meets your requirements, as they often perform better on out-of-distribution (OOD) data.