Commit
Showing 9 changed files with 320 additions and 179 deletions.
@@ -23,3 +23,4 @@ asr-label*
 /.cache
 /fishenv
 /.locale
+/demo-audios
@@ -2,65 +2,22 @@
 
 Obviously, when you opened this page, you were not satisfied with the performance of the few-shot pre-trained model. You want to fine-tune a model to improve its performance on your dataset.
 
-`Fish Speech` consists of three modules: `VQGAN`, `LLAMA`and `VITS`.
+`Fish Speech` consists of three modules: `VQGAN`, `LLAMA`, and `VITS`.
 
 !!! info
-    You should first conduct the following test to determine if you need to fine-tune `VQGAN`:
+    You should first conduct the following test to determine if you need to fine-tune `VITS Decoder`:
     ```bash
-    python tools/vqgan/inference.py -i test.wav
+    python tools/vits_decoder/inference.py \
+        -ckpt checkpoints/vits_decoder_v1.1.ckpt \
+        -i fake.npy -r test.wav \
+        --text "The text you want to generate"
    ```
-    This test will generate a `fake.wav` file. If the timbre of this file differs from the speaker's original voice, or if the quality is not high, you need to fine-tune `VQGAN`.
+    This test will generate a `fake.wav` file. If the timbre of this file differs from the speaker's original voice, or if the quality is not high, you need to fine-tune `VITS Decoder`.
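As the removed command above suggests, the `fake.npy` prompt is produced by VQGAN inference; a minimal sketch for generating it first, assuming the default output names `fake.npy` and `fake.wav` used throughout these docs:

```bash
# Encode and decode the reference clip; this writes fake.npy (semantic
# tokens) and fake.wav into the working directory.
python tools/vqgan/inference.py -i test.wav
```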
 
 Similarly, you can refer to [Inference](inference.md) to run `generate.py` and evaluate if the prosody meets your expectations. If it does not, then you need to fine-tune `LLAMA`.
 
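A hedged sketch of that prosody check; the flags are assumed from the v1.1 inference docs, and the checkpoint path is the medium v1.1 model referenced later on this page, so adjust both to your setup:

```bash
# Generate semantic tokens for a test sentence; decode and listen to
# judge whether the prosody is acceptable before fine-tuning LLAMA.
python tools/llama/generate.py \
    --text "The text you want to convert" \
    --checkpoint-path checkpoints/text2semantic-sft-medium-v1.1-4k.pth \
    --num-samples 2
```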
-It is recommended to fine-tune the LLAMA and VITS model first, then fine-tune the `VQGAN` according to your needs.
-
-## Fine-tuning VQGAN
-### 1. Prepare the Dataset
-
-```
-.
-├── SPK1
-│   ├── 21.15-26.44.mp3
-│   ├── 27.51-29.98.mp3
-│   └── 30.1-32.71.mp3
-└── SPK2
-    └── 38.79-40.85.mp3
-```
-
-You need to format your dataset as shown above and place it under `data`. Audio files can have `.mp3`, `.wav`, or `.flac` extensions.
-
-### 2. Split Training and Validation Sets
-
-```bash
-python tools/vqgan/create_train_split.py data
-```
-
-This command will create `data/vq_train_filelist.txt` and `data/vq_val_filelist.txt` in the `data/demo` directory, to be used for training and validation respectively.
-
-!!!info
-    For the VITS format, you can specify a file list using `--filelist xxx.list`.
-    Please note that the audio files in `filelist` must also be located in the `data` folder.
-
-### 3. Start Training
-
-```bash
-python fish_speech/train.py --config-name vqgan_finetune
-```
-
-!!! note
-    You can modify training parameters by editing `fish_speech/configs/vqgan_finetune.yaml`, but in most cases, this won't be necessary.
-
-### 4. Test the Audio
-
-```bash
-python tools/vqgan/inference.py -i test.wav --checkpoint-path results/vqgan_finetune/checkpoints/step_000010000.ckpt
-```
-
-You can review `fake.wav` to assess the fine-tuning results.
-
-!!! note
-    You may also try other checkpoints. We suggest using the earliest checkpoint that meets your requirements, as they often perform better on out-of-distribution (OOD) data.
+It is recommended to fine-tune the LLAMA first, then fine-tune the `VITS Decoder` according to your needs.
 
 ## Fine-tuning LLAMA
 ### 1. Prepare the dataset
@@ -168,8 +125,27 @@ After training is complete, you can refer to the [inference](inference.md) secti
 By default, the model will only learn the speaker's speech patterns and not the timbre. You still need to use prompts to ensure timbre stability.
 If you want to learn the timbre, you can increase the number of training steps, but this may lead to overfitting.
 
-## Fine-tuning VITS
-### 1. Prepare the dataset
+#### Fine-tuning with LoRA
+
+!!! note
+    LoRA can reduce the risk of overfitting in models, but it may also lead to underfitting on large datasets.
+
+If you want to use LoRA, please add the following parameter: `[email protected]_config=r_8_alpha_16`.
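For example, appended to a LLAMA fine-tuning run; the config name `text2semantic_finetune_medium` is an assumption based on the medium v1.1 checkpoints used below, so substitute your own:

```bash
# Assumed config name; the trailing override is the documented LoRA parameter.
python fish_speech/train.py --config-name text2semantic_finetune_medium \
    [email protected]_config=r_8_alpha_16
```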
+
+After training, you need to convert the LoRA weights to regular weights before performing inference.
+
+```bash
+python tools/llama/merge_lora.py \
+    --llama-config dual_ar_2_codebook_medium \
+    --lora-config r_8_alpha_16 \
+    --llama-weight checkpoints/text2semantic-sft-medium-v1.1-4k.pth \
+    --lora-weight results/text2semantic-finetune-medium-lora/checkpoints/step_000000200.ckpt \
+    --output checkpoints/merged.ckpt
+```
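Once merged, `checkpoints/merged.ckpt` is the checkpoint to point inference at, in place of the original SFT weights.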
+
+## Fine-tuning VITS Decoder
+### 1. Prepare the Dataset
 
 ```
 .
@@ -184,32 +160,92 @@ After training is complete, you can refer to the [inference](inference.md) secti
 ├── 38.79-40.85.lab
 └── 38.79-40.85.mp3
 ```
 
 !!! note
-    The fine-tuning for VITS only support the .lab format files, please don't use .list file!
+    VITS fine-tuning currently only supports `.lab` as the label file and does not support the `filelist` format.
 
-You need to convert the dataset to the format above, and move them to the `data` , the suffix of the files can be `.mp3`, `.wav` 或 `.flac`, the label files' suffix are recommended to be `.lab`.
+You need to format your dataset as shown above and place it under `data`. Audio files can have `.mp3`, `.wav`, or `.flac` extensions, and the annotation files should have the `.lab` extension.
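A `.lab` file is the plain-text transcript of its matching audio clip; for example, `38.79-40.85.lab` might contain a single line such as (hypothetical content):

```
Hello, this is a hypothetical transcript of the matching clip.
```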
 
-### 2.Start Training
+### 2. Split Training and Validation Sets
 
 ```bash
-python fish_speech/train.py --config-name vits_decoder_finetune
+python tools/vqgan/create_train_split.py data
 ```
 
+This command will create `data/vq_train_filelist.txt` and `data/vq_val_filelist.txt` in the `data/demo` directory, to be used for training and validation respectively.
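The generated filelists are expected to hold one audio path per line (an assumption worth verifying against the generated files), e.g.:

```
data/SPK1/21.15-26.44.mp3
data/SPK1/27.51-29.98.mp3
```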
 
-#### Fine-tuning with LoRA
+!!! info
+    For the VITS format, you can specify a file list using `--filelist xxx.list`.
+    Please note that the audio files in `filelist` must also be located in the `data` folder.
+
+### 3. Start Training
+
+```bash
+python fish_speech/train.py --config-name vits_decoder_finetune
+```
+
 !!! note
-    LoRA can reduce the risk of overfitting in models, but it may also lead to underfitting on large datasets.
+    You can modify training parameters by editing `fish_speech/configs/vits_decoder_finetune.yaml`, but in most cases, this won't be necessary.
 
-If you want to use LoRA, please add the following parameter: `[email protected]_config=r_8_alpha_16`.
+### 4. Test the Audio
+
+```bash
+python tools/vits_decoder/inference.py \
+    --checkpoint-path results/vits_decoder_finetune/checkpoints/step_000010000.ckpt \
+    -i test.npy -r test.wav \
+    --text "The text you want to generate"
+```
 
-After training, you need to convert the LoRA weights to regular weights before performing inference.
+You can review `fake.wav` to assess the fine-tuning results.
+
+## Fine-tuning VQGAN (Not Recommended)
+
+We no longer recommend using VQGAN for fine-tuning in version 1.1. Using VITS Decoder will yield better results, but if you still want to fine-tune VQGAN, you can refer to the following steps.
+
+### 1. Prepare the Dataset
+
+```
+.
+├── SPK1
+│   ├── 21.15-26.44.mp3
+│   ├── 27.51-29.98.mp3
+│   └── 30.1-32.71.mp3
+└── SPK2
+    └── 38.79-40.85.mp3
+```
+
+You need to format your dataset as shown above and place it under `data`. Audio files can have `.mp3`, `.wav`, or `.flac` extensions.
+
+### 2. Split Training and Validation Sets
 
 ```bash
-python tools/llama/merge_lora.py \
-    --llama-config dual_ar_2_codebook_medium \
-    --lora-config r_8_alpha_16 \
-    --llama-weight checkpoints/text2semantic-sft-medium-v1.1-4k.pth \
-    --lora-weight results/text2semantic-finetune-medium-lora/checkpoints/step_000000200.ckpt \
-    --output checkpoints/merged.ckpt
+python tools/vqgan/create_train_split.py data
 ```
 
+This command will create `data/vq_train_filelist.txt` and `data/vq_val_filelist.txt` in the `data/demo` directory, to be used for training and validation respectively.
+
+!!!info
+    For the VITS format, you can specify a file list using `--filelist xxx.list`.
+    Please note that the audio files in `filelist` must also be located in the `data` folder.
+
+### 3. Start Training
+
+```bash
+python fish_speech/train.py --config-name vqgan_finetune
+```
+
+!!! note
+    You can modify training parameters by editing `fish_speech/configs/vqgan_finetune.yaml`, but in most cases, this won't be necessary.
+
+### 4. Test the Audio
+
+```bash
+python tools/vqgan/inference.py -i test.wav --checkpoint-path results/vqgan_finetune/checkpoints/step_000010000.ckpt
+```
+
+You can review `fake.wav` to assess the fine-tuning results.
+
+!!! note
+    You may also try other checkpoints. We suggest using the earliest checkpoint that meets your requirements, as they often perform better on out-of-distribution (OOD) data.
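For instance, to audition an earlier checkpoint; the step number below is hypothetical, so list the directory to see what actually exists:

```bash
# See which checkpoints were saved during fine-tuning.
ls results/vqgan_finetune/checkpoints/
# Decode the same test clip with an earlier, possibly better-generalizing checkpoint.
python tools/vqgan/inference.py -i test.wav \
    --checkpoint-path results/vqgan_finetune/checkpoints/step_000005000.ckpt
```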