Modify the documents (#303)
* Modify the documents

* fix model name
PoTaTo-Mika authored Jul 3, 2024
1 parent 6259235 commit 90fc0be
Showing 6 changed files with 34 additions and 326 deletions.
125 changes: 3 additions & 122 deletions docs/en/finetune.md
@@ -2,22 +2,7 @@

Obviously, when you opened this page, you were not satisfied with the performance of the few-shot pre-trained model. You want to fine-tune a model to improve its performance on your dataset.

`Fish Speech` consists of three modules: `VQGAN`, `LLAMA`, and `VITS`.

!!! info
You should first conduct the following test to determine if you need to fine-tune `VITS Decoder`:
```bash
python tools/vqgan/inference.py -i test.wav
python tools/vits_decoder/inference.py \
-ckpt checkpoints/vits_decoder_v1.1.ckpt \
-i fake.npy -r test.wav \
--text "The text you want to generate"
```
This test will generate a `fake.wav` file. If the timbre of this file differs from the speaker's original voice, or if the quality is not high, you need to fine-tune `VITS Decoder`.

Similarly, you can refer to [Inference](inference.md) to run `generate.py` and evaluate if the prosody meets your expectations. If it does not, then you need to fine-tune `LLAMA`.

It is recommended to fine-tune the LLAMA first, then fine-tune the `VITS Decoder` according to your needs.
In the current version, you only need to fine-tune the `LLAMA` part.

## Fine-tuning LLAMA
### 1. Prepare the dataset
@@ -51,7 +36,7 @@ You need to convert your dataset into the above format and place it under `data`
Make sure you have downloaded the VQGAN weights. If not, run the following command:

```bash
huggingface-cli download fishaudio/fish-speech-1 vq-gan-group-fsq-2x1024.pth --local-dir checkpoints
huggingface-cli download fishaudio/fish-speech-1.2 firefly-gan-vq-fsq-4x1024-42hz-generator.pth --local-dir checkpoints
```

You can then run the following command to extract semantic tokens:
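The extraction command itself is collapsed in this diff; a rough sketch of what it typically looks like (the script path, flags, and checkpoint name are assumptions, not taken from this commit):

```bash
# Hypothetical invocation — check the script's --help for the actual flags.
python tools/vqgan/extract_vq.py data \
    --num-workers 1 --batch-size 16 \
    --config-name "firefly_gan_vq" \
    --checkpoint-path "checkpoints/firefly-gan-vq-fsq-4x1024-42hz-generator.pth"
```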
@@ -125,7 +110,7 @@ After training is complete, you can refer to the [inference](inference.md) section
By default, the model will only learn the speaker's speech patterns and not the timbre. You still need to use prompts to ensure timbre stability.
If you want to learn the timbre, you can increase the number of training steps, but this may lead to overfitting.

#### Fine-tuning with LoRA
#### Fine-tuning with LoRA (recommended)

!!! note
LoRA can reduce the risk of overfitting in models, but it may also lead to underfitting on large datasets.
@@ -143,109 +128,5 @@ python tools/llama/merge_lora.py \
--output checkpoints/merged.ckpt
```


## Fine-tuning VITS Decoder
### 1. Prepare the Dataset

```
.
├── SPK1
│ ├── 21.15-26.44.lab
│ ├── 21.15-26.44.mp3
│ ├── 27.51-29.98.lab
│ ├── 27.51-29.98.mp3
│ ├── 30.1-32.71.lab
│ └── 30.1-32.71.mp3
└── SPK2
├── 38.79-40.85.lab
└── 38.79-40.85.mp3
```

!!! note
VITS fine-tuning currently only supports `.lab` as the label file and does not support the `filelist` format.

You need to format your dataset as shown above and place it under `data`. Audio files can have `.mp3`, `.wav`, or `.flac` extensions, and the annotation files should have the `.lab` extension.
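Before splitting, you can sanity-check the layout with a small shell loop (a minimal sketch, assuming a POSIX shell) that reports any audio file missing a same-named `.lab` transcript:

```bash
# List audio files under data/ that lack a matching .lab transcript.
find data -type f \( -name "*.mp3" -o -name "*.wav" -o -name "*.flac" \) | while read -r audio; do
    lab="${audio%.*}.lab"
    [ -f "$lab" ] || echo "Missing transcript: $lab"
done
```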

### 2. Split Training and Validation Sets

```bash
python tools/vqgan/create_train_split.py data
```

This command will create `data/vq_train_filelist.txt` and `data/vq_val_filelist.txt` in the `data` directory, to be used for training and validation respectively.

!!! info
For the VITS format, you can specify a file list using `--filelist xxx.list`.
Please note that the audio files in `filelist` must also be located in the `data` folder.

### 3. Start Training

```bash
python fish_speech/train.py --config-name vits_decoder_finetune
```

!!! note
You can modify training parameters by editing `fish_speech/configs/vits_decoder_finetune.yaml`, but in most cases, this won't be necessary.

### 4. Test the Audio

```bash
python tools/vits_decoder/inference.py \
--checkpoint-path results/vits_decoder_finetune/checkpoints/step_000010000.ckpt \
-i test.npy -r test.wav \
--text "The text you want to generate"
```

You can review `fake.wav` to assess the fine-tuning results.
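Besides listening, a quick numeric sanity check is to confirm the generated file has a plausible duration and sample rate (a sketch, assuming the `soundfile` package is installed in your environment):

```bash
# Print duration and sample rate of the generated audio.
python -c "import soundfile as sf; d, sr = sf.read('fake.wav'); print(f'{len(d)/sr:.2f} s @ {sr} Hz')"
```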


## Fine-tuning VQGAN (Not Recommended)


We no longer recommend using VQGAN for fine-tuning in version 1.1. Using VITS Decoder will yield better results, but if you still want to fine-tune VQGAN, you can refer to the following steps.

### 1. Prepare the Dataset

```
.
├── SPK1
│ ├── 21.15-26.44.mp3
│ ├── 27.51-29.98.mp3
│ └── 30.1-32.71.mp3
└── SPK2
└── 38.79-40.85.mp3
```

You need to format your dataset as shown above and place it under `data`. Audio files can have `.mp3`, `.wav`, or `.flac` extensions.

### 2. Split Training and Validation Sets

```bash
python tools/vqgan/create_train_split.py data
```

This command will create `data/vq_train_filelist.txt` and `data/vq_val_filelist.txt` in the `data` directory, to be used for training and validation respectively.

!!! info
For the VITS format, you can specify a file list using `--filelist xxx.list`.
Please note that the audio files in `filelist` must also be located in the `data` folder.

### 3. Start Training

```bash
python fish_speech/train.py --config-name firefly_gan_vq
```

!!! note
You can modify training parameters by editing `fish_speech/configs/firefly_gan_vq.yaml`, but in most cases, this won't be necessary.

### 4. Test the Audio

```bash
python tools/vqgan/inference.py -i test.wav --checkpoint-path results/firefly_gan_vq/checkpoints/step_000010000.ckpt
```

You can review `fake.wav` to assess the fine-tuning results.

!!! note
You may also try other checkpoints. We suggest using the earliest checkpoint that meets your requirements, as they often perform better on out-of-distribution (OOD) data.
1 change: 1 addition & 0 deletions docs/en/index.md
@@ -107,6 +107,7 @@ apt install libsox-dev

## Changelog

- 2024/07/02: Updated Fish-Speech to 1.2 version, removed the VITS Decoder, and greatly enhanced the zero-shot ability.
- 2024/05/10: Updated Fish-Speech to 1.1 version, implemented the VITS decoder to reduce WER and improve timbre similarity.
- 2024/04/22: Finished Fish-Speech 1.0 version, significantly modified VQGAN and LLAMA models.
- 2023/12/28: Added `lora` fine-tuning support.
49 changes: 10 additions & 39 deletions docs/en/inference.md
@@ -10,17 +10,13 @@ Inference supports command line, HTTP API, and web UI.
3. Given a new piece of text, let the model generate the corresponding semantic tokens.
4. Input the generated semantic tokens into VITS / VQGAN to decode and generate the corresponding voice.

In version 1.1, we recommend using VITS for decoding, as it performs better than VQGAN in both timbre and pronunciation.

## Command Line Inference

Download the required `vqgan` and `text2semantic` models from our Hugging Face repository.
Download the required `vqgan` and `llama` models from our Hugging Face repository.

```bash
huggingface-cli download fishaudio/fish-speech-1 vq-gan-group-fsq-2x1024.pth --local-dir checkpoints
huggingface-cli download fishaudio/fish-speech-1 text2semantic-sft-medium-v1.1-4k.pth --local-dir checkpoints
huggingface-cli download fishaudio/fish-speech-1 vits_decoder_v1.1.ckpt --local-dir checkpoints
huggingface-cli download fishaudio/fish-speech-1 firefly-gan-base-generator.ckpt --local-dir checkpoints
huggingface-cli download fishaudio/fish-speech-1.2 firefly-gan-vq-fsq-4x1024-42hz-generator.pth --local-dir checkpoints
huggingface-cli download fishaudio/fish-speech-1.2 model.pth --local-dir checkpoints
```

### 1. Generate prompt from voice:
@@ -42,7 +38,7 @@ python tools/llama/generate.py \
--prompt-text "Your reference text" \
--prompt-tokens "fake.npy" \
--config-name dual_ar_2_codebook_medium \
--checkpoint-path "checkpoints/text2semantic-sft-medium-v1.1-4k.pth" \
--checkpoint-path "checkpoints/model.pth" \
--num-samples 2 \
--compile
```
@@ -61,14 +57,6 @@ This command will create a `codes_N` file in the working directory, where N is a number starting from 0.
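If you want to inspect the output before decoding, the `.npy` file can be loaded directly (a sketch, assuming it is a standard NumPy array of token IDs):

```bash
# Peek at the generated semantic tokens (shape and dtype only).
python -c "import numpy as np; codes = np.load('codes_0.npy'); print(codes.shape, codes.dtype)"
```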

### 3. Generate vocals from semantic tokens:

#### VITS Decoder
```bash
python tools/vits_decoder/inference.py \
--checkpoint-path checkpoints/vits_decoder_v1.1.ckpt \
-i codes_0.npy -r ref.wav \
--text "The text you want to generate"
```

#### VQGAN Decoder (not recommended)
```bash
python tools/vqgan/inference.py \
@@ -83,42 +71,25 @@ We provide an HTTP API for inference. You can use the following command to start the server:
```bash
python -m tools.api \
--listen 0.0.0.0:8000 \
--llama-checkpoint-path "checkpoints/text2semantic-sft-medium-v1.1-4k.pth" \
--llama-config-name dual_ar_2_codebook_medium \
--llama-checkpoint-path "checkpoints/model.pth" \
--llama-config-name dual_ar_4_codebook_medium \
--decoder-checkpoint-path "checkpoints/fish-speech-1.2/firefly-gan-vq-fsq-4x1024-42hz-generator.pth" \
--decoder-config-name firefly_gan_vq
```

After that, you can view and test the API at http://127.0.0.1:8000/.
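As a minimal smoke test from another terminal, you can fetch the root page mentioned above (the TTS endpoints themselves are not shown in this diff):

```bash
# Expect a 200 status code if the API server is up.
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:8000/
```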

!!! info
You should use the following parameters to start the VITS decoder:

```bash
--decoder-config-name vits_decoder_finetune \
--decoder-checkpoint-path "checkpoints/vits_decoder_v1.1.ckpt" # or your own model
```

## WebUI Inference

You can start the WebUI using the following command:

```bash
python -m tools.webui \
--llama-checkpoint-path "checkpoints/text2semantic-sft-medium-v1.1-4k.pth" \
--llama-config-name dual_ar_2_codebook_medium \
--vqgan-checkpoint-path "checkpoints/fish-speech-1.2/firefly-gan-vq-fsq-4x1024-42hz-generator.pth" \
--vits-checkpoint-path "checkpoints/vits_decoder_v1.1.ckpt"
--llama-checkpoint-path "checkpoints/model.pth" \
--llama-config-name dual_ar_4_codebook_medium \
--decoder-checkpoint-path "checkpoints/fish-speech-1.2/firefly-gan-vq-fsq-4x1024-42hz-generator.pth" \
--decoder-config-name firefly_gan_vq
```

!!! info
You should use the following parameters to start the VITS decoder:

```bash
--decoder-config-name vits_decoder_finetune \
--decoder-checkpoint-path "checkpoints/vits_decoder_v1.1.ckpt" # or your own model
```

!!! note
You can use Gradio environment variables, such as `GRADIO_SHARE`, `GRADIO_SERVER_PORT`, `GRADIO_SERVER_NAME` to configure WebUI.
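For example, to pin the port and create a public share link, you might prefix the launch command with standard Gradio variables (values here are only illustrative):

```bash
# Standard Gradio environment variables; the flags are the same as above.
GRADIO_SERVER_PORT=7860 GRADIO_SHARE=true python -m tools.webui \
    --llama-checkpoint-path "checkpoints/model.pth" \
    --llama-config-name dual_ar_4_codebook_medium \
    --decoder-checkpoint-path "checkpoints/fish-speech-1.2/firefly-gan-vq-fsq-4x1024-42hz-generator.pth" \
    --decoder-config-name firefly_gan_vq
```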

