From b96ebdbcf6772f814189476b63c1f9f868260e42 Mon Sep 17 00:00:00 2001
From: Lengyue <lengyue@lengyue.me>
Date: Sat, 11 May 2024 10:12:44 -0400
Subject: [PATCH] Update docs and samples

---
 .gitignore           |   1 +
 docs/en/finetune.md  | 170 ++++++++++++++++++++++++++-----------------
 docs/en/index.md     |   4 +-
 docs/en/inference.md |  42 ++++++++++-
 docs/en/samples.md   |  36 ++++-----
 docs/zh/finetune.md  | 161 ++++++++++++++++++++++++----------------
 docs/zh/index.md     |   4 +-
 docs/zh/inference.md |  45 +++++++++++-
 docs/zh/samples.md   |  36 ++++-----
 9 files changed, 320 insertions(+), 179 deletions(-)

diff --git a/.gitignore b/.gitignore
index 88dcad8c..c1ab711f 100644
--- a/.gitignore
+++ b/.gitignore
@@ -23,3 +23,4 @@ asr-label*
 /.cache
 /fishenv
 /.locale
+/demo-audios
diff --git a/docs/en/finetune.md b/docs/en/finetune.md
index d433b905..1537ecda 100644
--- a/docs/en/finetune.md
+++ b/docs/en/finetune.md
@@ -2,65 +2,22 @@
 
 Obviously, when you opened this page, you were not satisfied with the performance of the few-shot pre-trained model. You want to fine-tune a model to improve its performance on your dataset.
 
-`Fish Speech` consists of three modules: `VQGAN`, `LLAMA`and `VITS`.
+`Fish Speech` consists of three modules: `VQGAN`, `LLAMA`, and `VITS Decoder`.
 
 !!! info 
-    You should first conduct the following test to determine if you need to fine-tune `VQGAN`:
+    You should first conduct the following test to determine whether you need to fine-tune the `VITS Decoder`:
     ```bash
     python tools/vqgan/inference.py -i test.wav
+    python tools/vits_decoder/inference.py \
+        --checkpoint-path checkpoints/vits_decoder_v1.1.ckpt \
+        -i fake.npy -r test.wav \
+        --text "The text you want to generate"
     ```
-    This test will generate a `fake.wav` file. If the timbre of this file differs from the speaker's original voice, or if the quality is not high, you need to fine-tune `VQGAN`.
+    This test will generate a `fake.wav` file. If the timbre of this file differs from the speaker's original voice, or if the quality is not high, you need to fine-tune the `VITS Decoder`.
 
     Similarly, you can refer to [Inference](inference.md) to run `generate.py` and evaluate if the prosody meets your expectations. If it does not, then you need to fine-tune `LLAMA`.
 	
-    It is recommended to fine-tune the LLAMA and VITS model first, then fine-tune the `VQGAN` according to your needs.
-
-## Fine-tuning VQGAN
-### 1. Prepare the Dataset
-
-```
-.
-├── SPK1
-│   ├── 21.15-26.44.mp3
-│   ├── 27.51-29.98.mp3
-│   └── 30.1-32.71.mp3
-└── SPK2
-    └── 38.79-40.85.mp3
-```
-
-You need to format your dataset as shown above and place it under `data`. Audio files can have `.mp3`, `.wav`, or `.flac` extensions.
-
-### 2. Split Training and Validation Sets
-
-```bash
-python tools/vqgan/create_train_split.py data
-```
-
-This command will create `data/vq_train_filelist.txt` and `data/vq_val_filelist.txt` in the `data/demo` directory, to be used for training and validation respectively.
-
-!!!info
-    For the VITS format, you can specify a file list using `--filelist xxx.list`.  
-    Please note that the audio files in `filelist` must also be located in the `data` folder.
-
-### 3. Start Training
-
-```bash
-python fish_speech/train.py --config-name vqgan_finetune
-```
-
-!!! note
-    You can modify training parameters by editing `fish_speech/configs/vqgan_finetune.yaml`, but in most cases, this won't be necessary.
-
-### 4. Test the Audio
-    
-```bash
-python tools/vqgan/inference.py -i test.wav --checkpoint-path results/vqgan_finetune/checkpoints/step_000010000.ckpt
-```
-
-You can review `fake.wav` to assess the fine-tuning results.
-
-!!! note
-    You may also try other checkpoints. We suggest using the earliest checkpoint that meets your requirements, as they often perform better on out-of-distribution (OOD) data.
+    It is recommended to fine-tune `LLAMA` first, then fine-tune the `VITS Decoder` according to your needs.
 
 ## Fine-tuning LLAMA
 ### 1. Prepare the dataset
@@ -168,8 +125,27 @@ After training is complete, you can refer to the [inference](inference.md) secti
     By default, the model will only learn the speaker's speech patterns and not the timbre. You still need to use prompts to ensure timbre stability.
     If you want to learn the timbre, you can increase the number of training steps, but this may lead to overfitting.
 
-## Fine-tuning VITS
-### 1. Prepare the dataset
+#### Fine-tuning with LoRA
+
+!!! note
+    LoRA can reduce the risk of overfitting in models, but it may also lead to underfitting on large datasets. 
+
+If you want to use LoRA, please add the following parameter: `+lora@model.lora_config=r_8_alpha_16`. 
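+
+For example, a sketch of the full training invocation with LoRA enabled (assuming the `text2semantic_finetune` config from the training step above; keep any other overrides you already pass there):
+
+```bash
+python fish_speech/train.py --config-name text2semantic_finetune \
+    +lora@model.lora_config=r_8_alpha_16
+```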
+
+After training, you need to convert the LoRA weights to regular weights before performing inference.
+
+```bash
+python tools/llama/merge_lora.py \
+    --llama-config dual_ar_2_codebook_medium \
+    --lora-config r_8_alpha_16 \
+    --llama-weight checkpoints/text2semantic-sft-medium-v1.1-4k.pth \
+    --lora-weight results/text2semantic-finetune-medium-lora/checkpoints/step_000000200.ckpt \
+    --output checkpoints/merged.ckpt
+```
+
+
+## Fine-tuning VITS Decoder
+### 1. Prepare the Dataset
 
 ```
 .
@@ -184,32 +160,92 @@ After training is complete, you can refer to the [inference](inference.md) secti
     ├── 38.79-40.85.lab
     └── 38.79-40.85.mp3
 ```
+
 !!! note
-	The fine-tuning for VITS only support the .lab format files, please don't use .list file!
+    VITS fine-tuning currently only supports `.lab` as the label file and does not support the `filelist` format.
 
-You need to convert the dataset to the format above, and move them to the `data` , the suffix of the files can be `.mp3`, `.wav` 或 `.flac`, the label files' suffix are recommended to be  `.lab`.
+You need to format your dataset as shown above and place it under `data`. Audio files can have `.mp3`, `.wav`, or `.flac` extensions, and the annotation files should have the `.lab` extension.
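+
+For reference, a minimal sketch of what a label file is assumed to contain: the plain-text transcription of the audio clip that shares its name (the transcript shown here is purely illustrative).
+
+```bash
+# Hypothetical example: 21.15-26.44.lab is assumed to hold the transcript of 21.15-26.44.mp3
+cat data/SPK1/21.15-26.44.lab
+# Hello, nice to meet you.
+```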
 
-### 2.Start Training
+### 2. Split Training and Validation Sets
 
 ```bash
-python fish_speech/train.py --config-name vits_decoder_finetune
+python tools/vqgan/create_train_split.py data
 ```
 
+This command will create `data/vq_train_filelist.txt` and `data/vq_val_filelist.txt` in the `data` directory, to be used for training and validation respectively.
 
-#### Fine-tuning with LoRA
+!!! info
+    For the VITS format, you can specify a file list using `--filelist xxx.list`.  
+    Please note that the audio files in `filelist` must also be located in the `data` folder.
+
+### 3. Start Training
+
+```bash
+python fish_speech/train.py --config-name vits_decoder_finetune
+```
 
 !!! note
-    LoRA can reduce the risk of overfitting in models, but it may also lead to underfitting on large datasets. 
+    You can modify training parameters by editing `fish_speech/configs/vits_decoder_finetune.yaml`, but in most cases, this won't be necessary.
 
-If you want to use LoRA, please add the following parameter: `+lora@model.lora_config=r_8_alpha_16`. 
+### 4. Test the Audio
+    
+```bash
+python tools/vits_decoder/inference.py \
+    --checkpoint-path results/vits_decoder_finetune/checkpoints/step_000010000.ckpt \
+    -i test.npy -r test.wav \
+    --text "The text you want to generate"
+```
 
-After training, you need to convert the LoRA weights to regular weights before performing inference.
+You can review `fake.wav` to assess the fine-tuning results.
+
+
+## Fine-tuning VQGAN (Not Recommended)
+
+As of version 1.1, we no longer recommend fine-tuning VQGAN; the VITS Decoder yields better results. If you still want to fine-tune VQGAN, you can follow the steps below.
+
+### 1. Prepare the Dataset
+
+```
+.
+├── SPK1
+│   ├── 21.15-26.44.mp3
+│   ├── 27.51-29.98.mp3
+│   └── 30.1-32.71.mp3
+└── SPK2
+    └── 38.79-40.85.mp3
+```
+
+You need to format your dataset as shown above and place it under `data`. Audio files can have `.mp3`, `.wav`, or `.flac` extensions.
+
+### 2. Split Training and Validation Sets
 
 ```bash
-python tools/llama/merge_lora.py \
-    --llama-config dual_ar_2_codebook_medium \
-    --lora-config r_8_alpha_16 \
-    --llama-weight checkpoints/text2semantic-sft-medium-v1.1-4k.pth \
-    --lora-weight results/text2semantic-finetune-medium-lora/checkpoints/step_000000200.ckpt \
-    --output checkpoints/merged.ckpt
+python tools/vqgan/create_train_split.py data
 ```
+
+This command will create `data/vq_train_filelist.txt` and `data/vq_val_filelist.txt` in the `data` directory, to be used for training and validation respectively.
+
+!!! info
+    For the VITS format, you can specify a file list using `--filelist xxx.list`.  
+    Please note that the audio files in `filelist` must also be located in the `data` folder.
+
+### 3. Start Training
+
+```bash
+python fish_speech/train.py --config-name vqgan_finetune
+```
+
+!!! note
+    You can modify training parameters by editing `fish_speech/configs/vqgan_finetune.yaml`, but in most cases, this won't be necessary.
+
+### 4. Test the Audio
+    
+```bash
+python tools/vqgan/inference.py -i test.wav --checkpoint-path results/vqgan_finetune/checkpoints/step_000010000.ckpt
+```
+
+You can review `fake.wav` to assess the fine-tuning results.
+
+!!! note
+    You may also try other checkpoints. We suggest using the earliest checkpoint that meets your requirements, as they often perform better on out-of-distribution (OOD) data.
diff --git a/docs/en/index.md b/docs/en/index.md
index 6ec6f816..8f281d27 100644
--- a/docs/en/index.md
+++ b/docs/en/index.md
@@ -39,13 +39,13 @@ pip3 install torch torchvision torchaudio
 # Install fish-speech
 pip3 install -e .
 
-#install sox
+# (Ubuntu / Debian users) Install sox
 apt install libsox-dev
 ```
 
 ## Changelog
 
-- 2024/05/10: Updated Fish-Speech to 1.1 version, importing VITS as the Decoder part.
+- 2024/05/10: Updated Fish-Speech to version 1.1, introducing a VITS decoder to reduce WER and improve timbre similarity.
 - 2024/04/22: Finished Fish-Speech 1.0 version, significantly modified VQGAN and LLAMA models.
 - 2023/12/28: Added `lora` fine-tuning support.
 - 2023/12/27: Add `gradient checkpointing`, `causual sampling`, and `flash-attn` support.
diff --git a/docs/en/inference.md b/docs/en/inference.md
index b0bd27b6..fe1b5c2b 100644
--- a/docs/en/inference.md
+++ b/docs/en/inference.md
@@ -5,10 +5,12 @@ Inference support command line, HTTP API and web UI.
 !!! note
     Overall, reasoning consists of several parts:
 
-    1. Encode a given 5-10 seconds of voice using VQGAN.
+    1. Encode a given ~10 seconds of voice using VQGAN.
     2. Input the encoded semantic tokens and the corresponding text into the language model as an example.
     3. Given a new piece of text, let the model generate the corresponding semantic tokens.
-    4. Input the generated semantic tokens into VQGAN to decode and generate the corresponding voice.
+    4. Input the generated semantic tokens into VITS / VQGAN to decode and generate the corresponding voice.
+
+In version 1.1, we recommend using VITS for decoding, as it performs better than VQGAN in both timbre and pronunciation.
 
 ## Command Line Inference
 
@@ -17,6 +19,7 @@ Download the required `vqgan` and `text2semantic` models from our Hugging Face r
 ```bash
 huggingface-cli download fishaudio/fish-speech-1 vq-gan-group-fsq-2x1024.pth --local-dir checkpoints
 huggingface-cli download fishaudio/fish-speech-1 text2semantic-sft-medium-v1.1-4k.pth --local-dir checkpoints
+huggingface-cli download fishaudio/fish-speech-1 vits_decoder_v1.1.ckpt --local-dir checkpoints
 ```
 
 ### 1. Generate prompt from voice:
@@ -56,6 +59,16 @@ This command will create a `codes_N` file in the working directory, where N is a
     If you are using your own fine-tuned model, please be sure to carry the `--speaker` parameter to ensure the stability of pronunciation.
 
 ### 3. Generate vocals from semantic tokens:
+
+#### VITS Decoder
+```bash
+python tools/vits_decoder/inference.py \
+    --checkpoint-path checkpoints/vits_decoder_v1.1.ckpt \
+    -i codes_0.npy -r ref.wav \
+    --text "The text you want to generate"
+```
+
+#### VQGAN Decoder (not recommended)
 ```bash
 python tools/vqgan/inference.py \
     -i "codes_0.npy" \
@@ -71,11 +84,20 @@ python -m tools.api \
     --listen 0.0.0.0:8000 \
     --llama-checkpoint-path "checkpoints/text2semantic-sft-medium-v1.1-4k.pth" \
     --llama-config-name dual_ar_2_codebook_medium \
-    --vqgan-checkpoint-path "checkpoints/vq-gan-group-fsq-2x1024.pth"
+    --decoder-checkpoint-path "checkpoints/vq-gan-group-fsq-2x1024.pth" \
+    --decoder-config-name vqgan_pretrain
 ```
 
 After that, you can view and test the API at http://127.0.0.1:8000/.  
 
+!!! info
+    You should use the following parameters to start the VITS decoder:
+
+    ```bash
+    --decoder-config-name vits_decoder_finetune \
+    --decoder-checkpoint-path "checkpoints/vits_decoder_v1.1.ckpt" # or your own model
+    ```
+
 ## WebUI Inference
 
 You can start the WebUI using the following command:
@@ -84,7 +106,19 @@ You can start the WebUI using the following command:
 python -m tools.webui \
     --llama-checkpoint-path "checkpoints/text2semantic-sft-medium-v1.1-4k.pth" \
     --llama-config-name dual_ar_2_codebook_medium \
-    --vqgan-checkpoint-path "checkpoints/vq-gan-group-fsq-2x1024.pth"
+    --decoder-checkpoint-path "checkpoints/vq-gan-group-fsq-2x1024.pth" \
+    --decoder-config-name vqgan_pretrain
 ```
 
+!!! info
+    You should use the following parameters to start the VITS decoder:
+
+    ```bash
+    --decoder-config-name vits_decoder_finetune \
+    --decoder-checkpoint-path "checkpoints/vits_decoder_v1.1.ckpt" # or your own model
+    ```
+
+!!! note
+    You can use Gradio environment variables, such as `GRADIO_SHARE`, `GRADIO_SERVER_PORT`, and `GRADIO_SERVER_NAME`, to configure the WebUI.
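+
+    For example, a hypothetical invocation that binds the WebUI to a specific address and port via Gradio's standard environment variables:
+
+    ```bash
+    GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 python -m tools.webui \
+        --llama-checkpoint-path "checkpoints/text2semantic-sft-medium-v1.1-4k.pth" \
+        --llama-config-name dual_ar_2_codebook_medium \
+        --decoder-checkpoint-path "checkpoints/vq-gan-group-fsq-2x1024.pth" \
+        --decoder-config-name vqgan_pretrain
+    ```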
+
 Enjoy!
diff --git a/docs/en/samples.md b/docs/en/samples.md
index 61be3157..02918ab9 100644
--- a/docs/en/samples.md
+++ b/docs/en/samples.md
@@ -17,28 +17,28 @@
     <tbody>
     <tr>
         <td>Nahida (Genshin Impact)</td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/zh/0_input.wav" /></td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/zh/0_output.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/zh/0_input.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/zh/0_output.wav" /></td>
     </tr>
     <tr>
         <td>Zhongli (Genshin Impact)</td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/zh/1_input.wav" /></td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/zh/1_output.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/zh/1_input.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/zh/1_output.wav" /></td>
     </tr>
     <tr>
         <td>Furina (Genshin Impact)</td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/zh/2_input.wav" /></td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/zh/2_output.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/zh/2_input.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/zh/2_output.wav" /></td>
     </tr>
     <tr>
         <td>Random Speaker 1</td>
         <td> - </td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/zh/4_output.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/zh/4_output.wav" /></td>
     </tr>
     <tr>
         <td>Random Speaker 2</td>
         <td> - </td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/zh/5_output.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/zh/5_output.wav" /></td>
     </tr>
     </tbody>
 </table>
@@ -64,13 +64,13 @@
     <tbody>
     <tr>
         <td>Nahida (Genshin Impact)</td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/zh/0_input.wav" /></td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/zh/6_output.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/zh/0_input.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/zh/6_output.wav" /></td>
     </tr>
     <tr>
         <td>Random Speaker</td>
         <td> - </td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/zh/7_output.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/zh/7_output.wav" /></td>
     </tr>
     </tbody>
 </table>
@@ -96,7 +96,7 @@
     <tr>
         <td>Random Speaker</td>
         <td> - </td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/zh/8_output.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/zh/8_output.wav" /></td>
     </tr>
     </tbody>
 </table>
@@ -122,12 +122,12 @@ patterns to driving cars autonomously, AI's applications are vast and diverse.
     <tr>
         <td>Random Speaker 1</td>
         <td> - </td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/en/0_output.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/en/0_output.wav" /></td>
     </tr>
     <tr>
         <td>Random Speaker 2</td>
         <td> - </td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/en/1_output.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/en/1_output.wav" /></td>
     </tr>
     </tbody>
 </table>
@@ -155,7 +155,7 @@ me to serve as your personal voice assistant.
     <tr>
         <td>Random Speaker</td>
         <td> - </td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/en/2_output.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/en/2_output.wav" /></td>
     </tr>
     </tbody>
 </table>
@@ -181,12 +181,12 @@ me to serve as your personal voice assistant.
     <tr>
         <td>Random Speaker 1</td>
         <td> - </td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/ja/0_output.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/ja/0_output.wav" /></td>
     </tr>
     <tr>
         <td>Random Speaker 2</td>
         <td> - </td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/ja/1_output.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/ja/1_output.wav" /></td>
     </tr>
     </tbody>
 </table>
@@ -213,7 +213,7 @@ me to serve as your personal voice assistant.
     <tr>
         <td>Random Speaker</td>
         <td> - </td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/ja/2_output.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/ja/2_output.wav" /></td>
     </tr>
     </tbody>
 </table>
diff --git a/docs/zh/finetune.md b/docs/zh/finetune.md
index 25bcc138..0c55af4a 100644
--- a/docs/zh/finetune.md
+++ b/docs/zh/finetune.md
@@ -2,65 +2,22 @@
 
 显然, 当你打开这个页面的时候, 你已经对预训练模型 few-shot 的效果不算满意. 你想要微调一个模型, 使得它在你的数据集上表现更好.  
 
-`Fish Speech` 由三个模块组成: `VQGAN`,`LLAMA`和`VITS`. 
+`Fish Speech` 由三个模块组成: `VQGAN`, `LLAMA` 以及 `VITS Decoder`.
 
 !!! info 
-    你应该先进行如下测试来判断你是否需要微调 `VQGAN`:
+    你应该先进行如下测试来判断你是否需要微调 `VITS Decoder`:
     ```bash
     python tools/vqgan/inference.py -i test.wav
+    python tools/vits_decoder/inference.py \
+        --checkpoint-path checkpoints/vits_decoder_v1.1.ckpt \
+        -i fake.npy -r test.wav \
+        --text "合成文本"
     ```
-    该测试会生成一个 `fake.wav` 文件, 如果该文件的音色和说话人的音色不同, 或者质量不高, 你需要微调 `VQGAN`.
+    该测试会生成一个 `fake.wav` 文件, 如果该文件的音色和说话人的音色不同, 或者质量不高, 你需要微调 `VITS Decoder`.
 
     相应的, 你可以参考 [推理](inference.md) 来运行 `generate.py`, 判断韵律是否满意, 如果不满意, 则需要微调 `LLAMA`.
 
-    建议先对LLAMA以及VITS进行微调,最后再根据需要微调 `VQGAN `.
-
-## VQGAN 微调(如果对推理音频不满意再微调)
-### 1. 准备数据集
-
-```
-.
-├── SPK1
-│   ├── 21.15-26.44.mp3
-│   ├── 27.51-29.98.mp3
-│   └── 30.1-32.71.mp3
-└── SPK2
-    └── 38.79-40.85.mp3
-```
-
-你需要将数据集转为以上格式, 并放到 `data` 下, 音频后缀可以为 `.mp3`, `.wav` 或 `.flac`.
-
-### 2. 分割训练集和验证集
-
-```bash
-python tools/vqgan/create_train_split.py data
-```
-
-该命令会在 `data` 目录下创建 `data/vq_train_filelist.txt` 和 `data/vq_val_filelist.txt` 文件, 分别用于训练和验证.  
-
-!!!info
-    对于 VITS 格式, 你可以使用 `--filelist xxx.list` 来指定文件列表.  
-    请注意, `filelist` 所指向的音频文件必须也位于 `data` 文件夹下.
-
-### 3. 启动训练
-
-```bash
-python fish_speech/train.py --config-name vqgan_finetune
-```
-
-!!! note
-    你可以通过修改 `fish_speech/configs/vqgan_finetune.yaml` 来修改训练参数, 但大部分情况下, 你不需要这么做.
-
-### 4. 测试音频
-    
-```bash
-python tools/vqgan/inference.py -i test.wav --checkpoint-path results/vqgan_finetune/checkpoints/step_000010000.ckpt
-```
-
-你可以查看 `fake.wav` 来判断微调效果.
-
-!!! note
-    你也可以尝试其他的 checkpoint, 我们建议你使用最早的满足你要求的 checkpoint, 他们通常在 OOD 上表现更好.
+    建议先对 `LLAMA` 进行微调, 最后再根据需要微调 `VITS Decoder`.
 
 ## LLAMA 微调
 ### 1. 准备数据集
@@ -179,7 +136,24 @@ python fish_speech/train.py --config-name text2semantic_finetune \
     默认配置下, 基本只会学到说话人的发音方式, 而不包含音色, 你依然需要使用 prompt 来保证音色的稳定性.  
     如果你想要学到音色, 请将训练步数调大, 但这有可能会导致过拟合.
 
-## VITS微调
+#### 使用 LoRA 进行微调
+!!! note
+    LoRA 可以减少模型过拟合的风险, 但是相应的会导致在大数据集上欠拟合.   
+
+如果你想使用 LoRA, 请添加以下参数 `+lora@model.lora_config=r_8_alpha_16`.  
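+
+例如, 完整训练命令的一个示意 (假设沿用上文训练步骤中的 `text2semantic_finetune` 配置, 其余参数保持不变):
+
+```bash
+python fish_speech/train.py --config-name text2semantic_finetune \
+    +lora@model.lora_config=r_8_alpha_16
+```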
+
+训练完成后, 你需要先将 LoRA 的权重转为普通权重, 然后再进行推理.
+
+```bash
+python tools/llama/merge_lora.py \
+    --llama-config dual_ar_2_codebook_medium \
+    --lora-config r_8_alpha_16 \
+    --llama-weight checkpoints/text2semantic-sft-medium-v1.1-4k.pth \
+    --lora-weight results/text2semantic-finetune-medium-lora/checkpoints/step_000000200.ckpt \
+    --output checkpoints/merged.ckpt
+```
+
+## VITS Decoder 微调
 ### 1. 准备数据集
 
 ```
@@ -196,29 +170,88 @@ python fish_speech/train.py --config-name text2semantic_finetune \
     └── 38.79-40.85.mp3
 ```
 !!! note
-	VITS微调目前仅支持.lab作为标签文件,不支持filelist形式!
+    VITS 微调目前仅支持 `.lab` 作为标签文件, 不支持 `filelist` 形式.
 
 你需要将数据集转为以上格式, 并放到 `data` 下, 音频后缀可以为 `.mp3`, `.wav` 或 `.flac`, 标注文件后缀建议为 `.lab`.
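+
+作为参考, 下面是标注文件内容的一个示意 (假设每个 `.lab` 文件保存同名音频的纯文本转写, 示例文本仅作说明):
+
+```bash
+# 假设 21.15-26.44.lab 保存的是 21.15-26.44.mp3 的文本转写
+cat data/SPK1/21.15-26.44.lab
+# 你好, 很高兴见到你.
+```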
 
-### 2.启动训练
+### 2. 分割训练集和验证集
+
+```bash
+python tools/vqgan/create_train_split.py data
+```
+
+该命令会在 `data` 目录下创建 `data/vq_train_filelist.txt` 和 `data/vq_val_filelist.txt` 文件, 分别用于训练和验证.  
+
+!!! info
+    对于 VITS 格式, 你可以使用 `--filelist xxx.list` 来指定文件列表.  
+    请注意, `filelist` 所指向的音频文件必须也位于 `data` 文件夹下.
+
+### 3. 启动训练
 
 ```bash
 python fish_speech/train.py --config-name vits_decoder_finetune
 ```
 
-#### 使用 lora 进行微调
 !!! note
-    lora 可以减少模型过拟合的风险, 但是相应的会导致在大数据集上欠拟合.   
+    你可以通过修改 `fish_speech/configs/vits_decoder_finetune.yaml` 来修改训练参数, 如数据集配置.
+
+### 4. 测试音频
+    
+```bash
+python tools/vits_decoder/inference.py \
+    --checkpoint-path results/vits_decoder_finetune/checkpoints/step_000010000.ckpt \
+    -i test.npy -r test.wav \
+    --text "合成文本"
+```
+
+你可以查看 `fake.wav` 来判断微调效果.
+
+## VQGAN 微调 (不推荐)
 
-如果你想使用 lora, 请添加以下参数 `+lora@model.lora_config=r_8_alpha_16`.  
+在 V1.1 版本中, 我们不再推荐使用 VQGAN 进行微调, 使用 VITS Decoder 会获得更好的表现. 但是如果你仍然想要使用 VQGAN 进行微调, 你可以参考以下步骤.
 
-训练完成后, 你需要先将 lora 的权重转为普通权重, 然后再进行推理.
+### 1. 准备数据集
+
+```
+.
+├── SPK1
+│   ├── 21.15-26.44.mp3
+│   ├── 27.51-29.98.mp3
+│   └── 30.1-32.71.mp3
+└── SPK2
+    └── 38.79-40.85.mp3
+```
+
+你需要将数据集转为以上格式, 并放到 `data` 下, 音频后缀可以为 `.mp3`, `.wav` 或 `.flac`.
+
+### 2. 分割训练集和验证集
 
 ```bash
-python tools/llama/merge_lora.py \
-    --llama-config dual_ar_2_codebook_medium \
-    --lora-config r_8_alpha_16 \
-    --llama-weight checkpoints/text2semantic-sft-medium-v1.1-4k.pth \
-    --lora-weight results/text2semantic-finetune-medium-lora/checkpoints/step_000000200.ckpt \
-    --output checkpoints/merged.ckpt
+python tools/vqgan/create_train_split.py data
+```
+
+该命令会在 `data` 目录下创建 `data/vq_train_filelist.txt` 和 `data/vq_val_filelist.txt` 文件, 分别用于训练和验证.  
+
+!!! info
+    对于 VITS 格式, 你可以使用 `--filelist xxx.list` 来指定文件列表.  
+    请注意, `filelist` 所指向的音频文件必须也位于 `data` 文件夹下.
+
+### 3. 启动训练
+
+```bash
+python fish_speech/train.py --config-name vqgan_finetune
+```
+
+!!! note
+    你可以通过修改 `fish_speech/configs/vqgan_finetune.yaml` 来修改训练参数, 但大部分情况下, 你不需要这么做.
+
+### 4. 测试音频
+    
+```bash
+python tools/vqgan/inference.py -i test.wav --checkpoint-path results/vqgan_finetune/checkpoints/step_000010000.ckpt
 ```
+
+你可以查看 `fake.wav` 来判断微调效果.
+
+!!! note
+    你也可以尝试其他的 checkpoint, 我们建议你使用最早的满足你要求的 checkpoint, 他们通常在 OOD 上表现更好.
diff --git a/docs/zh/index.md b/docs/zh/index.md
index cc770d62..842250b2 100644
--- a/docs/zh/index.md
+++ b/docs/zh/index.md
@@ -39,14 +39,14 @@ pip3 install torch torchvision torchaudio
 # 安装 fish-speech
 pip3 install -e .
 
-# 安装 sox
+# (Ubuntu / Debian 用户) 安装 sox
 apt install libsox-dev
 ```
 
 
 ## 更新日志
 
-- 2024/05/10: 更新了 Fish-Speech 到 1.1 版本,引入了 VITS 作为Decoder部分.
+- 2024/05/10: 更新了 Fish-Speech 到 1.1 版本, 引入了 VITS Decoder 来降低 WER 并提高音色相似度.
 - 2024/04/22: 完成了 Fish-Speech 1.0 版本, 大幅修改了 VQGAN 和 LLAMA 模型.
 - 2023/12/28: 添加了 `lora` 微调支持.
 - 2023/12/27: 添加了 `gradient checkpointing`, `causual sampling` 和 `flash-attn` 支持.
diff --git a/docs/zh/inference.md b/docs/zh/inference.md
index 3dc0ed94..94c69275 100644
--- a/docs/zh/inference.md
+++ b/docs/zh/inference.md
@@ -5,10 +5,12 @@
 !!! note
     总的来说, 推理分为几个部分:  
 
-    1. 给定一段 5-10 秒的语音, 将它用 VQGAN 编码.  
+    1. 给定一段约 10 秒的语音, 将它用 VQGAN 编码.  
     2. 将编码后的语义 token 和对应文本输入语言模型作为例子.  
     3. 给定一段新文本, 让模型生成对应的语义 token.  
-    4. 将生成的语义 token 输入 VQGAN 解码, 生成对应的语音.  
+    4. 将生成的语义 token 输入 VITS / VQGAN 解码, 生成对应的语音.  
+
+在 V1.1 版本中, 我们推荐优先使用 VITS 解码器, 因为它在音色和发音上都有更好的表现.
 
 ## 命令行推理
 
@@ -17,11 +19,15 @@
 ```bash
 huggingface-cli download fishaudio/fish-speech-1 vq-gan-group-fsq-2x1024.pth --local-dir checkpoints
 huggingface-cli download fishaudio/fish-speech-1 text2semantic-sft-medium-v1.1-4k.pth --local-dir checkpoints
+huggingface-cli download fishaudio/fish-speech-1 vits_decoder_v1.1.ckpt --local-dir checkpoints
 ```
+
 对于中国大陆用户,可使用mirror下载。
+
 ```bash
 HF_ENDPOINT=https://hf-mirror.com huggingface-cli download fishaudio/fish-speech-1 vq-gan-group-fsq-2x1024.pth --local-dir checkpoints
 HF_ENDPOINT=https://hf-mirror.com huggingface-cli download fishaudio/fish-speech-1 text2semantic-sft-medium-v1.1-4k.pth --local-dir checkpoints
+HF_ENDPOINT=https://hf-mirror.com huggingface-cli download fishaudio/fish-speech-1 vits_decoder_v1.1.ckpt --local-dir checkpoints
 ```
 
 ### 1. 从语音生成 prompt: 
@@ -61,6 +67,16 @@ python tools/llama/generate.py \
     如果你在使用自己微调的模型, 请务必携带 `--speaker` 参数来保证发音的稳定性.
 
 ### 3. 从语义 token 生成人声: 
+
+#### VITS 解码
+```bash
+python tools/vits_decoder/inference.py \
+    --checkpoint-path checkpoints/vits_decoder_v1.1.ckpt \
+    -i codes_0.npy -r ref.wav \
+    --text "要生成的文本"
+```
+
+#### VQGAN 解码 (不推荐)
 ```bash
 python tools/vqgan/inference.py \
     -i "codes_0.npy" \
@@ -76,7 +92,8 @@ python -m tools.api \
     --listen 0.0.0.0:8000 \
     --llama-checkpoint-path "checkpoints/text2semantic-sft-medium-v1.1-4k.pth" \
     --llama-config-name dual_ar_2_codebook_medium \
-    --vqgan-checkpoint-path "checkpoints/vq-gan-group-fsq-2x1024.pth"
+    --decoder-checkpoint-path "checkpoints/vq-gan-group-fsq-2x1024.pth" \
+    --decoder-config-name vqgan_pretrain
 
 # 推荐中国大陆用户运行以下命令来启动 HTTP 服务:
 HF_ENDPOINT=https://hf-mirror.com python -m ...
@@ -84,6 +101,14 @@ HF_ENDPOINT=https://hf-mirror.com python -m ...
 
 随后, 你可以在 `http://127.0.0.1:8000/` 中查看并测试 API.
 
+!!! info
+    你应该使用以下参数来启动 VITS 解码器:
+
+    ```bash
+    --decoder-config-name vits_decoder_finetune \
+    --decoder-checkpoint-path "checkpoints/vits_decoder_v1.1.ckpt" # 或者你自己的模型
+    ```
+
 ## WebUI 推理
 
 你可以使用以下命令来启动 WebUI:
@@ -92,7 +117,19 @@ HF_ENDPOINT=https://hf-mirror.com python -m ...
 python -m tools.webui \
     --llama-checkpoint-path "checkpoints/text2semantic-sft-medium-v1.1-4k.pth" \
     --llama-config-name dual_ar_2_codebook_medium \
-    --vqgan-checkpoint-path "checkpoints/vq-gan-group-fsq-2x1024.pth"
+    --decoder-checkpoint-path "checkpoints/vq-gan-group-fsq-2x1024.pth" \
+    --decoder-config-name vqgan_pretrain
 ```
 
+!!! info
+    你应该使用以下参数来启动 VITS 解码器:
+
+    ```bash
+    --decoder-config-name vits_decoder_finetune \
+    --decoder-checkpoint-path "checkpoints/vits_decoder_v1.1.ckpt" # 或者你自己的模型
+    ```
+
+!!! note
+    你可以使用 Gradio 环境变量, 如 `GRADIO_SHARE`, `GRADIO_SERVER_PORT`, `GRADIO_SERVER_NAME` 来配置 WebUI.
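+
+    例如, 一个假设的启动命令, 通过 Gradio 的标准环境变量指定监听地址和端口:
+
+    ```bash
+    GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 python -m tools.webui \
+        --llama-checkpoint-path "checkpoints/text2semantic-sft-medium-v1.1-4k.pth" \
+        --llama-config-name dual_ar_2_codebook_medium \
+        --decoder-checkpoint-path "checkpoints/vq-gan-group-fsq-2x1024.pth" \
+        --decoder-config-name vqgan_pretrain
+    ```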
+
 祝大家玩得开心!
diff --git a/docs/zh/samples.md b/docs/zh/samples.md
index 8cdfa2a5..77554a87 100644
--- a/docs/zh/samples.md
+++ b/docs/zh/samples.md
@@ -17,28 +17,28 @@
     <tbody>
     <tr>
         <td>纳西妲 (原神)</td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/zh/0_input.wav" /></td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/zh/0_output.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/zh/0_input.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/zh/0_output.wav" /></td>
     </tr>
     <tr>
         <td>钟离 (原神)</td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/zh/1_input.wav" /></td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/zh/1_output.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/zh/1_input.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/zh/1_output.wav" /></td>
     </tr>
     <tr>
         <td>芙宁娜 (原神)</td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/zh/2_input.wav" /></td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/zh/2_output.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/zh/2_input.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/zh/2_output.wav" /></td>
     </tr>
     <tr>
         <td>随机说话人 1</td>
         <td> - </td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/zh/4_output.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/zh/4_output.wav" /></td>
     </tr>
     <tr>
         <td>随机说话人 2</td>
         <td> - </td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/zh/5_output.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/zh/5_output.wav" /></td>
     </tr>
     </tbody>
 </table>
@@ -64,13 +64,13 @@
     <tbody>
     <tr>
         <td>纳西妲 (原神)</td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/zh/0_input.wav" /></td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/zh/6_output.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/zh/0_input.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/zh/6_output.wav" /></td>
     </tr>
     <tr>
         <td>随机说话人</td>
         <td> - </td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/zh/7_output.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/zh/7_output.wav" /></td>
     </tr>
     </tbody>
 </table>
@@ -96,7 +96,7 @@
     <tr>
         <td>随机说话人</td>
         <td> - </td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/zh/8_output.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/zh/8_output.wav" /></td>
     </tr>
     </tbody>
 </table>
@@ -122,12 +122,12 @@ patterns to driving cars autonomously, AI's applications are vast and diverse.
     <tr>
         <td>随机说话人 1</td>
         <td> - </td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/en/0_output.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/en/0_output.wav" /></td>
     </tr>
     <tr>
         <td>随机说话人 2</td>
         <td> - </td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/en/1_output.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/en/1_output.wav" /></td>
     </tr>
     </tbody>
 </table>
@@ -155,7 +155,7 @@ me to serve as your personal voice assistant.
     <tr>
         <td>随机说话人</td>
         <td> - </td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/en/2_output.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/en/2_output.wav" /></td>
     </tr>
     </tbody>
 </table>
@@ -181,12 +181,12 @@ me to serve as your personal voice assistant.
     <tr>
         <td>随机说话人 1</td>
         <td> - </td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/ja/0_output.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/ja/0_output.wav" /></td>
     </tr>
     <tr>
         <td>随机说话人 2</td>
         <td> - </td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/ja/1_output.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/ja/1_output.wav" /></td>
     </tr>
     </tbody>
 </table>
@@ -213,7 +213,7 @@ me to serve as your personal voice assistant.
     <tr>
         <td>随机说话人</td>
         <td> - </td>
-        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1-sft/ja/2_output.wav" /></td>
+        <td><audio controls preload="auto" src="https://demo-r2.speech.fish.audio/v1.1-sft-large/ja/2_output.wav" /></td>
     </tr>
     </tbody>
 </table>