NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

This is a PyTorch implementation of Microsoft's NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality.

Contributions and pull requests are highly appreciated!

23.02.01: Pretrained models or demo samples will be released soon.

Overview

figure1

NaturalSpeech is a VAE-based model that employs several techniques to improve the prior and simplify the posterior. It differs from VITS in several ways, including:

  • Phoneme pre-training: NaturalSpeech uses a phoneme encoder pre-trained on a large text corpus through masked language modeling on phoneme sequences.
  • Differentiable durator: The posterior operates at the frame level, while the prior operates at the phoneme level. NaturalSpeech uses a differentiable durator to bridge the length difference, expanding phoneme-level features to frame level in a soft, flexible way.
  • Bidirectional Prior/Posterior: NaturalSpeech simplifies the posterior and enhances the prior through a normalizing flow that maps in both directions, trained with forward and backward losses.
  • Memory-based VAE: The prior is further enhanced through a learnable memory bank queried with Q-K-V attention (a minimal sketch follows this list).
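
For illustration, here is a minimal PyTorch sketch of the memory-based prior idea, under the assumption that prior latents act as attention queries into a learnable memory bank; the class name, bank size, and residual connection are illustrative, not the exact design of this repo or the paper.

    import torch
    import torch.nn as nn

    class MemoryBankPrior(nn.Module):
        """Hypothetical sketch: enhance prior latents with Q-K-V attention
        over a learnable memory bank (names and sizes are assumptions)."""

        def __init__(self, latent_dim=192, bank_size=1000, num_heads=2):
            super().__init__()
            # Learnable memory bank shared across all utterances
            self.memory = nn.Parameter(torch.randn(bank_size, latent_dim) * 0.02)
            self.attn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)

        def forward(self, z_p):
            # z_p: [batch, frames, latent_dim] prior latents used as queries
            bank = self.memory.unsqueeze(0).expand(z_p.size(0), -1, -1)
            attended, _ = self.attn(query=z_p, key=bank, value=bank)
            return z_p + attended  # residual keeps the original prior information

    # Toy usage
    z = torch.randn(2, 100, 192)
    print(MemoryBankPrior()(z).shape)  # torch.Size([2, 100, 192])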

Notes

  1. Phoneme pre-training with a large-scale text corpus from the news-crawl dataset is omitted in this implementation.
  2. A multiplier for each loss is exposed in the config and can be adjusted, since training with unweighted losses does not seem to converge.
  3. The tuning stage for the last 2k epochs is omitted.
  4. As the soft-DTW loss uses a lot of VRAM, there is an option to use a non-soft-DTW loss instead.
  5. For the soft-DTW loss, warp is set to 134.4 (= 0.07 * 192), not 0.07, to match the non-soft-DTW loss.
  6. To train the duration predictor in the warmup stage, duration labels are needed. As stated in the paper, you can use any tool to obtain duration labels; here I used a pretrained VITS model.
  7. For memory-efficient training, partial sequences are fed to the decoder, as in VITS (see the sketch after this list).
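
As a companion to note 7, here is a minimal sketch of VITS-style random segment slicing; the function name and segment size are illustrative, and the repo most likely reuses the corresponding VITS utility rather than this exact code.

    import torch

    def rand_slice_segments(x, x_lengths, segment_size=32):
        # x: [batch, channels, frames] padded features; x_lengths: valid frame counts.
        # Pick one random fixed-size window per utterance so the decoder only
        # reconstructs a slice, keeping VRAM usage roughly constant.
        batch = x.size(0)
        max_start = (x_lengths - segment_size).clamp(min=0)
        start = (torch.rand(batch, device=x.device) * (max_start + 1).float()).long()
        segments = torch.stack(
            [x[i, :, s:s + segment_size] for i, s in enumerate(start.tolist())]
        )
        return segments, start

    # Toy usage: 32-frame windows from a padded batch of mel features
    x = torch.randn(4, 80, 100)
    lengths = torch.tensor([100, 90, 64, 40])
    segments, start = rand_slice_segments(x, lengths)
    print(segments.shape)  # torch.Size([4, 80, 32])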

How to train

  1. # python >= 3.6
    pip install -r requirements.txt
    
  2. clone this repository

  3. download The LJ Speech Dataset: link

  4. create symbolic link to ljspeech dataset:

    ln -s /path/to/LJSpeech-1.1/wavs/ DUMMY1
    
  5. text preprocessing (optional; only needed if you are using a custom dataset):

    1. apt-get install espeak
    2. python preprocess.py --text_index 1 --filelists filelists/ljs_audio_text_train_filelist.txt filelists/ljs_audio_text_val_filelist.txt filelists/ljs_audio_text_test_filelist.txt
      
  6. duration preprocessing (obtain duration labels using pretrained VITS):

    1. git clone https://github.com/jaywalnut310/vits.git; cd vits
    2. create symbolic link to ljspeech dataset
      ln -s /path/to/LJSpeech-1.1/wavs/ DUMMY1
      
    3. download the pretrained VITS model (pretrained_ljs.pth) from the official VITS GitHub: github link / pretrained models
    4. setup monotonic alignment search (for VITS inference):
      cd monotonic_align; mkdir monotonic_align; python setup.py build_ext --inplace; cd ..
      
    5. copy duration preprocessing script to VITS repo: cp /path/to/naturalspeech/preprocess_durations.py .
    6. python3 preprocess_durations.py --weights_path ./pretrained_ljs.pth --filelists filelists/ljs_audio_text_train_filelist.txt.cleaned filelists/ljs_audio_text_val_filelist.txt.cleaned filelists/ljs_audio_text_test_filelist.txt.cleaned
      
    7. once the duration labels are created, copy them to the naturalspeech repo: cp -r durations/ /path/to/naturalspeech
  7. train (warmup)

    python3 train.py -c configs/ljs.json -m [run_name] --warmup
    

    Note that ljs.json is for low-resource training: it runs for 1500 epochs and does not use the soft-DTW loss. If you want to reproduce the setup stated in the paper, use ljs_reproduce.json, which runs for 15000 epochs and uses the soft-DTW loss.

  8. initialize and attach the memory bank after warmup:

      python3 attach_memory_bank.py -c configs/ljs.json --weights_path logs/[run_name]/G_xxx.pth
    

    If you run out of memory, you can specify the "--num_samples" argument to use only a subset of samples (a rough, hypothetical sketch of this initialization idea appears after the list).

  9. train (resume)

      python3 train.py -c configs/ljs.json -m [run_name]
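
For intuition on step 8, the following is a rough, hypothetical sketch of how a memory bank could be seeded from a subset of samples after warmup. The helper names (encode_posterior, memory_bank.memory) are invented for illustration and are not this repo's actual API; attach_memory_bank.py may work differently.

    import torch

    @torch.no_grad()
    def init_memory_bank(generator, dataloader, bank_size=1000, num_samples=100):
        # Collect frame-level posterior latents from a few utterances and seed
        # the learnable memory bank with a random subset of those frames.
        # `encode_posterior` and `memory_bank.memory` are hypothetical names.
        generator.eval()
        latents = []
        for i, (spec, spec_lengths) in enumerate(dataloader):
            if i >= num_samples:
                break
            z = generator.encode_posterior(spec, spec_lengths)  # [batch, latent_dim, frames]
            latents.append(z.transpose(1, 2).reshape(-1, z.size(1)))
        latents = torch.cat(latents)                            # [total_frames, latent_dim]
        idx = torch.randperm(latents.size(0))[:bank_size]
        generator.memory_bank.memory.data.copy_(latents[idx])
        return generator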
    

You can use tensorboard to monitor the training.

tensorboard --logdir /path/to/naturalspeech/logs

During each evaluation phase, a selection of samples from the test set is evaluated and saved in the logs/[run_name]/eval directory.

References