In this repo, we rely on the Kaldi-style data formatting.
We take LibriTTS (including the clean and other partitions) as an example. We have organized a `data` directory containing all the LibriTTS data. Here are the steps to set up the data directory:
- Please download from here (about 5MB), and unzip it to `data/` in the project root. Every sub-directory contains a `wav.scp` file, which is a plain text file with one `<key> <value>` pair per line. NOTE: only `wav.scp` is needed for training, although `utt2spk` and `spk2utt` are also present.
- Then, change the paths in `wav.scp` to the correct ones on your machine (see the sketch after this list).
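Each line of `wav.scp` maps an utterance ID to an audio path. Below is a minimal Python sketch for retargeting the path prefix; `OLD_PREFIX` and `NEW_PREFIX` are hypothetical placeholders for the prefix used in the provided files and your local LibriTTS root:

```python
from pathlib import Path

OLD_PREFIX = "/original/path/to/LibriTTS"  # assumption: prefix used in the provided wav.scp
NEW_PREFIX = "/your/path/to/LibriTTS"      # your local LibriTTS root

for name in ["train_all", "dev_all", "eval_all"]:
    scp = Path("data") / name / "wav.scp"
    fixed = []
    for line in scp.read_text().splitlines():
        key, path = line.split(maxsplit=1)  # each line is "<key> <value>"
        fixed.append(f"{key} {path.replace(OLD_PREFIX, NEW_PREFIX)}")
    scp.write_text("\n".join(fixed) + "\n")
```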
We include three types of speech features in vec2wav 2.0. They should all be extracted offline and stored in `feats/`.
- VQ index (together with codebook) from vq-wav2vec. We extracted them using fairseq, and we provide the extracted VQ index sequences with the codebook online.
  - Please download from here (460MB; here for Chinese users).
  - Unzip it to `feats/vqidx`, and change the corresponding paths in `feats.scp`.
  - You can check the feature shape with `feat-to-shape.py scp:feats/vqidx/eval_all/feats.scp | head`. The shapes should be `(frames, 2)` (a programmatic check follows this list).
  - Get the number of frames:

    ```bash
    for name in train_all dev_all eval_all; do
      feat-to-len.py scp:feats/vqidx/$name/feats.scp > data/$name/utt2num_frames
    done
    ```
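  To run the shape check from Python instead of via `feat-to-shape.py`, here is a minimal sketch; it assumes `kaldiio`, but any Kaldi-compatible scp/ark reader works:

  ```python
  import kaldiio

  # Read the first utterance of the eval split and confirm its shape.
  with kaldiio.ReadHelper("scp:feats/vqidx/eval_all/feats.scp") as reader:
      for utt, vqidx in reader:
          print(utt, vqidx.shape)  # expected: (frames, 2), one index per codebook group
          break
  ```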
  Note that you can use `vec2wav2/ssl_models/vqw2v_extractor.py` to extract these indexes locally. To get the `codebook.npy` locally, use:

  ```python
  from vec2wav2.ssl_models.vqw2v_extractor import Extractor
  import numpy as np

  extractor = Extractor()
  codebook = extractor.get_codebook()
  np.save("feats/vqidx/codebook.npy", codebook)
  ```
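  As a quick sanity check, you can load the saved codebook and inspect it; the exact dimensions depend on the vq-wav2vec checkpoint:

  ```python
  import numpy as np

  codebook = np.load("feats/vqidx/codebook.npy")
  # vq-wav2vec quantizes each frame with two codebook groups, which is why the
  # index features above have shape (frames, 2).
  print(codebook.shape, codebook.dtype)
  ```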
- Mel spectrograms (FBanks). As they are too large to host online, we provide a script to extract them locally:

  ```bash
  nj=64  # number of parallel jobs; set this according to your CPU cores
  bash extract_fbank.sh --nj $nj --stage 0 --stop_stage 1
  # Default: 80-dim fbank with 10 ms frame shift.
  # Stage 0 extracts fbank in parallel; stage 1 performs normalization.
  ```
  This will create `feats/fbank` and `feats/normed_fbank`, each about 16GB. You can delete `feats/fbank` after normalization (though it is better to keep `train_all/cmvn.ark` there).
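  To spot-check the result, the sketch below (again assuming a Kaldi-compatible reader such as `kaldiio`) reads one normalized utterance; after CMVN, the per-dimension statistics should be roughly zero-mean and unit-variance:

  ```python
  import kaldiio

  with kaldiio.ReadHelper("scp:feats/normed_fbank/train_all/feats.scp") as reader:
      for utt, feat in reader:
          print(utt, feat.shape)        # (frames, 80) with the default configuration
          print(feat.mean(axis=0)[:5])  # roughly 0 after mean normalization
          print(feat.std(axis=0)[:5])   # roughly 1 after variance normalization
          break
  ```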
- WavLM features. As they are too large to host online, please use `vec2wav2/ssl_models/wavlm_extractor.py` to extract them locally:

  ```bash
  name=train_all  # change to dev_all or eval_all for different splits
  python vec2wav2/ssl_models/wavlm_extractor.py --wav-scp data/$name/wav.scp \
      --out-dir feats/wavlm_l6/$name/ --output-layer 6
  ```

  This will create `feats/wavlm_l6/$name/feats.ark` and `feats.scp`.

  ⚠️ Note that the WavLM features for the entire training set can be very large (~380GB)! It is also reasonable to extract them on the fly, but this might slow down training.
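  The ~380GB figure can be sanity-checked with back-of-envelope arithmetic; the numbers below are assumptions (float32 storage, 1024-dim WavLM-Large features at a 20ms frame shift, and roughly 585 hours of LibriTTS audio):

  ```python
  hours = 585           # approximate total LibriTTS audio (assumption)
  frames_per_sec = 50   # 20 ms frame shift
  dim = 1024            # WavLM-Large hidden size (768 for WavLM-Base)
  bytes_per_value = 4   # float32

  gb = hours * 3600 * frames_per_sec * dim * bytes_per_value / 1e9
  print(f"~{gb:.0f} GB")  # ~431 GB across all splits, consistent with ~380 GB for train_all
  ```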
Finally, you have correctly formatted the data for training!