data_prep.md

File metadata and controls

52 lines (44 loc) · 3.38 KB

Data Preparation

In this repo, we rely on Kaldi-style data formatting. We use LibriTTS (including both the clean and other partitions) as an example.

data/ directory: data manifests

All the LibriTTS data is organized under the data/ directory. Here are the steps to set it up.

  1. Please download from here (about 5MB), and unzip it to data in the project root. Every sub-directory contains a wav.scp file: a plain text file with one <key> <value> pair per line. NOTE: only wav.scp is needed for training, although utt2spk and spk2utt are also present.
  2. Then, edit the paths in each wav.scp so that they point to the correct locations on your machine.
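
Step 2 above can be scripted. The sketch below is a hypothetical helper (not part of the repo) that swaps an old path prefix for a new one in wav.scp lines, relying only on the <key> <value> format described above:

```python
# Hypothetical helper: rewrite the audio paths in a Kaldi-style wav.scp.
# Each line is "<utt-id> <path-to-wav>"; we replace an old path prefix
# with the correct prefix on the current machine.
def rewrite_wav_scp(lines, old_prefix, new_prefix):
    fixed = []
    for line in lines:
        key, path = line.strip().split(maxsplit=1)
        if path.startswith(old_prefix):
            path = new_prefix + path[len(old_prefix):]
        fixed.append(f"{key} {path}")
    return fixed

# Example usage (utterance id and paths are made up for illustration):
lines = ["1034_121119_000001 /old/root/LibriTTS/1034.wav"]
print(rewrite_wav_scp(lines, "/old/root", "/data/corpora"))
```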

feats/ directory: speech features

vec2wav 2.0 uses three types of speech features. They should all be extracted offline and stored in ./feats/.

  • VQ indices (together with the codebook) from vq-wav2vec. We extracted them with fairseq, and we provide the extracted VQ index sequences and codebook online.
    1. Please download from here (460MB; here for Chinese users).
    2. Unzip it to feats/vqidx, and change the corresponding paths in feats.scp.
    3. You can check the feature shapes with feat-to-shape.py scp:feats/vqidx/eval_all/feats.scp | head. The shapes should be (frames, 2).
    4. Get the number of frames:
      for name in train_all dev_all eval_all; do
        feat-to-len.py scp:feats/vqidx/$name/feats.scp > data/$name/utt2num_frames
      done

Note that you can use vec2wav2/ssl_models/vqw2v_extractor.py to extract these indices locally.
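
As a quick sanity check on the utt2num_frames files produced in step 4, you can total the frame counts and convert them to a duration. The helper below is a sketch, and the 10 ms frame shift is an assumption; confirm the actual hop size of your extractor:

```python
# Hypothetical sanity check: sum the per-utterance frame counts in
# data/<split>/utt2num_frames and report the total duration in seconds.
# A 10 ms frame shift is assumed here; adjust to your extractor's hop size.
def total_duration_seconds(utt2num_frames_lines, frame_shift_ms=10):
    total_frames = sum(int(line.split()[1]) for line in utt2num_frames_lines)
    return total_frames * frame_shift_ms / 1000.0

# Example with made-up counts: 100 + 200 frames at 10 ms each -> 3.0 s
print(total_duration_seconds(["utt1 100", "utt2 200"]))
```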

To get the codebook.npy locally, use:

from vec2wav2.ssl_models.vqw2v_extractor import Extractor
import numpy as np

# Load the vq-wav2vec extractor and save its quantizer codebook to disk
extractor = Extractor()
codebook = extractor.get_codebook()
np.save("feats/vqidx/codebook.npy", codebook)
  • Mel spectrograms (FBanks). As they are too large to host online, we provide a script to extract them locally:
    nj=64  # parallel jobs. Set this according to your CPU cores.
    bash extract_fbank.sh --nj $nj --stage 0 --stop_stage 1  # Default: 80-dim with 10ms frame shift
    # Stage 0 extracts fbank in parallel. Stage 1 performs normalization.

This will create feats/fbank and feats/normed_fbank, each about 16GB. You can delete feats/fbank after normalization (though it is better to keep train_all/cmvn.ark there).
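
The normalization in stage 1 is a per-dimension cepstral mean-variance normalization (CMVN). The sketch below approximates the idea with NumPy; in the actual pipeline the statistics come from train_all/cmvn.ark computed over the whole training set, not per utterance:

```python
import numpy as np

# Sketch of per-dimension mean-variance normalization (CMVN-style).
# feats: (num_frames, num_mels) array; mean/std: (num_mels,) statistics.
def apply_cmvn(feats, mean, std, eps=1e-8):
    return (feats - mean) / (std + eps)

# Toy example: normalize a tiny 2-frame, 2-dim "feature matrix"
feats = np.array([[1.0, 2.0], [3.0, 4.0]])
mean, std = feats.mean(axis=0), feats.std(axis=0)
normed = apply_cmvn(feats, mean, std)
```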

  • WavLM features. As they are too large to host online, please use vec2wav2/ssl_models/wavlm_extractor.py to extract them locally:
    name=train_all  # change to dev_all or eval_all for different splits
    python vec2wav2/ssl_models/wavlm_extractor.py --wav-scp data/$name/wav.scp \
                       --out-dir feats/wavlm_l6/$name/ --output-layer 6
    This will create feats/wavlm_l6/$name/feats.ark and feats.scp. ⚠️ Note that the WavLM features for the entire training set can be very large (~380GB)! Extracting them on the fly is also an option, but it may slow down training.
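
All of the feats.scp files above use the same Kaldi indirection: each line maps an utterance id to an "<ark-path>:<byte-offset>" location inside a binary .ark file. A minimal parser for that index looks like this (a sketch only; in practice a library such as kaldiio handles reading the binary ark data):

```python
# Minimal reader for the Kaldi "scp" index used throughout feats/.
# Each line is "<utt-id> <ark-path>:<byte-offset>".
def parse_scp(lines):
    table = {}
    for line in lines:
        key, rxspec = line.strip().split(maxsplit=1)
        ark, _, offset = rxspec.rpartition(":")
        table[key] = (ark, int(offset))
    return table

# Example with a made-up entry:
print(parse_scp(["utt1 feats/wavlm_l6/train_all/feats.ark:42"]))
```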

With all of the above done, your data is correctly formatted for training!