Typical Kaldi project structure

Rustam edited this page Jun 12, 2019 · 1 revision

Notation

  • train / test - two folders with the same structure and the same kinds of files
  • CAPITAL_CASE - folders whose name and internal structure are not fixed
  • -> - symlinks
  • * - optional, but needed to run the setup as in "Kaldi for Dummies"

Typical metadata structure of a Kaldi project:

A good description can be found here

my_project_folder/
├── *conf # configuration overrides for decoding and MFCC feature extraction, taken from /egs/voxforge
│   ├── *decode.config
│   └── *mfcc.conf
├── data
│   ├── local
│   │   ├── corpus.txt # Every utterance text that can occur (from both train and test). Plain txt file, one utterance per row, e.g. "HELLO WORLD", "MY NAME IS MIKE", etc.
│   │   └── dict
│   │       ├── lexicon.txt            # mapping words to phonemes. Rows like: `zero z ih r ow`, `zero z iy r ow`, etc.
│   │       ├── nonsilence_phones.txt  # lists all nonsilence phones that are present, one row - one phoneme.
│   │       ├── optional_silence.txt   # lists optional silence phones
│   │       └── silence_phones.txt     # lists all silence phones that are present, one row - one phoneme.
│   ├── train / test # Folder(s) with the main mappings for wav files. Each file is a plain txt file: one row - one mapping
│   │   ├── spk2gender # speakerID to gender. Rows like: "speaker1 f", "speaker2 m"
│   │   ├── text       # utterance to its transcription. Rows like: "speaker1_a12 HELLO WORLD", "speaker2_a13 MY NAME IS MIKE"
│   │   ├── utt2spk    # utterance (i.e. "<speakerID>_<wav-file name>") to speaker. Rows like: "speaker1_a12 speaker1", "speaker2_a13 speaker2"
│   │   └── wav.scp    # utterance to the corresponding wav file. Rows like: "speaker1_a12 /path/to/wavfile/speaker1/a12.wav"
├── WAV_FILES_FOLDER
│   └── SOMEHOW_STRUCTURED_WAV_FILES 
├── local
│   └── *score.sh  # copy of /egs/voxforge/s5/local/score.sh
├── cmd.sh   # configures whether jobs run locally or on Oracle GridEngine. Instructions on how to do this are in cmd.sh itself
├── path.sh  # sets essential paths: KALDI_ROOT, DATA_ROOT, PATH, and others (from /tools/env.sh); also exports LC_ALL=C for proper data sorting
├── run.sh   # top-level script that runs the whole recipe
├── steps -> kaldi/egs/wsj/s5/steps
└── utils -> kaldi/egs/wsj/s5/utils
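
The mapping files under data/train can be generated from the wav folder instead of being written by hand. Below is a minimal sketch, not part of any Kaldi recipe. It assumes a hypothetical layout WAV_FILES_FOLDER/<speakerID>/<name>.wav (the real layout is up to you) and uses empty placeholder files in a temporary directory in place of real recordings. It writes wav.scp and utt2spk with utterance IDs of the form <speakerID>_<name> and sorts them under LC_ALL=C, as Kaldi expects.

```shell
#!/usr/bin/env bash
# Sketch: build wav.scp and utt2spk for data/train from a wav folder.
# ASSUMPTION: wav files are laid out as <wav_root>/<speakerID>/<name>.wav.
set -euo pipefail
export LC_ALL=C   # Kaldi expects files sorted in C-locale order

work=$(mktemp -d)
wav_root="$work/WAV_FILES_FOLDER"
out="$work/data/train"
mkdir -p "$wav_root/speaker1" "$wav_root/speaker2" "$out"
# Empty placeholder files stand in for real recordings.
touch "$wav_root/speaker1/a12.wav" "$wav_root/speaker2/a13.wav"

: > "$out/wav.scp"
: > "$out/utt2spk"
for wav in "$wav_root"/*/*.wav; do
  spk=$(basename "$(dirname "$wav")")   # folder name is the speakerID
  name=$(basename "$wav" .wav)          # file name without extension
  utt="${spk}_${name}"                  # utterance ID: <speakerID>_<name>
  echo "$utt $wav" >> "$out/wav.scp"
  echo "$utt $spk" >> "$out/utt2spk"
done

# Sort both files in place by utterance ID.
sort -o "$out/wav.scp" "$out/wav.scp"
sort -o "$out/utt2spk" "$out/utt2spk"
cat "$out/utt2spk"
```

Prefixing the utterance ID with the speakerID keeps utt2spk sorted by speaker as well as by utterance, which is why the tree above uses that naming scheme. The text and spk2gender files still have to come from your own transcriptions and speaker metadata.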

These are not all the files that are needed, but the rest can be created in run.sh using Kaldi's utils:

  • data/{train,test}/spk2utt - maps each speakerID to all of its utterances, i.e. rows like: speaker1 speaker1_a13 speaker1_a45 speaker1_a67. Created by utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt
  • data/local/lang/* and data/lang/* - main idea: 1) expand each phone into 4 word-position-dependent variants (*_B - begin, *_E - end, *_I - internal, *_S - singleton, i.e. the only phone of a word) 2) map phones to integer IDs (phones.txt) 3) map words to integer IDs (words.txt) 4) compile the lexicon (word-to-phoneme mapping) into an FST (L.fst). Created by utils/prepare_lang.sh data/local/dict "<UNK>" data/local/lang data/lang
  • lm.arpa - n-gram language model in ARPA format, estimated from corpus.txt with discounted n-gram counts. Created by ngram-count -order $lm_order -write-vocab $local/tmp/vocab-full.txt -wbdiscount -text $local/corpus.txt -lm $local/tmp/lm.arpa NOTE! SRILM must be installed.
  • G.fst - the ARPA language model (lm.arpa) converted to an FST (finite-state transducer) graph, which is then used by OpenFst. Created by arpa2fst --disambig-symbol=#0 --read-symbol-table=$lang/words.txt $local/tmp/lm.arpa $lang/G.fst
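
To make the spk2utt format from the list above concrete, here is a minimal sketch of the utt2spk to spk2utt inversion written with plain awk. It is an illustration only, not the code of utils/utt2spk_to_spk2utt.pl; in a real project use the Kaldi script. The utt2spk content and the temporary directory are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of the utt2spk -> spk2utt inversion; NOT the Kaldi script itself.
set -euo pipefail
export LC_ALL=C

dir=$(mktemp -d)
# Hypothetical utt2spk content, already sorted by utterance ID.
printf '%s\n' \
  'speaker1_a13 speaker1' \
  'speaker1_a45 speaker1' \
  'speaker2_a13 speaker2' > "$dir/utt2spk"

# Collect all utterance IDs of each speaker on a single line,
# then sort the result by speakerID.
awk '{
       if ($2 in utts) utts[$2] = utts[$2] " " $1
       else            utts[$2] = $1
     }
     END { for (s in utts) print s, utts[s] }' "$dir/utt2spk" \
  | sort > "$dir/spk2utt"

cat "$dir/spk2utt"
```

Each output row starts with a speakerID followed by that speaker's utterance IDs, which matches the row format described for data/{train,test}/spk2utt.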