forked from kaldi-asr/kaldi
Typical Kaldi project structure
Rustam edited this page Jun 12, 2019 · 1 revision
Notation
- `train / test` - two folders with similar structure and content
- `CAPITAL_CASE` - folders with non-fixed name and structure
- `->` - symlinks
- `*` - optional, but needed to run like in "kaldi for dummies"
Typical metadata structure of a Kaldi project:
Good description can be found here
my_project_folder/
├── *conf # configuration overrides for the decoding and MFCC feature extraction steps, taken from /egs/voxforge
│   ├── *decode.config
│   └── *mfcc.conf
├── data
│   ├── local
│   │   ├── corpus.txt # every utterance text that can occur (from both train and test). Plain text file with one utterance per row, e.g. "HELLO WORLD", "MY NAME IS MIKE", etc.
│   │   └── dict
│   │       ├── lexicon.txt # maps words to phonemes. Rows like: `zero z ih r ow`, `zero z iy r ow`, etc.
│   │       ├── nonsilence_phones.txt # lists all non-silence phones that are present, one phoneme per row.
│   │       ├── optional_silence.txt # lists optional silence phones
│   │       └── silence_phones.txt # lists all silence phones that are present, one phoneme per row.
│   └── train / test # folder(s) with the main mappings for wav files. Each file is a plain text file with one mapping per row
│       ├── spk2gender # speakerID to gender. Rows like: "speaker1 f", "speaker2 m"
│       ├── text # utterance to its transcription. Rows like: "speaker1_a12 HELLO WORLD", "speaker2_a13 MY NAME IS MIKE"
│       ├── utt2spk # utterance (i.e. "<speakerID>_<wav-file name>") to speaker. Rows like: "speaker1_a12 speaker1", "speaker2_a13 speaker2"
│       └── wav.scp # utterance to corresponding wav file. Rows like: "speaker1_a12 /path/to/wavfile/speaker1/a12.wav"
├── WAV_FILES_FOLDER
│   └── SOMEHOW_STRUCTURED_WAV_FILES
├── local
│   └── *score.sh # copy of /egs/voxforge/s5/local/score.sh
├── cmd.sh # configures whether jobs run locally or on Oracle Grid Engine. Instructions on how to do this are in cmd.sh itself
├── path.sh # sets essential paths: KALDI_ROOT, DATA_ROOT, PATH, and others (from /tools/env.sh), plus `export LC_ALL=C` for proper data sorting
├── run.sh # main script that runs everything
├── steps -> kaldi/egs/wsj/s5/steps
└── utils -> kaldi/egs/wsj/s5/utils
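To make the row formats of `text`, `utt2spk` and `wav.scp` concrete, here is a minimal Python sketch that derives them from a list of wav paths. It is only an illustration, not part of Kaldi: the function name `build_data_rows`, the layout `<root>/<speakerID>/<name>.wav` and the `transcripts` dictionary are assumptions made for this example.

```python
from pathlib import PurePosixPath


def build_data_rows(wav_paths, transcripts):
    """Return (text, utt2spk, wav_scp) row lists for one data/{train,test} folder.

    Assumes the hypothetical layout <root>/<speakerID>/<name>.wav and
    utterance IDs of the form <speakerID>_<name>.
    """
    entries = []
    for wav in wav_paths:
        path = PurePosixPath(wav)
        speaker = path.parent.name            # folder name is taken as the speakerID
        utt_id = f"{speaker}_{path.stem}"     # e.g. speaker1_a12
        entries.append((utt_id, speaker, str(path)))

    # Plain string sorting compares code points, which matches the
    # byte-wise order of `LC_ALL=C sort` for ASCII IDs.
    entries.sort()

    text = [f"{utt} {transcripts[utt]}" for utt, _, _ in entries]
    utt2spk = [f"{utt} {spk}" for utt, spk, _ in entries]
    wav_scp = [f"{utt} {wav}" for utt, _, wav in entries]
    return text, utt2spk, wav_scp


wavs = [
    "/path/to/wavfile/speaker2/a13.wav",
    "/path/to/wavfile/speaker1/a12.wav",
]
transcripts = {
    "speaker1_a12": "HELLO WORLD",
    "speaker2_a13": "MY NAME IS MIKE",
}
text, utt2spk, wav_scp = build_data_rows(wavs, transcripts)
print("\n".join(wav_scp))
```

Each returned list can be written to the file of the same name, one row per line. Starting every utterance ID with the speaker ID keeps the utterance order and the speaker order consistent after sorting.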
These are not all the files needed, but the rest can be created in run.sh using Kaldi's utils:
- `data/{train,test}/spk2utt` - maps a speakerID to its utterances, i.e. rows like: `speaker1 speaker1_a13 speaker1_a45 speaker1_a67`. Created by `utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt`
- `data/local/lang/*` and `data/lang/*` - main idea: 1) expand each phone into 4 word-position variants (`_B` - begin, `_E` - end, `_I` - inside, `_S` - singleton, i.e. a word consisting of a single phone) 2) code phones by numbers 3) code words by numbers 4) compile the word-to-phoneme lexicon into an FST (`L.fst`). Created by `utils/prepare_lang.sh data/local/dict "<UNK>" data/local/lang data/lang`
- `lm.arpa` - n-gram language model in ARPA format (log-probabilities and back-off weights estimated from the n-gram counts). Created by `ngram-count -order $lm_order -write-vocab $local/tmp/vocab-full.txt -wbdiscount -text $local/corpus.txt -lm $local/tmp/lm.arpa`. NOTE! SRILM must be installed.
- `G.fst` - the ARPA language model converted to an FST (finite-state transducer) graph, which is then used through OpenFst. Created by `arpa2fst --disambig-symbol=#0 --read-symbol-table=$lang/words.txt $local/tmp/lm.arpa $lang/G.fst`
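The ideas behind the first two items can be sketched in a few lines of Python. This is a simplified illustration, not a replacement for `utils/utt2spk_to_spk2utt.pl` or `utils/prepare_lang.sh`: the function names are invented for this example, and the symbol numbering is an assumption (the real scripts order phones differently and also add disambiguation symbols). In Kaldi's word-position-dependent phones, `_S` marks a singleton, i.e. the only phone of a one-phone word.

```python
def utt2spk_to_spk2utt(utt2spk_rows):
    """Invert 'utterance speaker' rows into 'speaker utt1 utt2 ...' rows."""
    spk2utt = {}
    for row in utt2spk_rows:
        utt, spk = row.split()
        spk2utt.setdefault(spk, []).append(utt)
    # dicts keep insertion order, so speakers appear in order of first occurrence
    return [f"{spk} {' '.join(utts)}" for spk, utts in spk2utt.items()]


def add_word_positions(phones):
    """Tag the phones of one word with _B / _I / _E, or _S for a one-phone word."""
    if len(phones) == 1:
        return [phones[0] + "_S"]
    inside = [p + "_I" for p in phones[1:-1]]
    return [phones[0] + "_B"] + inside + [phones[-1] + "_E"]


def build_symbol_table(symbols):
    """Code symbols by numbers; 0 is reserved for <eps>, as in Kaldi symbol tables.

    Simplification: the remaining symbols are numbered in sorted order.
    """
    table = {"<eps>": 0}
    for sym in sorted(set(symbols)):
        table[sym] = len(table)
    return table


rows = ["speaker1_a13 speaker1", "speaker1_a45 speaker1", "speaker2_a13 speaker2"]
spk2utt = utt2spk_to_spk2utt(rows)

zero = add_word_positions("z ih r ow".split())   # lexicon row: zero z ih r ow
table = build_symbol_table(zero)

print(spk2utt)
print(zero)
print(table)
```

The same numbering idea applies to words: every word gets an integer ID, and the FSTs work on these integer IDs rather than on the strings.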