
Training Notes for DeepSpeech 0.9.3 (2021.07.22 pre-release)


Datasets

Total: 475h of cleaned data.

  • CommonVoice 6.1 (Cleaned): 126h
  • MITADS-Speech (Cleaned): 349h

Generate MITADS-Speech Dataset:

# run importers (about 80 GB of free disk space is required)
python evalita_importer.py -d download_folder -o output_csv_folder
python siwis_importer.py -d download_folder -o output_csv_folder
python mspka_importer.py -d download_folder -o output_csv_folder
python m-ailabs_importer.py -d download_folder -o output_csv_folder
python mls_importer.py -d download_folder -o output_csv_folder
python voxforge_importer.py -d download_folder -o output_csv_folder

# run the corpora collector, which applies the corpus aggregation logic defined in the YAML file
python corpora_collector.py -c mitads-speech-full.yaml -o output_csv_folder -d dest_folder
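
To sanity-check how much cleaned audio ends up in the aggregated dataset, the total duration can be estimated directly from the generated CSVs. A minimal sketch, assuming the standard DeepSpeech CSV columns (wav_filename, wav_filesize, transcript) and 16 kHz, 16-bit, mono WAV clips; the path and file pattern are illustrative:

# estimate total hours: wav_filesize in bytes / (16000 samples/s * 2 bytes) ≈ seconds
awk -F, 'FNR > 1 { bytes += $2 } END { print bytes / (16000 * 2) / 3600, "hours" }' dest_folder/train_*.csv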

Download and preprocess CommonVoice Dataset

# run bin/import_cv2.py
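
A minimal sketch of that import step, run from the DeepSpeech checkout and assuming the Common Voice Italian archive was extracted to the path used by the training commands below; --filter_alphabet drops clips whose transcripts contain characters outside the Italian alphabet:

python bin/import_cv2.py \
  --filter_alphabet /mnt/ds_models/italian_alphabet.txt \
  /mnt/datasets/cv/cv_v6/cv-it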

Download and prepare Noise Datasets (for data augmentation)

Follow these steps: https://gitlab.com/Jaco-Assistant/Scribosermo/-/tree/master/preprocessing#download-and-prepare-noise-data

Training the Pure Model (with Data Augmentation)

python DeepSpeech.py \
--train_cudnn true \
--alphabet_config_path /mnt/ds_models/italian_alphabet.txt \
--train_files /mnt/datasets/cv/cv_v6/cv-it/clips/train.csv,/mnt/datasets/mitads-full/train_evalita.csv,/mnt/datasets/mitads-full/train_mailabs.csv,/mnt/datasets/mitads-full/train_mls.csv,/mnt/datasets/mitads-full/train_mspka.csv,/mnt/datasets/mitads-full/train_siwis.csv,/mnt/datasets/mitads-full/train_voxforge.csv \
--dev_files /mnt/datasets/cv/cv_v6/cv-it/clips/dev.csv,/mnt/datasets/mitads-full/dev_evalita.csv,/mnt/datasets/mitads-full/dev_mailabs.csv,/mnt/datasets/mitads-full/dev_mls.csv,/mnt/datasets/mitads-full/dev_mspka.csv,/mnt/datasets/mitads-full/dev_siwis.csv,/mnt/datasets/mitads-full/dev_voxforge.csv \
--train_batch_size 72 \
--dev_batch_size 72 \
--n_hidden 2048 \
--epochs 75 \
--learning_rate 0.0001 \
--force_initialize_learning_rate true \
--reduce_lr_on_plateau true \
--plateau_epochs 3 \
--plateau_reduction 0.1 \
--dropout_rate 0.25 \
--load_checkpoint_dir /mnt/ds_models/ckpts/ita/deepspeech-0.9.3-checkpoint/exp10_augmentation \
--save_checkpoint_dir /mnt/ds_models/ckpts/ita/deepspeech-0.9.3-checkpoint/exp10_augmentation \
--automatic_mixed_precision true \
--early_stop true \
--es_epochs 7 \
--augment reverb[p=0.1,delay=50.0~30.0,decay=10.0:2.0~1.0] \
--augment resample[p=0.1,rate=12000:8000~4000] \
--augment codec[p=0.1,bitrate=48000:16000] \
--augment volume[p=0.1,dbfs=-10:-40] \
--augment pitch[p=0.1,pitch=1.0~0.1] \
--augment tempo[p=0.1,factor=1.0~0.25] \
--augment overlay[p=0.1,source=/mnt/datasets/mitads-full/train_mailabs.csv,layers=10~2,snr=12~4] \
--augment overlay[p=0.1,source=/mnt/datasets/mitads-full/train_mls.csv,layers=10~2,snr=12~4] \
--augment overlay[p=0.5,source=/mnt/datasets/noise-datasets-prepared/train.csv,layers=2:1,snr=18:9~6]
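
The figures in the Test Results below were presumably obtained from a separate test run against the same checkpoint. A minimal sketch of such a run; the test CSV, batch size and external scorer path are illustrative, not taken from the original notes:

python DeepSpeech.py \
  --alphabet_config_path /mnt/ds_models/italian_alphabet.txt \
  --load_checkpoint_dir /mnt/ds_models/ckpts/ita/deepspeech-0.9.3-checkpoint/exp10_augmentation \
  --test_files /mnt/datasets/cv/cv_v6/cv-it/clips/test.csv \
  --test_batch_size 72 \
  --scorer_path /mnt/ds_models/italian.scorer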

Test Results

CommonVoice Test Dataset - WER: 0.301681, CER: 0.124658, loss: 36.316822

MITADS-Speech Test Dataset - WER: 0.143407, CER: 0.043863, loss: 24.309088

Training the Transfer Learning Model (from the English 0.9.3 model)

python DeepSpeech.py \
--show_progressbar true \
--train_cudnn true \
--drop_source_layers 1 \
--alphabet_config_path /mnt/ds_models/italian_alphabet.txt \
--feature_cache /mnt/ds_models/temp_train/sources/feature_cache \
--train_files /mnt/datasets/cv/cv_v6/cv-it/clips/train.csv,/mnt/datasets/mitads-full/train_evalita.csv,/mnt/datasets/mitads-full/train_mailabs.csv,/mnt/datasets/mitads-full/train_mls.csv,/mnt/datasets/mitads-full/train_mspka.csv,/mnt/datasets/mitads-full/train_siwis.csv,/mnt/datasets/mitads-full/train_voxforge.csv \
--dev_files /mnt/datasets/cv/cv_v6/cv-it/clips/dev.csv,/mnt/datasets/mitads-full/dev_evalita.csv,/mnt/datasets/mitads-full/dev_mailabs.csv,/mnt/datasets/mitads-full/dev_mls.csv,/mnt/datasets/mitads-full/dev_mspka.csv,/mnt/datasets/mitads-full/dev_siwis.csv,/mnt/datasets/mitads-full/dev_voxforge.csv \
--train_batch_size 72 \
--dev_batch_size 72 \
--n_hidden 2048 \
--epochs 75 \
--learning_rate 0.0001 \
--force_initialize_learning_rate true \
--reduce_lr_on_plateau true \
--plateau_epochs 3 \
--plateau_reduction 0.1 \
--dropout_rate 0.25 \
--max_to_keep 3 \
--load_checkpoint_dir /mnt/ds_models/ckpts/en/deepspeech-0.9.3-checkpoint \
--save_checkpoint_dir /mnt/ds_models/ckpts/ita/deepspeech-0.9.3-checkpoint/exp1_transferlearning \
--export_dir /mnt/ds_models/transfer_model_0.9.3 \
--summary_dir /mnt/tboard_logs/exp1_transferlearning \
--log_level 1 \
--early_stop true \
--es_epochs 10
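
--export_dir above already produces a protobuf model at the end of training. To additionally export a TFLite model for on-device inference, the same checkpoint can be re-exported; a minimal sketch using standard DeepSpeech 0.9.3 flags (not part of the original notes):

python DeepSpeech.py \
  --alphabet_config_path /mnt/ds_models/italian_alphabet.txt \
  --load_checkpoint_dir /mnt/ds_models/ckpts/ita/deepspeech-0.9.3-checkpoint/exp1_transferlearning \
  --n_hidden 2048 \
  --export_tflite true \
  --export_dir /mnt/ds_models/transfer_model_0.9.3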

Test Results

CommonVoice Test Dataset - WER: 0.264702, CER: 0.107798, loss: 31.233347

MITADS-Speech Test Dataset - WER: 0.129138, CER: 0.041630, loss: 20.446140

Scorer Model

The scorer built for the previous 0.8 release was reused.
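
For reference, a DeepSpeech scorer is built from a text corpus with the generate_lm.py and generate_scorer_package tools shipped with the 0.9.x release. A minimal sketch, assuming an Italian text corpus in corpus.txt.gz, a local KenLM build, and illustrative alpha/beta values (normally tuned with lm_optimizer.py):

python data/lm/generate_lm.py --input_txt corpus.txt.gz --output_dir . \
  --top_k 500000 --kenlm_bins /path/to/kenlm/build/bin \
  --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" \
  --binary_a_bits 255 --binary_q_bits 8 --binary_type trie

./generate_scorer_package --alphabet /mnt/ds_models/italian_alphabet.txt \
  --lm lm.binary --vocab vocab-500000.txt --package kenlm-italian.scorer \
  --default_alpha 0.93 --default_beta 1.18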