2023.9.29: Added explanations for the ESPnet environment of the back-end ASR system.
2023.9.22: We have fixed several bugs. If you have downloaded the code before, please download it again to synchronize it.
The MISP 2023 challenge focuses on the Audio-Visual Target Speaker Extraction (AVTSE) task. For a description of the task setting, dataset, and baseline, please refer to the challenge overview paper.
We use the ASR back-end to decode the extracted speech and compute the character error rate (CER) as the evaluation metric. The results of the baseline system on the development set are as follows:
Systems | Sub. (%) | Del. (%) | Ins. (%) | CER (%) |
---|---|---|---|---|
Beamforming | 31.0 | 7.2 | 4.8 | 43.0 |
GSS | 20.0 | 4.2 | 2.2 | 26.4 |
GSS+AEASE | 21.5 | 4.9 | 2.2 | 28.6 |
GSS+MEASE | 20.4 | 4.4 | 2.2 | 27.0 |
GSS+MEASE+Finetune | 19.8 | 4.7 | 1.8 | 26.3 |
The GSS+MEASE+Finetune results are used as the baseline system results on the development set. Furthermore, we calculated DNSMOS P.835 scores as a reference to explore the relationship between speech auditory quality and back-end task performance; please refer to the overview paper.
For the evaluation set, the results of the baseline system are shown below:
Systems | Sub. (%) | Del. (%) | Ins. (%) | CER (%) |
---|---|---|---|---|
Beamforming | 38.3 | 11.5 | 4.4 | 54.2 |
GSS | 27.8 | 7.2 | 2.5 | 37.6 |
GSS+MEASE+Finetune | 26.7 | 7.1 | 2.3 | 36.1 |
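For reference, the Sub./Del./Ins. columns are percentages of the reference characters, so they sum to the CER. A minimal sketch of the computation (the counts below are illustrative, not taken from the tables):

```python
def cer(num_sub: int, num_del: int, num_ins: int, num_ref_chars: int) -> float:
    """Character error rate: substitutions + deletions + insertions over reference characters, in %."""
    return 100.0 * (num_sub + num_del + num_ins) / num_ref_chars

# Illustrative example: 310 substituted, 72 deleted, and 48 inserted characters
# against 1000 reference characters gives a CER of 43.0%.
print(f"{cer(310, 72, 48, 1000):.1f}")  # 43.0
```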
The baseline code consists of three parts: data preparation and simulation, MEASE model training and decoding, and back-end ASR decoding. MEASE is an audio-visual speech enhancement model; you can refer to this paper for details.
Before starting, you need to configure the Kaldi, ESPnet, and GPU-GSS environments. You must configure the GPU-GSS environment according to this link; otherwise, errors will occur.
In this step, we perform data preparation, data simulation, and GSS.
- Data preparation
First, we cut relatively clean near-field audio based on the DNSMOS score, and then prepare the related files. The list of clean audio is in /simulation/data/train_clean_config/segments. Note that this selection is not unique; you can also perform data cleaning according to your own standards (see the sketch after the commands below).
cd simulation
bash prepare_data.sh # (Please change your file path in the script.)
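As a rough illustration of this kind of cleaning (not part of the baseline scripts), the sketch below keeps only utterances whose DNSMOS score exceeds a threshold. The score file name, its layout, and the threshold of 3.0 are assumptions made for the example:

```python
# Hypothetical example: filter a Kaldi-style segments file by per-utterance DNSMOS scores.
# Assumes dnsmos_scores.tsv contains lines of the form "<utt-id>\t<score>" (not provided by the baseline).
THRESHOLD = 3.0  # assumed cutoff; choose your own standard

scores = {}
with open("dnsmos_scores.tsv") as f:
    for line in f:
        utt_id, score = line.strip().split("\t")
        scores[utt_id] = float(score)

with open("segments") as fin, open("segments_clean", "w") as fout:
    for line in fin:
        utt_id = line.split()[0]  # the first field of a segments line is the utterance ID
        if scores.get(utt_id, 0.0) >= THRESHOLD:
            fout.write(line)
```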
- Data simulation
Second, we use the clean near-field audio and noise data to simulate 6-channel far-field audio. The simulation script is mainly based on this link. Note that the simulation method is not unique; you may use other simulation methods in the MISP 2023 challenge (a conceptual sketch follows the command below).
bash run_simulation.sh # (Please change your file path in the script.)
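Conceptually, the simulation convolves clean near-field speech with a multi-channel room impulse response and adds noise at a target SNR. The sketch below only illustrates that idea; the array shapes and SNR are assumptions, and the baseline actually uses the script linked above:

```python
import numpy as np
from scipy.signal import fftconvolve

def simulate_far_field(clean: np.ndarray, rir: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """clean: (T,) near-field speech, rir: (6, L) room impulse responses,
    noise: (6, T) multi-channel noise. Returns (6, T) simulated far-field audio."""
    # Convolve the clean speech with each channel's RIR to get the reverberant signal.
    reverberant = np.stack([fftconvolve(clean, rir[ch])[: len(clean)] for ch in range(rir.shape[0])])
    # Scale the noise to reach the target SNR, then mix.
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise[:, : reverberant.shape[1]] ** 2) + 1e-12
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return reverberant + gain * noise[:, : reverberant.shape[1]]
```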
- GSS
Then, we perform GSS on the simulated data, the real development set, and the real training set.
cd simulation/gss
bash run.sh # (Please change your file path in the script.)
cd simulation/gss_main/gss_main
bash prepare_gss_dev.sh
bash prepare_gss_train.sh
- Pre-trained model
The mee.tar in the directory /model_src/pretrainModels is a pre-trained multimodal embedding extractor. It is frozen during the training of MEASE. The mease.tar is a pre-trained MEASE model that can be used to initialize MEASE for joint optimization of MEASE and ASR backend. The mease_asr_jointOptimization.tar is the pre-trained MEASE_Finetune model. The valid.acc.ave_5best.pth is the pre-trained ASR model.
You can download these models in this link of Google drive: https://drive.google.com/drive/folders/16TuN4_ClSCSo4YRUOFwGOt-lmA16v5bg.
After that, you need to put mee.tar, mease.tar, and mease_asr_jointOptimization.tar in MEASE/model_src/pretrainModels/, and put valid.acc.ave_5best.pth in both MEASE/model_src/asr_file/ and asr/exp/a/a_conformer_farmidnear/.
- MEASE data preparation
First, prepare the json files for MEASE.
cd MEASE/data_prepare
bash prepare_data.sh # (Please change your file path in the script.)
For the training set, generate the cmvn files required for training.
python feature_cmvn.py --annotate "../data_prepare/misp_sim_trainset.json" #generate cmvn files for training MEASE
python feature_cmvn.py --annotate "../data_prepare/misp_real_trainset.json" #generate cmvn files for joint optimization of MEASE and ASR backend
Please change some file paths in feature_cmvn.py to your own paths.
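For intuition, the cmvn files store global mean and variance statistics of the input features, which are later used to normalize them. A minimal sketch of the idea (not the actual feature_cmvn.py, which reads the json annotation files):

```python
import numpy as np

def compute_cmvn(feature_matrices):
    """feature_matrices: iterable of (T_i, D) arrays; returns the global per-dimension mean and std."""
    stacked = np.concatenate([np.asarray(f) for f in feature_matrices], axis=0)
    return stacked.mean(axis=0), stacked.std(axis=0)

def apply_cmvn(features, mean, std, eps=1e-8):
    """Normalize a (T, D) feature matrix with precomputed statistics."""
    return (features - mean) / (std + eps)
```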
The directory /model_src/experiment_config contains the following four files:
0_1_MISP_Mease.yml ## Configuration file for training MEASE
0_1_MISP_Mease_test.yml ## Configuration file for decoding the real test set with MEASE
1_1_MISP_Mease_ASR_jointOptimization.yml ## Configuration file for joint optimization of MEASE and the ASR backend
1_2_MISP_Mease_ASR_jointOptimization_test.yml ## Configuration file for decoding the real test set with MEASE after joint training
Please change some file paths in the configuration files to your own path.
- MEASE model training
cd MEASE/model_src
## Train the MEASE model
source run.sh 0.1
## Joint optimization of MEASE and the ASR backend. Before joint training, you need to configure the ESPnet environment according to the back-end ASR system instructions below.
source run.sh 0.3
Please change some file paths in the script to your own path.
- Decoding the real test set with the official pre-trained models
cd model_src
## Decoding real test set with MEASE.
source run.sh 0.0.1
## Decoding real test set with MEASE after joint training.
source run.sh 0.0.2
- Decoding the real test set with your own models
cd model_src
## Decoding real test set with MEASE.
source run.sh 0.2
## Decoding real test set with MEASE after joint training.
source run.sh 0.4
Please change some file paths in the script to your own path.
- Env Setup
After installing ESPnet and Kaldi, replace the directories named "espnet" and "espnet2" in your installed ESPnet package with the corresponding directories from this repository. Then enter the workspace directory ./asr, set up the standard ESPnet directories such as utils, steps, scripts, and pyscripts, and customize the path.sh and related shell configuration files. Here is a reference script.
- Prepare Kaldi-format data directory
As shown in the example in the directory, you need to prepare 5 Kaldi-format files, namely wav.scp, utt2spk, text, text_shape.char, and speech_shape. In addition, to keep everyone consistent, we offer a standard UID list for the final test. Please note that the timestamp in the UID is rounded up to the nearest multiple of 4, so you need to round up your timestamps to match the UIDs (see the sketch below). If you use our baseline code, you don't need to worry about this, since the rounding is already done during GSS.
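For example, rounding a timestamp up to the nearest multiple of 4 can be done as follows (a minimal sketch; the variable names and the example value are illustrative):

```python
def round_up_to_multiple_of_4(t: int) -> int:
    """Round a timestamp up to the nearest multiple of 4, matching the official UID list."""
    return ((t + 3) // 4) * 4

print(round_up_to_multiple_of_4(1234))  # -> 1236
```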
- Released ASR model
You can download the pre-trained ASR model (valid.acc.ave_5best.pth) from here and place it in the asr/exp/a_conformer_farmidnear directory with the name valid.acc.ave_5best.pth. Please note that any update to the ASR model is not allowed during test time (the back-end is frozen). Here are some details about the ASR model. It consists of a 12-layer Conformer encoder and a 6-layer Transformer decoder; you can find more information about the architecture in the train_asr.yaml file. During ASR training, utterances from the far, middle, and near fields are used, and GPU-GSS is applied to the far- and middle-field utterances. Various data augmentation techniques are applied to the far-field audio, including noise addition plus Room Impulse Response (RIR) convolution (3-fold), speed perturbation (2-fold, with speeds 0.9 and 1.1), and concatenation of nearby segments (2-fold), yielding a 10-fold training set.
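As one illustration of the augmentation described above, speed perturbation can be implemented by simple resampling; the sketch below uses linear interpolation and is only a simplified stand-in for the sox/Kaldi-style implementation used in practice:

```python
import numpy as np

def speed_perturb(wav: np.ndarray, factor: float) -> np.ndarray:
    """Resample a waveform so it plays `factor` times faster at the original sample rate
    (factor 0.9 slows it down, 1.1 speeds it up), changing both tempo and pitch."""
    new_len = int(round(len(wav) / factor))
    new_idx = np.linspace(0.0, len(wav) - 1, new_len)
    return np.interp(new_idx, np.arange(len(wav)), wav)

# Example: create the 0.9x and 1.1x copies used for the 2-fold speed perturbation.
wav = np.random.randn(16000)  # stand-in for one second of 16 kHz audio
slow, fast = speed_perturb(wav, 0.9), speed_perturb(wav, 1.1)
```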
- Run Decoding
./decoder_asr.sh
# options:
test_sets= # customized Kaldi-like eval data dir, e.g. dump/raw/dev_far_farlip
asr_exp=exp/a/a_conformer_farmidnear # model path; the model config file is $asr_exp/config.yaml
inference_config=conf/decode_asr.yaml # decode config
use_lm=false # LM is forbidden
# stage 1: decoding, stage 2: scoring
# If you encounter a "Permission denied" issue, you can resolve it with chmod a+x.
- Kaldi
- ESPnet
- pyrirgen
- GPU-GSS
- Python packages:
pytorch
argparse
numpy
tqdm
s3prl
opencv-python
resampy
einops
g2p_en
scipy
prefetch_generator
matplotlib
soundfile
tensorboard
If you find this code useful in your research, please consider citing the following papers:
@misc{wu2023multimodal,
title={The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction},
author={Shilong Wu and Chenxi Wang and Hang Chen and Yusheng Dai and Chenyue Zhang and Ruoyu Wang and Hongbo Lan and Jun Du and Chin-Hui Lee and Jingdong Chen and Shinji Watanabe and Sabato Marco Siniscalchi and Odette Scharenborg and Zhong-Qiu Wang and Jia Pan and Jianqing Gao},
year={2023},
eprint={2309.08348},
archivePrefix={arXiv},
primaryClass={eess.AS}
}
@article{chen2021correlating,
title={Correlating subword articulation with lip shapes for embedding aware audio-visual speech enhancement},
author={Chen, Hang and Du, Jun and Hu, Yu and Dai, Li-Rong and Yin, Bao-Cai and Lee, Chin-Hui},
journal={Neural Networks},
volume={143},
pages={171--182},
year={2021},
publisher={Elsevier}
}
@inproceedings{dai2023improving,
title={Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder},
author={Dai, Yusheng and Chen, Hang and Du, Jun and Ding, Xiaofei and Ding, Ning and Jiang, Feijun and Lee, Chin-Hui},
booktitle={2023 IEEE International Conference on Multimedia and Expo (ICME)},
pages={2627--2632},
year={2023},
organization={IEEE}
}