This code accompanies the paper *Video-realistic expressive audio-visual speech synthesis for the Greek language*. You can get the preprint on ResearchGate.
Samples: Neutral | Angry | Happy | Sad
```
├── data       # Code to extract audio-visual features for training. Modified from HTS (http://hts.sp.nitech.ac.jp/)
├── hts        # Code to train an expressive audio-visual talking head. Modified from HTS (http://hts.sp.nitech.ac.jp/)
├── merlin     # Code to train a DNN-based expressive audio-visual talking head. Modified from Merlin (https://github.com/CSTR-Edinburgh/merlin)
├── aam_model  # Code to synthesize the active appearance model from shape and texture features.
├── LICENSE
└── README.md
```
- You need to download the HTS toolkit and SPTK.
- HTK and Festival are needed only for some special features (e.g., if you want to create your own labels).
- For DNN-based synthesis you will need Theano and Python 2.7.
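As a quick sanity check before starting, you can verify that the toolchain is visible on your system (a sketch only; `mgcep` and `HMGenS` are just example SPTK/HTS binaries, and the Theano check matters only for the DNN part):

```bash
# Sanity check (assumes the tools are installed on your PATH):
which mgcep                                                # an SPTK tool
which HMGenS                                               # an HTS (HTS-patched HTK) tool
python2.7 -c "import theano; print(theano.__version__)"    # needed only for the DNN (Merlin) part
```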
- Download the CVSP-EAV dataset from here and extract it. If you wish to download the CVSP-EAV dataset, email me at filby[at]central.ntua.gr.
- Put the AAM model downloaded from the dataset (file `all_emotions.mat`) in the `aam_model/model/` directory.
- Download the STRAIGHT vocoder from https://github.com/HidekiKawahara/legacy_STRAIGHT.
- Compile the mex files in `aam_model/mex`, needed for the facial reconstruction, by calling `make`.
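The setup steps above can be combined roughly as follows (a sketch only; `/path/to/CVSP-EAV` is a placeholder for wherever you extracted the dataset, and you can clone STRAIGHT wherever your configuration later points to):

```bash
# Sketch of the setup steps (placeholder paths):
git clone https://github.com/HidekiKawahara/legacy_STRAIGHT.git    # STRAIGHT vocoder
cp /path/to/CVSP-EAV/all_emotions.mat aam_model/model/             # AAM model from the dataset
cd aam_model/mex && make && cd ../..                               # compile the mex files
```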
- In the `data/` subdirectory, edit the Makefile to point to your system paths (the section you need to edit is marked with comments) and the desired feature types (e.g., emotion) and outputs.
- Then extract the STRAIGHT waveform features with `make straight`. This step takes a long time, and the resulting features have a total size of around 110 GB.
- Then extract the mel-generalized cepstral coefficients, the pitch, and the band-aperiodicity components with `make features`.
- Copy the folder `hts_style_labels` from the CVSP-EAV dataset into the `data/` subdirectory and rename the folder to just `labels`.
- In the `data/` subdirectory, create some additional files needed for training by running `make labels`.
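With the Makefile edited, the whole data stage looks roughly like this (a sketch; `/path/to/CVSP-EAV` is again a placeholder):

```bash
cd data/
make straight                                      # STRAIGHT analysis; slow, ~110 GB of features
make features                                      # mel-generalized cepstra, pitch, band aperiodicities
cp -r /path/to/CVSP-EAV/hts_style_labels labels    # copy the labels and rename the folder
make labels                                        # additional files needed for training
```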
- In the `hts/` subdirectory, edit the configuration script `Configuration.pm` to point to your system paths (the section you need to edit is marked with comments) and configuration choices (e.g., select emotion).
- Train the HMM models with `./Training.pl Configuration.pm`. This will take a lot of time (up to 5-6 hours) depending on your system specs. If an error occurs during training, the steps up to that point do not need to be repeated; you can select which training steps run from the switches in `Configuration.pm`.
- You will find the output in `hts/gen`.
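Since training can take several hours, it may be convenient to run it as a background job and keep a log (a sketch only; `nohup` and the log file are not required by the scripts):

```bash
cd hts/
nohup ./Training.pl Configuration.pm > training.log 2>&1 &
tail -f training.log        # follow progress; the output ends up under hts/gen/
```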
- Edit `merlin/egs/greek/s1/scripts/setup.sh` to point to your system paths (the section you need to edit is marked with comments) and configuration choices (e.g., select emotion).
- Train the DNN models and generate output with `./run_full_voice.sh`.
- You will find the output in `merlin/egs/greek/s1/experiments`.
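The Merlin stage is driven from the `s1` recipe directory, for example (a sketch; this is the part that needs Python 2.7 and Theano):

```bash
cd merlin/egs/greek/s1/
./run_full_voice.sh     # trains the DNN models and generates output
ls experiments/         # results appear here
```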
The code for HMM adaptation and interpolation is missing. Maybe I will add it at some point, but it is currently not in my plans (it is old, buggy, and a hassle to package).
As with adaptation and interpolation, the code to train the AAM model from scratch is not provided. If you want the hand-labelled images and landmarks I used to train it, you can e-mail me. I used AAMtools from George Papandreou.
The code for the unit selection part of the paper is not available (commercial software from Innoetics).
Special thanks to
- Nassos Katsamanis for his guidance during this project and initial codebase.
- Pyrros Tsiakoulis for his help in the unit selection part of the paper.
- George Papandreou for his code on active appearance models.
- Dimitra Tarousi for the recording of the CVSP-EAV database.
This project is licensed under the GPL v3 License - see the LICENSE file for details.