# Voice Conversion

This is a collection of speech models: a speaker encoder, a voice conversion model, and a vocoder, which together form a complete voice conversion pipeline.

The speaker encoder model is an implementation that uses the GE2E loss; the code is from https://github.com/CorentinJ/Real-Time-Voice-Cloning
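For orientation, here is a minimal PyTorch sketch of the softmax variant of the GE2E loss (Wan et al., 2018). This is not the code from the repo above; the shapes and the learnable scalars `w`/`b` are illustrative:

```python
import torch
import torch.nn.functional as F

def ge2e_softmax_loss(embeds, w, b):
    # embeds: (n_spk, n_utt, dim) utterance embeddings, one row per speaker
    # w, b:   learnable scalar tensors scaling the cosine similarities,
    #         e.g. w = torch.tensor(10.0, requires_grad=True),
    #              b = torch.tensor(-5.0, requires_grad=True)
    n_spk, n_utt, _ = embeds.shape
    centroids = embeds.mean(dim=1)                          # (n_spk, dim)
    # Exclusive centroid: each utterance is left out of its own speaker's
    # centroid, which stabilizes the positive similarities.
    excl = (centroids.unsqueeze(1) * n_utt - embeds) / (n_utt - 1)

    # Cosine similarity of every utterance against every speaker centroid.
    sim = torch.einsum('sud,kd->suk',
                       F.normalize(embeds, dim=-1),
                       F.normalize(centroids, dim=-1))      # (n_spk, n_utt, n_spk)
    # Swap in the exclusive version on the same-speaker entries.
    idx = torch.arange(n_spk)
    sim[idx, :, idx] = F.cosine_similarity(embeds, excl, dim=-1)

    logits = sim * w.clamp(min=1e-6) + b
    target = idx.repeat_interleave(n_utt)
    return F.cross_entropy(logits.reshape(n_spk * n_utt, n_spk), target)
```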

The voice conversion model is AutoVC; the code is from https://github.com/auspicious3000/autovc
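The core AutoVC idea is an autoencoder with a narrow content bottleneck: the decoder is conditioned on a speaker embedding, so swapping in a different embedding at inference time converts the voice. A toy sketch of the information flow (the layer sizes are placeholders, not the paper's architecture, which uses stacked convolutions, BLSTMs, and a time-downsampled bottleneck):

```python
import torch
import torch.nn as nn

class AutoVCSketch(nn.Module):
    """Illustrative bottleneck autoencoder in the spirit of AutoVC."""
    def __init__(self, n_mels=80, spk_dim=256, bottleneck=32):
        super().__init__()
        self.encoder = nn.GRU(n_mels + spk_dim, bottleneck, batch_first=True)
        self.decoder = nn.GRU(bottleneck + spk_dim, 512, batch_first=True)
        self.out = nn.Linear(512, n_mels)

    def forward(self, mel, src_emb, tgt_emb):
        # mel: (batch, frames, n_mels); embeddings: (batch, spk_dim)
        T = mel.size(1)
        # Content codes: squeezed through a bottleneck narrow enough that
        # speaker identity is forced out and must come from the embedding.
        codes, _ = self.encoder(
            torch.cat([mel, src_emb.unsqueeze(1).expand(-1, T, -1)], dim=-1))
        hidden, _ = self.decoder(
            torch.cat([codes, tgt_emb.unsqueeze(1).expand(-1, T, -1)], dim=-1))
        return self.out(hidden)  # converted (or reconstructed) mel

# Training uses tgt_emb == src_emb (self-reconstruction); at conversion
# time you substitute the target speaker's embedding.
```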

The vocoder model is MelGAN; the code is from https://github.com/descriptinc/melgan-neurips
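If you just want to sanity-check the vocoder, the MelGAN repo ships a torch.hub entry point (see its README). A rough sketch; the entry-point name, tensor shapes, and sample rate below are from memory, so verify them against that repo:

```python
import torch

# Pretrained MelGAN via torch.hub (entry-point name per the repo's README).
vocoder = torch.hub.load('descriptinc/melgan-neurips', 'load_melgan')

mel = torch.randn(1, 80, 200)        # (batch, n_mels, frames) log-mel input
with torch.no_grad():
    waveform = vocoder.inverse(mel)  # (batch, samples), assumed 22.05 kHz
```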

You must make appropriate changes (e.g. the dataset paths, and the model parameters in hparams.py or the train_xxx.py files) in order to run the code. I'm not going to explain the code: it's been almost a year since I last ran it, and I don't remember the details :)

The workflow is quite simple though:

  1. Collect some speech datasets (see dataset.py)
  2. Run the preprocess.py script to convert the raw audio to mel features, so the conversion doesn't have to happen on the fly during training
  3. Train the speaker encoder model
  4. Train the vocoder model
  5. Train the voice conversion model; this depends on a trained speaker encoder model
  6. Run the inference function in the train_vc.py script to do the conversion; this depends on all three models (see the sketch after this list)
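Putting it together, step 6 looks roughly like the sketch below. The checkpoint names and model call signatures are made up for illustration; the actual entry point is the inference function in train_vc.py, and the mel extraction mirrors what preprocess.py does offline in step 2:

```python
import librosa
import numpy as np
import torch

def wav_to_mel(path, sr=16000, n_mels=80):
    """Step 2's preprocessing, done here on the fly for a single file.
    The STFT parameters are illustrative; match hparams.py in practice."""
    wav, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024,
                                         hop_length=256, n_mels=n_mels)
    return torch.from_numpy(np.log(mel + 1e-5).T).float()  # (frames, n_mels)

# Hypothetical checkpoints from steps 3-5.
speaker_encoder = torch.load('speaker_encoder.pt')
vocoder = torch.load('vocoder.pt')
vc_model = torch.load('vc_model.pt')

src_mel = wav_to_mel('source.wav').unsqueeze(0)
tgt_mel = wav_to_mel('target_speaker_sample.wav').unsqueeze(0)

with torch.no_grad():
    src_emb = speaker_encoder(src_mel)               # who is speaking now
    tgt_emb = speaker_encoder(tgt_mel)               # who we want to sound like
    converted = vc_model(src_mel, src_emb, tgt_emb)  # step 6: conversion
    audio = vocoder.inverse(converted.transpose(1, 2))
```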

The speaker encoder model and the vocoder model should be trained on a large dataset combined from many corpora (see dataset.py for all the corpora I used).

The voice conversion model can be trained on a small number of speakers from one of the corpora; 120 speakers with 120 utterances per speaker is enough to get good performance.