Merge pull request #207 from idiap/docs
Improve documentation
eginhard authored Dec 12, 2024
2 parents f329072 + e38dcbe commit cd52907
Showing 36 changed files with 562 additions and 690 deletions.
24 changes: 10 additions & 14 deletions CONTRIBUTING.md
@@ -11,30 +11,25 @@ You can contribute not only with code but with bug reports, comments, questions,

If you'd like to contribute code or squash a bug but don't know where to start, here are some pointers.

- [Development Road Map](https://github.com/coqui-ai/TTS/issues/378)

You can pick something out of our road map. We keep the progress of the project in this simple issue thread, which includes new model proposals, development updates, etc.

- [Github Issues Tracker](https://github.com/idiap/coqui-ai-TTS/issues)

This is the place to find feature requests and bug reports.

Issues with the ```good first issue``` tag are a good place for beginners to take on.

- **PR** [pages](https://github.com/idiap/coqui-ai-TTS/pulls) with the ```🚀new version``` tag.

We list all the target improvements for the next version. You can pick one of them and start contributing.
Issues with the ```good first issue``` tag are a good place for beginners to
take on. Issues tagged with `help wanted` are suited for more experienced
outside contributors.

- Also feel free to suggest new features, ideas and models. We're always open to new things.

## Call for sharing language models
## Call for sharing pretrained models
If possible, please consider sharing your pre-trained models in any language (if the licences allow you to do so). We will include them in our model catalogue for public use and give the proper attribution, whether it be your name, company, website or any other source specified.

This model can be shared in two ways:
1. Share the model files with us and we will serve them with the next 🐸 TTS release.
2. Upload your models on GDrive and share the link.

Models are served under the `.models.json` file and any model is available via the TTS CLI or Server end points.
Models are served under the `.models.json` file and any model is available via the TTS CLI and Python API end points.
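
For example, once a model is listed in `.models.json`, it can be loaded through the Python API. The snippet below is only a minimal sketch (the model name is illustrative):

```python
from TTS.api import TTS

# Load a released model by its catalogue name (illustrative name).
tts = TTS(model_name="tts_models/en/ljspeech/glow-tts")

# Synthesize speech and write it to a wav file.
tts.tts_to_file(text="Hello from a shared model!", file_path="output.wav")
```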

Either way you choose, please make sure you send the models [here](https://github.com/coqui-ai/TTS/discussions/930).

@@ -135,17 +130,18 @@ curl -LsSf https://astral.sh/uv/install.sh | sh
13. Let's discuss until it is perfect. 💪

We might ask you for certain changes that would appear in the ✨**PR**'s page under 🐸TTS[https://github.com/idiap/coqui-ai-TTS/pulls].
We might ask you for certain changes that would appear in the
[Github ✨**PR**'s page](https://github.com/idiap/coqui-ai-TTS/pulls).
14. Once things look perfect, we merge it into the ```dev``` branch and make it ready for the next version.
## Development in Docker container
If you prefer working within a Docker container as your development environment, you can do the following:
1. Fork 🐸TTS[https://github.com/idiap/coqui-ai-TTS] by clicking the fork button at the top right corner of the project page.
1. Fork the 🐸TTS [Github repository](https://github.com/idiap/coqui-ai-TTS) by clicking the fork button at the top right corner of the page.
2. Clone 🐸TTS and add the main repo as a new remote named ```upsteam```.
2. Clone 🐸TTS and add the main repo as a new remote named ```upstream```.
```bash
git clone [email protected]:<your Github name>/coqui-ai-TTS.git
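# (Sketch) Add the main repo as the `upstream` remote mentioned in step 2:
git remote add upstream https://github.com/idiap/coqui-ai-TTS.git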
5 changes: 1 addition & 4 deletions Makefile
@@ -59,9 +59,6 @@ lint: ## run linters.
system-deps: ## install linux system deps
sudo apt-get install -y libsndfile1-dev

build-docs: ## build the docs
cd docs && make clean && make build

install: ## install 🐸 TTS
uv sync --all-extras

@@ -70,4 +67,4 @@ install_dev: ## install 🐸 TTS for development.
uv run pre-commit install

docs: ## build the docs
$(MAKE) -C docs clean && $(MAKE) -C docs html
uv run --group docs $(MAKE) -C docs clean && uv run --group docs $(MAKE) -C docs html
350 changes: 165 additions & 185 deletions README.md

Large diffs are not rendered by default.

129 changes: 64 additions & 65 deletions TTS/bin/synthesize.py
@@ -14,123 +14,122 @@
logger = logging.getLogger(__name__)

description = """
Synthesize speech on command line.
Synthesize speech on the command line.
You can either use your trained model or choose a model from the provided list.
If you don't specify any models, then it uses the LJSpeech-based English model.
#### Single Speaker Models
- List provided models:
```sh
tts --list_models
```
```
$ tts --list_models
```
- Get model info (for both tts_models and vocoder_models):
- Query by type/name:
The model_info_by_name uses the name as it appears in the output of --list_models.
```
$ tts --model_info_by_name "<model_type>/<language>/<dataset>/<model_name>"
```
For example:
```
$ tts --model_info_by_name tts_models/tr/common-voice/glow-tts
$ tts --model_info_by_name vocoder_models/en/ljspeech/hifigan_v2
```
- Query by type/idx:
The model_query_idx uses the corresponding idx from --list_models.
```
$ tts --model_info_by_idx "<model_type>/<model_query_idx>"
```
For example:
```
$ tts --model_info_by_idx tts_models/3
```
- Get model information. Use the names obtained from `--list_models`.
```sh
tts --model_info_by_name "<model_type>/<language>/<dataset>/<model_name>"
```
For example:
```sh
tts --model_info_by_name tts_models/tr/common-voice/glow-tts
tts --model_info_by_name vocoder_models/en/ljspeech/hifigan_v2
```
- Query model info by full name:
```
$ tts --model_info_by_name "<model_type>/<language>/<dataset>/<model_name>"
```
#### Single Speaker Models
- Run TTS with default models:
- Run TTS with the default model (`tts_models/en/ljspeech/tacotron2-DDC`):
```
$ tts --text "Text for TTS" --out_path output/path/speech.wav
```sh
tts --text "Text for TTS" --out_path output/path/speech.wav
```
- Run TTS and pipe out the generated TTS wav file data:
```
$ tts --text "Text for TTS" --pipe_out --out_path output/path/speech.wav | aplay
```sh
tts --text "Text for TTS" --pipe_out --out_path output/path/speech.wav | aplay
```
- Run a TTS model with its default vocoder model:
```
$ tts --text "Text for TTS" --model_name "<model_type>/<language>/<dataset>/<model_name>" --out_path output/path/speech.wav
```sh
tts --text "Text for TTS" \\
--model_name "<model_type>/<language>/<dataset>/<model_name>" \\
--out_path output/path/speech.wav
```
For example:
```
$ tts --text "Text for TTS" --model_name "tts_models/en/ljspeech/glow-tts" --out_path output/path/speech.wav
```sh
tts --text "Text for TTS" \\
--model_name "tts_models/en/ljspeech/glow-tts" \\
--out_path output/path/speech.wav
```
- Run with specific TTS and vocoder models from the list:
- Run with specific TTS and vocoder models from the list. Note that not every vocoder is compatible with every TTS model.
```
$ tts --text "Text for TTS" --model_name "<model_type>/<language>/<dataset>/<model_name>" --vocoder_name "<model_type>/<language>/<dataset>/<model_name>" --out_path output/path/speech.wav
```sh
tts --text "Text for TTS" \\
--model_name "<model_type>/<language>/<dataset>/<model_name>" \\
--vocoder_name "<model_type>/<language>/<dataset>/<model_name>" \\
--out_path output/path/speech.wav
```
For example:
```
$ tts --text "Text for TTS" --model_name "tts_models/en/ljspeech/glow-tts" --vocoder_name "vocoder_models/en/ljspeech/univnet" --out_path output/path/speech.wav
```sh
tts --text "Text for TTS" \\
--model_name "tts_models/en/ljspeech/glow-tts" \\
--vocoder_name "vocoder_models/en/ljspeech/univnet" \\
--out_path output/path/speech.wav
```
- Run your own TTS model (Using Griffin-Lim Vocoder):
- Run your own TTS model (using Griffin-Lim Vocoder):
```
$ tts --text "Text for TTS" --model_path path/to/model.pth --config_path path/to/config.json --out_path output/path/speech.wav
```sh
tts --text "Text for TTS" \\
--model_path path/to/model.pth \\
--config_path path/to/config.json \\
--out_path output/path/speech.wav
```
- Run your own TTS and Vocoder models:
```
$ tts --text "Text for TTS" --model_path path/to/model.pth --config_path path/to/config.json --out_path output/path/speech.wav
--vocoder_path path/to/vocoder.pth --vocoder_config_path path/to/vocoder_config.json
```
```sh
tts --text "Text for TTS" \\
--model_path path/to/model.pth \\
--config_path path/to/config.json \\
--out_path output/path/speech.wav \\
--vocoder_path path/to/vocoder.pth \\
--vocoder_config_path path/to/vocoder_config.json
```
#### Multi-speaker Models
- List the available speakers and choose a <speaker_id> among them:
- List the available speakers and choose a `<speaker_id>` among them:
```
$ tts --model_name "<language>/<dataset>/<model_name>" --list_speaker_idxs
```
```sh
tts --model_name "<language>/<dataset>/<model_name>" --list_speaker_idxs
```
- Run the multi-speaker TTS model with the target speaker ID:
```
$ tts --text "Text for TTS." --out_path output/path/speech.wav --model_name "<language>/<dataset>/<model_name>" --speaker_idx <speaker_id>
```sh
tts --text "Text for TTS." --out_path output/path/speech.wav \\
--model_name "<language>/<dataset>/<model_name>" --speaker_idx <speaker_id>
```
- Run your own multi-speaker TTS model:
```
$ tts --text "Text for TTS" --out_path output/path/speech.wav --model_path path/to/model.pth --config_path path/to/config.json --speakers_file_path path/to/speaker.json --speaker_idx <speaker_id>
```sh
tts --text "Text for TTS" --out_path output/path/speech.wav \\
--model_path path/to/model.pth --config_path path/to/config.json \\
--speakers_file_path path/to/speaker.json --speaker_idx <speaker_id>
```
### Voice Conversion Models
#### Voice Conversion Models
```
$ tts --out_path output/path/speech.wav --model_name "<language>/<dataset>/<model_name>" --source_wav <path/to/speaker/wav> --target_wav <path/to/reference/wav>
```
```sh
tts --out_path output/path/speech.wav --model_name "<language>/<dataset>/<model_name>" \\
--source_wav <path/to/speaker/wav> --target_wav <path/to/reference/wav>
```
"""

2 changes: 1 addition & 1 deletion TTS/model.py
@@ -12,7 +12,7 @@
class BaseTrainerModel(TrainerModel):
"""BaseTrainerModel model expanding TrainerModel with required functions by 🐸TTS.
Every new 🐸TTS model must inherit it.
Every new Coqui model must inherit it.
"""

@staticmethod
10 changes: 6 additions & 4 deletions TTS/tts/models/bark.py
@@ -206,12 +206,14 @@ def synthesize(
speaker_wav (str): Path to the speaker audio file for cloning a new voice. It is cloned and saved in
`voice_dirs` with the name `speaker_id`. Defaults to None.
voice_dirs (List[str]): List of paths that host reference audio files for speakers. Defaults to None.
**kwargs: Model specific inference settings used by `generate_audio()` and `TTS.tts.layers.bark.inference_funcs.generate_text_semantic().
**kwargs: Model specific inference settings used by `generate_audio()` and
`TTS.tts.layers.bark.inference_funcs.generate_text_semantic()`.
Returns:
A dictionary of the output values with `wav` as output waveform, `deterministic_seed` as seed used at inference,
`text_input` as text token IDs after tokenizer, `voice_samples` as samples used for cloning, `conditioning_latents`
as latents used at inference.
"""
speaker_id = "random" if speaker_id is None else speaker_id
10 changes: 6 additions & 4 deletions TTS/tts/models/base_tts.py
@@ -80,15 +80,17 @@ def _set_model_args(self, config: Coqpit):
raise ValueError("config must be either a *Config or *Args")

def init_multispeaker(self, config: Coqpit, data: List = None):
"""Initialize a speaker embedding layer if needen and define expected embedding channel size for defining
`in_channels` size of the connected layers.
"""Set up for multi-speaker TTS.
Initialize a speaker embedding layer if needed and define expected embedding
channel size for defining `in_channels` size of the connected layers.
This implementation yields 3 possible outcomes:
1. If `config.use_speaker_embedding` and `config.use_d_vector_file are False, do nothing.
1. If `config.use_speaker_embedding` and `config.use_d_vector_file` are False, do nothing.
2. If `config.use_d_vector_file` is True, set expected embedding channel size to `config.d_vector_dim` or 512.
3. If `config.use_speaker_embedding`, initialize a speaker embedding layer with channel size of
`config.d_vector_dim` or 512.
You can override this function for new models.
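Example outline of the three outcomes (sketch only; attribute names are illustrative)::

    if not config.use_speaker_embedding and not config.use_d_vector_file:
        pass  # single-speaker model, nothing to initialize
    elif config.use_d_vector_file:
        self.embedded_speaker_dim = config.d_vector_dim or 512
    else:
        self.emb_g = nn.Embedding(self.num_speakers, config.d_vector_dim or 512)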
45 changes: 23 additions & 22 deletions TTS/tts/models/overflow.py
@@ -33,32 +33,33 @@ class Overflow(BaseTTS):
Paper abstract::
Neural HMMs are a type of neural transducer recently proposed for
sequence-to-sequence modelling in text-to-speech. They combine the best features
of classic statistical speech synthesis and modern neural TTS, requiring less
data and fewer training updates, and are less prone to gibberish output caused
by neural attention failures. In this paper, we combine neural HMM TTS with
normalising flows for describing the highly non-Gaussian distribution of speech
acoustics. The result is a powerful, fully probabilistic model of durations and
acoustics that can be trained using exact maximum likelihood. Compared to
dominant flow-based acoustic models, our approach integrates autoregression for
improved modelling of long-range dependences such as utterance-level prosody.
Experiments show that a system based on our proposal gives more accurate
pronunciations and better subjective speech quality than comparable methods,
whilst retaining the original advantages of neural HMMs. Audio examples and code
are available at https://shivammehta25.github.io/OverFlow/.
Note:
- Neural HMMs use flat start initialization, i.e. the means, standard deviations and transition
probabilities of the dataset are computed and used to initialize the model. This benefits the
model and helps with faster learning. If you change the dataset or want to regenerate the
parameters, change `force_generate_statistics` and `mel_statistics_parameter_path` accordingly.
- To enable multi-GPU training, set `use_grad_checkpointing=False` in the config.
This will significantly increase memory usage, because to compute the actual data
likelihood (not an approximation using MAS/Viterbi) we must use all the states at the
previous time step during the forward pass to decide the probability distribution at
the current step, i.e. the difference between the forward algorithm and the Viterbi approximation.
Check :class:`TTS.tts.configs.overflow.OverFlowConfig` for class arguments.
"""
6 changes: 4 additions & 2 deletions TTS/tts/models/tortoise.py
@@ -423,7 +423,9 @@ def get_conditioning_latents(
Transforms one or more voice_samples into a tuple (autoregressive_conditioning_latent, diffusion_conditioning_latent).
These are expressive learned latents that encode aspects of the provided clips like voice, intonation, and acoustic
properties.
:param voice_samples: List of arbitrary reference clips, which should be *pairs* of torch tensors containing arbitrary kHz waveform data.
:param latent_averaging_mode: 0/1/2 for following modes:
0 - latents will be generated as in original tortoise, using ~4.27s from each voice sample, averaging latent across all samples
1 - latents will be generated using (almost) entire voice samples, averaged across all the ~4.27s chunks
@@ -671,7 +673,7 @@ def inference(
As cond_free_k increases, the output becomes dominated by the conditioning-free signal.
diffusion_temperature: (float) Controls the variance of the noise fed into the diffusion model. [0,1]. Values at 0
are the "mean" prediction of the diffusion network and will sound bland and smeared.
hf_generate_kwargs: (**kwargs) The huggingface Transformers generate API is used for the autoregressive transformer.
hf_generate_kwargs: (`**kwargs`) The huggingface Transformers generate API is used for the autoregressive transformer.
Extra keyword args fed to this function get forwarded directly to that API. Documentation
here: https://huggingface.co/docs/transformers/internal/generation_utils