Merge pull request #207 from idiap/docs
Improve documentation
eginhard authored Dec 12, 2024
2 parents f329072 + e38dcbe commit cd52907
Showing 36 changed files with 562 additions and 690 deletions.
24 changes: 10 additions & 14 deletions CONTRIBUTING.md
@@ -11,30 +11,25 @@ You can contribute not only with code but with bug reports, comments, questions,

If you'd like to contribute code or squash a bug but don't know where to start, here are some pointers.

- [Development Road Map](https://github.com/coqui-ai/TTS/issues/378)

You can pick something out of our road map. We keep the progress of the project in this simple issue thread, which includes new model proposals, development updates, etc.

- [Github Issues Tracker](https://github.com/idiap/coqui-ai-TTS/issues)

This is the place to find feature requests and bug reports.

Issues with the ```good first issue``` tag are a good place for beginners to take on.

- **PR** [pages](https://github.com/idiap/coqui-ai-TTS/pulls) with the ```🚀new version``` tag.

We list all the target improvements for the next version. You can pick one of them and start contributing.
Issues with the ```good first issue``` tag are a good place for beginners to
take on. Issues tagged with `help wanted` are suited for more experienced
outside contributors.

- Also feel free to suggest new features, ideas and models. We're always open to new things.

## Call for sharing language models
## Call for sharing pretrained models
If possible, please consider sharing your pre-trained models in any language (if the licences allow you to do so). We will include them in our model catalogue for public use and give the proper attribution, whether it be your name, company, website or any other source specified.

This model can be shared in two ways:
1. Share the model files with us and we will serve them with the next 🐸 TTS release.
2. Upload your models on GDrive and share the link.

Models are served under the `.models.json` file and any model is available via the TTS CLI or Server end points.
Models are served under the `.models.json` file and any model is available via the TTS CLI and Python API end points.
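
For example, once a model is listed in `.models.json`, it can be loaded through the Python API. The snippet below is only a minimal sketch (the model name is illustrative):

```python
from TTS.api import TTS

# Load a released model by its catalogue name (illustrative name).
tts = TTS(model_name="tts_models/en/ljspeech/glow-tts")

# Synthesize speech and write it to a wav file.
tts.tts_to_file(text="Hello from a shared model!", file_path="output.wav")
```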

Either way you choose, please make sure you send the models [here](https://github.com/coqui-ai/TTS/discussions/930).

@@ -135,17 +130,18 @@ curl -LsSf https://astral.sh/uv/install.sh | sh
13. Let's discuss until it is perfect. 💪

We might ask you for certain changes that would appear in the ✨**PR**'s page under 🐸TTS[https://github.com/idiap/coqui-ai-TTS/pulls].
We might ask you for certain changes that would appear in the
[Github ✨**PR**'s page](https://github.com/idiap/coqui-ai-TTS/pulls).
14. Once things look perfect, we merge it into the ```dev``` branch and make it ready for the next version.
## Development in Docker container
If you prefer working within a Docker container as your development environment, you can do the following:
1. Fork 🐸TTS[https://github.com/idiap/coqui-ai-TTS] by clicking the fork button at the top right corner of the project page.
1. Fork the 🐸TTS [Github repository](https://github.com/idiap/coqui-ai-TTS) by clicking the fork button at the top right corner of the page.
2. Clone 🐸TTS and add the main repo as a new remote named ```upsteam```.
2. Clone 🐸TTS and add the main repo as a new remote named ```upstream```.
```bash
git clone [email protected]:<your Github name>/coqui-ai-TTS.git
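# (Sketch) Add the main repo as the `upstream` remote mentioned in step 2:
git remote add upstream https://github.com/idiap/coqui-ai-TTS.git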
5 changes: 1 addition & 4 deletions Makefile
@@ -59,9 +59,6 @@ lint: ## run linters.
system-deps: ## install linux system deps
sudo apt-get install -y libsndfile1-dev

build-docs: ## build the docs
cd docs && make clean && make build

install: ## install 🐸 TTS
uv sync --all-extras

@@ -70,4 +67,4 @@ install_dev: ## install 🐸 TTS for development.
uv run pre-commit install

docs: ## build the docs
$(MAKE) -C docs clean && $(MAKE) -C docs html
uv run --group docs $(MAKE) -C docs clean && uv run --group docs $(MAKE) -C docs html
350 changes: 165 additions & 185 deletions README.md

Large diffs are not rendered by default.

129 changes: 64 additions & 65 deletions TTS/bin/synthesize.py
@@ -14,123 +14,122 @@
logger = logging.getLogger(__name__)

description = """
Synthesize speech on command line.
Synthesize speech on the command line.
You can either use your trained model or choose a model from the provided list.
If you don't specify any models, then it uses the LJSpeech-based English model.
#### Single Speaker Models
- List provided models:
```sh
tts --list_models
```
```
$ tts --list_models
```
- Get model info (for both tts_models and vocoder_models):
- Query by type/name:
The model_info_by_name uses the name as it appears in the output of --list_models.
```
$ tts --model_info_by_name "<model_type>/<language>/<dataset>/<model_name>"
```
For example:
```
$ tts --model_info_by_name tts_models/tr/common-voice/glow-tts
$ tts --model_info_by_name vocoder_models/en/ljspeech/hifigan_v2
```
- Query by type/idx:
The model_query_idx uses the corresponding idx from --list_models.
```
$ tts --model_info_by_idx "<model_type>/<model_query_idx>"
```
For example:
```
$ tts --model_info_by_idx tts_models/3
```
- Get model information. Use the names obtained from `--list_models`.
```sh
tts --model_info_by_name "<model_type>/<language>/<dataset>/<model_name>"
```
For example:
```sh
tts --model_info_by_name tts_models/tr/common-voice/glow-tts
tts --model_info_by_name vocoder_models/en/ljspeech/hifigan_v2
```
- Query model info by full name:
```
$ tts --model_info_by_name "<model_type>/<language>/<dataset>/<model_name>"
```
#### Single Speaker Models
- Run TTS with default models:
- Run TTS with the default model (`tts_models/en/ljspeech/tacotron2-DDC`):
```
$ tts --text "Text for TTS" --out_path output/path/speech.wav
```sh
tts --text "Text for TTS" --out_path output/path/speech.wav
```
- Run TTS and pipe out the generated TTS wav file data:
```
$ tts --text "Text for TTS" --pipe_out --out_path output/path/speech.wav | aplay
```sh
tts --text "Text for TTS" --pipe_out --out_path output/path/speech.wav | aplay
```
- Run a TTS model with its default vocoder model:
```
$ tts --text "Text for TTS" --model_name "<model_type>/<language>/<dataset>/<model_name>" --out_path output/path/speech.wav
```sh
tts --text "Text for TTS" \\
--model_name "<model_type>/<language>/<dataset>/<model_name>" \\
--out_path output/path/speech.wav
```
For example:
```
$ tts --text "Text for TTS" --model_name "tts_models/en/ljspeech/glow-tts" --out_path output/path/speech.wav
```sh
tts --text "Text for TTS" \\
--model_name "tts_models/en/ljspeech/glow-tts" \\
--out_path output/path/speech.wav
```
- Run with specific TTS and vocoder models from the list:
- Run with specific TTS and vocoder models from the list. Note that not every vocoder is compatible with every TTS model.
```
$ tts --text "Text for TTS" --model_name "<model_type>/<language>/<dataset>/<model_name>" --vocoder_name "<model_type>/<language>/<dataset>/<model_name>" --out_path output/path/speech.wav
```sh
tts --text "Text for TTS" \\
--model_name "<model_type>/<language>/<dataset>/<model_name>" \\
--vocoder_name "<model_type>/<language>/<dataset>/<model_name>" \\
--out_path output/path/speech.wav
```
For example:
```
$ tts --text "Text for TTS" --model_name "tts_models/en/ljspeech/glow-tts" --vocoder_name "vocoder_models/en/ljspeech/univnet" --out_path output/path/speech.wav
```sh
tts --text "Text for TTS" \\
--model_name "tts_models/en/ljspeech/glow-tts" \\
--vocoder_name "vocoder_models/en/ljspeech/univnet" \\
--out_path output/path/speech.wav
```
- Run your own TTS model (Using Griffin-Lim Vocoder):
- Run your own TTS model (using Griffin-Lim Vocoder):
```
$ tts --text "Text for TTS" --model_path path/to/model.pth --config_path path/to/config.json --out_path output/path/speech.wav
```sh
tts --text "Text for TTS" \\
--model_path path/to/model.pth \\
--config_path path/to/config.json \\
--out_path output/path/speech.wav
```
- Run your own TTS and Vocoder models:
```
$ tts --text "Text for TTS" --model_path path/to/model.pth --config_path path/to/config.json --out_path output/path/speech.wav
--vocoder_path path/to/vocoder.pth --vocoder_config_path path/to/vocoder_config.json
```
```sh
tts --text "Text for TTS" \\
--model_path path/to/model.pth \\
--config_path path/to/config.json \\
--out_path output/path/speech.wav \\
--vocoder_path path/to/vocoder.pth \\
--vocoder_config_path path/to/vocoder_config.json
```
#### Multi-speaker Models
- List the available speakers and choose a <speaker_id> among them:
- List the available speakers and choose a `<speaker_id>` among them:
```
$ tts --model_name "<language>/<dataset>/<model_name>" --list_speaker_idxs
```
```sh
tts --model_name "<language>/<dataset>/<model_name>" --list_speaker_idxs
```
- Run the multi-speaker TTS model with the target speaker ID:
```
$ tts --text "Text for TTS." --out_path output/path/speech.wav --model_name "<language>/<dataset>/<model_name>" --speaker_idx <speaker_id>
```sh
tts --text "Text for TTS." --out_path output/path/speech.wav \\
--model_name "<language>/<dataset>/<model_name>" --speaker_idx <speaker_id>
```
- Run your own multi-speaker TTS model:
```
$ tts --text "Text for TTS" --out_path output/path/speech.wav --model_path path/to/model.pth --config_path path/to/config.json --speakers_file_path path/to/speaker.json --speaker_idx <speaker_id>
```sh
tts --text "Text for TTS" --out_path output/path/speech.wav \\
--model_path path/to/model.pth --config_path path/to/config.json \\
--speakers_file_path path/to/speaker.json --speaker_idx <speaker_id>
```
### Voice Conversion Models
#### Voice Conversion Models
```
$ tts --out_path output/path/speech.wav --model_name "<language>/<dataset>/<model_name>" --source_wav <path/to/speaker/wav> --target_wav <path/to/reference/wav>
```
```sh
tts --out_path output/path/speech.wav --model_name "<language>/<dataset>/<model_name>" \\
--source_wav <path/to/speaker/wav> --target_wav <path/to/reference/wav>
```
"""

2 changes: 1 addition & 1 deletion TTS/model.py
@@ -12,7 +12,7 @@
class BaseTrainerModel(TrainerModel):
"""BaseTrainerModel model expanding TrainerModel with required functions by 🐸TTS.
Every new 🐸TTS model must inherit it.
Every new Coqui model must inherit it.
"""

@staticmethod
10 changes: 6 additions & 4 deletions TTS/tts/models/bark.py
@@ -206,12 +206,14 @@ def synthesize(
speaker_wav (str): Path to the speaker audio file for cloning a new voice. It is cloned and saved in
`voice_dirs` with the name `speaker_id`. Defaults to None.
voice_dirs (List[str]): List of paths that host reference audio files for speakers. Defaults to None.
**kwargs: Model specific inference settings used by `generate_audio()` and `TTS.tts.layers.bark.inference_funcs.generate_text_semantic().
**kwargs: Model specific inference settings used by `generate_audio()` and
`TTS.tts.layers.bark.inference_funcs.generate_text_semantic()`.
Returns:
A dictionary of the output values with `wav` as output waveform, `deterministic_seed` as seed used at inference,
`text_input` as text token IDs after tokenizer, `voice_samples` as samples used for cloning, `conditioning_latents`
as latents used at inference.
"""
speaker_id = "random" if speaker_id is None else speaker_id
10 changes: 6 additions & 4 deletions TTS/tts/models/base_tts.py
@@ -80,15 +80,17 @@ def _set_model_args(self, config: Coqpit):
raise ValueError("config must be either a *Config or *Args")

def init_multispeaker(self, config: Coqpit, data: List = None):
"""Initialize a speaker embedding layer if needen and define expected embedding channel size for defining
`in_channels` size of the connected layers.
"""Set up for multi-speaker TTS.
Initialize a speaker embedding layer if needed and define expected embedding
channel size for defining `in_channels` size of the connected layers.
This implementation yields 3 possible outcomes:
1. If `config.use_speaker_embedding` and `config.use_d_vector_file are False, do nothing.
1. If `config.use_speaker_embedding` and `config.use_d_vector_file` are False, do nothing.
2. If `config.use_d_vector_file` is True, set expected embedding channel size to `config.d_vector_dim` or 512.
3. If `config.use_speaker_embedding`, initialize a speaker embedding layer with channel size of
`config.d_vector_dim` or 512.
You can override this function for new models.
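Example outline of the three outcomes (sketch only; attribute names are illustrative)::

    if not config.use_speaker_embedding and not config.use_d_vector_file:
        pass  # single-speaker model, nothing to initialize
    elif config.use_d_vector_file:
        self.embedded_speaker_dim = config.d_vector_dim or 512
    else:
        self.emb_g = nn.Embedding(self.num_speakers, config.d_vector_dim or 512)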
45 changes: 23 additions & 22 deletions TTS/tts/models/overflow.py
@@ -33,32 +33,33 @@ class Overflow(BaseTTS):
Paper abstract::
Neural HMMs are a type of neural transducer recently proposed for
sequence-to-sequence modelling in text-to-speech. They combine the best features
of classic statistical speech synthesis and modern neural TTS, requiring less
data and fewer training updates, and are less prone to gibberish output caused
by neural attention failures. In this paper, we combine neural HMM TTS with
normalising flows for describing the highly non-Gaussian distribution of speech
acoustics. The result is a powerful, fully probabilistic model of durations and
acoustics that can be trained using exact maximum likelihood. Compared to
dominant flow-based acoustic models, our approach integrates autoregression for
improved modelling of long-range dependences such as utterance-level prosody.
Experiments show that a system based on our proposal gives more accurate
pronunciations and better subjective speech quality than comparable methods,
whilst retaining the original advantages of neural HMMs. Audio examples and code
are available at https://shivammehta25.github.io/OverFlow/.
Note:
- Neural HMMs use flat start initialization, i.e. the means, standard deviations and transition
probabilities of the dataset are computed and used to initialize the model. This benefits the
model and helps with faster learning. If you change the dataset or want to regenerate the
parameters, change `force_generate_statistics` and `mel_statistics_parameter_path` accordingly.
- To enable multi-GPU training, set `use_grad_checkpointing=False` in the config.
This will significantly increase memory usage, because to compute the actual data
likelihood (not an approximation using MAS/Viterbi) we must use all the states at the
previous time step during the forward pass to decide the probability distribution at
the current step, i.e. the difference between the forward algorithm and the Viterbi approximation.
Check :class:`TTS.tts.configs.overflow.OverFlowConfig` for class arguments.
"""
6 changes: 4 additions & 2 deletions TTS/tts/models/tortoise.py
@@ -423,7 +423,9 @@ def get_conditioning_latents(
Transforms one or more voice_samples into a tuple (autoregressive_conditioning_latent, diffusion_conditioning_latent).
These are expressive learned latents that encode aspects of the provided clips like voice, intonation, and acoustic
properties.
:param voice_samples: List of arbitrary reference clips, which should be *pairs* of torch tensors containing arbitrary kHz waveform data.
:param latent_averaging_mode: 0/1/2 for following modes:
0 - latents will be generated as in original tortoise, using ~4.27s from each voice sample, averaging latent across all samples
1 - latents will be generated using (almost) entire voice samples, averaged across all the ~4.27s chunks
@@ -671,7 +673,7 @@ def inference(
As cond_free_k increases, the output becomes dominated by the conditioning-free signal.
diffusion_temperature: (float) Controls the variance of the noise fed into the diffusion model. [0,1]. Values at 0
are the "mean" prediction of the diffusion network and will sound bland and smeared.
hf_generate_kwargs: (**kwargs) The huggingface Transformers generate API is used for the autoregressive transformer.
hf_generate_kwargs: (`**kwargs`) The huggingface Transformers generate API is used for the autoregressive transformer.
Extra keyword args fed to this function get forwarded directly to that API. Documentation
here: https://huggingface.co/docs/transformers/internal/generation_utils