This is a fine-tuned version of the Facebook MMS (Massively Multilingual Speech) model for Swahili Text-to-Speech (TTS). The model was fine-tuned to improve Swahili pronunciation and performance using custom audio datasets.
- Model Name: Swahili MMS TTS - Finetuned
- Languages Supported: Swahili
- Base Model: Facebook MMS
- Use Case: Text-to-Speech for the Swahili language, suitable for generating high-quality speech from text.
Fine-tuning used a custom dataset of Swahili voice samples to improve the fluency and accuracy of the original MMS model, yielding better pronunciation and more natural-sounding Swahili speech.
The code and process used for fine-tuning are available in the GitHub repository.
You can load and use the model directly from the Hugging Face model hub, either through the pipeline API or by loading the model and tokenizer manually.
To load the model and tokenizer manually and run text-to-speech without the Hugging Face pipeline helper:
import torch
import numpy as np
import scipy.io.wavfile
from transformers import VitsModel, AutoTokenizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "Benjamin-png/swahili-mms-tts-finetuned"
text = "Habari, karibu kwenye mfumo wetu wa kusikiliza kwa Kiswahili."
audio_file_path = "swahili_speech.wav"
# Load model and tokenizer dynamically based on the provided model name
model = VitsModel.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Step 1: Tokenize the input text
inputs = tokenizer(text, return_tensors="pt").to(device)
# Step 2: Generate waveform
with torch.no_grad():
    output = model(**inputs).waveform
# Step 3: Convert PyTorch tensor to NumPy array
output_np = output.squeeze().cpu().numpy()
# Step 4: Write to WAV file
scipy.io.wavfile.write(audio_file_path, rate=model.config.sampling_rate, data=output_np)
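Note that scipy.io.wavfile.write picks the WAV sample format from the array's dtype, so a float32 waveform is stored as a 32-bit float WAV, which some players and tools do not accept. As a minimal sketch, assuming mono samples in the range [-1.0, 1.0], the waveform could be converted to 16-bit PCM using only the standard library. The helper names float_to_pcm16 and write_pcm16_wav and the output file name are hypothetical and not part of this model's repository:

```python
import array
import sys
import wave


def float_to_pcm16(samples):
    """Clip floats to [-1.0, 1.0] and scale them to signed 16-bit integers."""
    pcm = array.array("h")
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))
        pcm.append(int(round(clipped * 32767)))
    return pcm


def write_pcm16_wav(path, samples, sampling_rate):
    """Write mono float samples to a 16-bit PCM WAV file (hypothetical helper)."""
    pcm = float_to_pcm16(samples)
    # WAV data is little-endian; swap bytes on big-endian machines.
    if sys.byteorder == "big":
        pcm.byteswap()
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)        # mono
        wav_file.setsampwidth(2)        # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(pcm.tobytes())
```

With the variables from the example above, this could be called as write_pcm16_wav("swahili_speech_pcm16.wav", output_np, model.config.sampling_rate).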
Alternatively, use the pipeline API:
from transformers import pipeline
# Load the fine-tuned model
tts = pipeline("text-to-speech", model="Benjamin-png/swahili-mms-tts-finetuned")
# Generate speech from text
speech = tts("Habari, karibu kwenye mfumo wetu wa kusikiliza kwa Kiswahili.")
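To save the audio, write the array returned by the pipeline to a WAV file with soundfile:
import soundfile as sf
# The pipeline returns a dict with "audio" (a NumPy array) and "sampling_rate"
# squeeze() drops any leading batch dimension so soundfile receives a 1-D mono signal
sf.write("swahili_speech.wav", speech["audio"].squeeze(), speech["sampling_rate"])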
You can play the audio using pydub:
from pydub import AudioSegment
from pydub.playback import play
# Load and play the generated audio
audio = AudioSegment.from_wav("swahili_speech.wav")
play(audio)
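Before playback, it can help to confirm that the file has the expected sampling rate and a plausible length. The sketch below reads the WAV header with Python's standard wave module. wav_info is a hypothetical helper, and it assumes an integer PCM file such as the 16-bit WAV that soundfile writes by default; the wave module may reject 32-bit float WAVs.

```python
import wave


def wav_info(path):
    """Return channel count, sampling rate, and duration in seconds of a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        n_frames = wav_file.getnframes()
        rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "sampling_rate": rate,
            "duration_seconds": n_frames / rate,
        }


# Example call (assumes the file from the saving step exists):
# info = wav_info("swahili_speech.wav")
```

A duration near zero, or a sampling rate that differs from the model's configured rate, usually points to a problem in the saving step.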
Make sure to install the required libraries:
pip install torch transformers numpy soundfile scipy pydub
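To confirm the installation before running the examples, a small check such as the following sketch can list any packages that are still missing. The helper name missing_packages is hypothetical; the package list mirrors the pip command above and assumes each package's import name matches its pip name, which holds for these six.

```python
import importlib.util

# Import names of the libraries used in the examples above.
REQUIRED = ["torch", "transformers", "numpy", "soundfile", "scipy", "pydub"]


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```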
If you're interested in reproducing the fine-tuning process or adapting the model for similar purposes, the Google Colab notebook outlines the entire process, including detailed steps on how to fine-tune the MMS model for Swahili TTS.
For further exploration and code snippets, visit the Source where you’ll find additional scripts, datasets, and instructions for customizing the model.
This project is licensed under the terms of the Apache License 2.0.