Fish TTS API Fails to Match Reference Audio Tone and Style #836

AshutoshMipax · 2025-01-17T12:50:42Z

Self Checks

This template is only for bug reports. For questions, please visit Discussions.
I have thoroughly reviewed the project documentation (installation, training, inference) but couldn't find information to solve my problem.
I have searched for existing issues, including closed ones.
I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
Please do not modify this template and fill in all required fields.

Cloud or Self Hosted

Self Hosted (Source), Self Hosted (Docker)

Environment Details

Operating System: Windows 11 (fully updated)
Processor: Intel Core i5 13th Gen
GPU: NVIDIA RTX 4050
Python Version: Python 3.12
Relevant Libraries and Versions:
torch: 2.4.1
Gradio: 4.44.0
pydub: Latest version installed via pip
ffmpeg: Installed and accessible via system PATH (version: 2024-08-01-git)

Steps to Reproduce

Install Fish TTS and dependencies as per the documentation.
Run the following code to use the Fish TTS API (the code file is attached at the end of this report):

from gradio_client import Client, handle_file

client = Client("http://127.0.0.1:7860/")
result = client.predict(
    text="This is a test input.",
    normalize=True,
    reference_id="test_reference",
    reference_audio=handle_file(r"C:\Users\ashu4\Music\Sound\final_new_vocal.wav"),
    reference_text="",
    max_new_tokens=0,
    chunk_length=200,
    top_p=0.7,
    repetition_penalty=1.2,
    temperature=0.7,
    seed=0,
    use_memory_cache="on",
    api_name="/partial"
)
print(result)
Observe the results:
The generated audio chunks do not match the tone, speed, or style of the provided reference audio.
In some cases, the first chunk is synthesized as a female voice and the second as a male voice.
Stitch the chunks using the following code:
from pydub import AudioSegment

final_audio = AudioSegment.empty()
for chunk_path in ["chunk1.wav", "chunk2.wav"]:  # Replace with actual chunk paths
    final_audio += AudioSegment.from_file(chunk_path)
final_audio.export("final_output.wav", format="wav")
The final output is inconsistent and does not replicate the reference audio style.
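To rule out the stitching step as a source of the inconsistency, a sketch like the one below could confirm that every chunk shares the same channel count, sample width, and sample rate before concatenating. It uses only the standard-library wave module instead of pydub, so it assumes the chunks are plain PCM WAV files; concat_wavs is a hypothetical helper name, not part of Fish TTS.

```python
import wave


def concat_wavs(chunk_paths, out_path):
    """Concatenate PCM WAV files, refusing to mix differing formats.

    Returns the shared (channels, sample_width_bytes, sample_rate) tuple.
    """
    expected = None
    frames = []
    for path in chunk_paths:
        with wave.open(path, "rb") as wf:
            fmt = (wf.getnchannels(), wf.getsampwidth(), wf.getframerate())
            if expected is None:
                expected = fmt
            elif fmt != expected:
                raise ValueError(f"{path}: format {fmt} differs from {expected}")
            frames.append(wf.readframes(wf.getnframes()))
    if expected is None:
        raise ValueError("no chunks given")

    with wave.open(out_path, "wb") as out:
        out.setnchannels(expected[0])
        out.setsampwidth(expected[1])
        out.setframerate(expected[2])
        for data in frames:
            out.writeframes(data)
    return expected
```

Calling concat_wavs(["chunk1.wav", "chunk2.wav"], "final_output.wav") with the actual chunk paths would raise a ValueError if the chunks differ in format. If it succeeds and the voices still differ, the mismatch comes from generation rather than from stitching.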

fish.py.txt

✔️ Expected Behavior

The generated audio should replicate the tone, speed, and style of the reference audio provided in the reference_audio parameter.
All audio chunks should be consistent in voice, tone, and style.

❌ Actual Behavior

The generated audio:
Does not match the tone, speed, or style of the provided reference audio.
Is inconsistent between chunks (e.g., one chunk is in a male voice, another in a female voice).
With the same input, the Gradio UI produces far better results that match the reference audio, which suggests that the API may not be fully utilizing GPU resources or may not be processing the reference audio properly.
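Since the UI and the API give different results, one way to narrow this down is to compare the values sent through the API with the values shown in the UI. The sketch below is only an illustration: diff_settings is a hypothetical helper, the API values are copied from the call in this report, and the UI values are placeholders to be replaced with what the Gradio page actually shows.

```python
_MISSING = object()


def diff_settings(api_kwargs, ui_settings):
    """Return {name: (api_value, ui_value)} for every setting that differs."""
    diffs = {}
    for name in sorted(set(api_kwargs) | set(ui_settings)):
        api_value = api_kwargs.get(name, _MISSING)
        ui_value = ui_settings.get(name, _MISSING)
        if api_value != ui_value:
            diffs[name] = (
                "<not set>" if api_value is _MISSING else api_value,
                "<not set>" if ui_value is _MISSING else ui_value,
            )
    return diffs


# Values taken from the API call in this report.
api_kwargs = {
    "reference_text": "",
    "max_new_tokens": 0,
    "chunk_length": 200,
    "top_p": 0.7,
    "repetition_penalty": 1.2,
    "temperature": 0.7,
    "seed": 0,
}

# Hypothetical placeholder: replace with the values visible in the Gradio UI.
ui_settings = dict(api_kwargs, reference_text="<transcript typed in the UI>")

for name, (api_value, ui_value) in diff_settings(api_kwargs, ui_settings).items():
    print(f"{name}: API={api_value!r} UI={ui_value!r}")
```

Any setting that shows up in the output is a candidate explanation for the gap between the UI and API results and is worth aligning before looking at GPU usage.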

AshutoshMipax added the bug label Jan 17, 2025
