planning: ichigo framework Technical Report #171

Open
PodsAreAllYouNeed opened this issue Jan 23, 2025 · 1 comment
PodsAreAllYouNeed commented Jan 23, 2025

Overall Design Pattern

[Figure: overall design pattern diagram showing the four primitive models and the quantizer]

For the speech portion of the framework, we need to design around these four primitive models and the quantizer. Our research strategy will be built around this structure, and the package will follow it as well.

This is not how existing packages are currently designed; for example, Whisper is a single model.

To fit within our framework, we need to break it into:

  • Whisper Encoder = s2r
  • Whisper Decoder = r2t

In implementation, examples of how this will work are:

```python
# ASR task
transcription = r2t(s2r(audio))

# TTS task
speech = r2s(t2r(text))
```

How this will be implemented

r2t, t2r, r2s, and s2r will all inherit from nn.Module and implement forward methods that do exactly what their names say. They are generic classes, and the specific implementation should be handled through a YAML config.
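A minimal sketch of what two of these generic wrappers might look like. The class names, constructor signatures, and the idea of passing the concrete encoder/decoder directly are illustrative assumptions only; in the actual design the concrete model would be selected by the YAML config.

```python
import torch
import torch.nn as nn


class S2R(nn.Module):
    """Generic speech-to-representation model (e.g. a Whisper encoder)."""

    def __init__(self, encoder: nn.Module):
        super().__init__()
        self.encoder = encoder

    def forward(self, audio: torch.Tensor) -> torch.Tensor:
        # Exactly what the name says: speech in, representation out.
        return self.encoder(audio)


class R2T(nn.Module):
    """Generic representation-to-text model (e.g. a Whisper decoder)."""

    def __init__(self, decoder: nn.Module):
        super().__init__()
        self.decoder = decoder

    def forward(self, representation: torch.Tensor):
        # Representation in, text (or token IDs) out.
        return self.decoder(representation)
```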

ASR, TTS, and Ichigo are pipeline implementations of the fundamental models, and they will also be handled via a YAML config, as in the sketch below.
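A rough sketch of how a config-driven pipeline could look. The YAML schema and the `ASRPipeline` class are assumptions for illustration, not the package's actual format; a TTS pipeline would look identical with t2r and r2s swapped in.

```python
import torch.nn as nn
import yaml

# Illustrative config only; the real schema is still to be defined.
ASR_CONFIG = """
pipeline: asr
s2r:
  name: whisper-encoder
r2t:
  name: whisper-decoder
"""


class ASRPipeline(nn.Module):
    """Chains s2r and r2t, mirroring `transcription = r2t(s2r(audio))`."""

    def __init__(self, s2r: nn.Module, r2t: nn.Module):
        super().__init__()
        self.s2r = s2r
        self.r2t = r2t

    def forward(self, audio):
        return self.r2t(self.s2r(audio))


config = yaml.safe_load(ASR_CONFIG)
# In the real package, config["s2r"] / config["r2t"] would select and load
# the concrete models; here the config is only parsed to show the shape.
```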

There can be built-in configs to achieve one-line usage, such as for IchigoASR, but users should also be able to define their own custom config.
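One possible way to support both built-in presets and user-supplied configs; the registry, file names, and helper below are hypothetical, just to illustrate the intent.

```python
# Hypothetical registry of built-in configs shipped with the package.
BUILTIN_CONFIGS = {
    "ichigo-asr": "configs/ichigo_asr.yaml",
    "ichigo-tts": "configs/ichigo_tts.yaml",
}


def resolve_config(name_or_path: str) -> str:
    """Return the bundled config for a known preset name; otherwise
    treat the argument as a path to a user-defined YAML file."""
    return BUILTIN_CONFIGS.get(name_or_path, name_or_path)


print(resolve_config("ichigo-asr"))   # configs/ichigo_asr.yaml (built-in preset)
print(resolve_config("my_asr.yaml"))  # my_asr.yaml (custom config)
```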

Related work

This is fundamentally just an extension of the ideas behind the framework introduced by WhisperSpeech.

@PodsAreAllYouNeed PodsAreAllYouNeed self-assigned this Jan 23, 2025
@github-project-automation github-project-automation bot moved this to Investigating in Menlo Jan 23, 2025
@tuanlda78202 tuanlda78202 self-assigned this Jan 23, 2025
@Yip-Jia-Qi Yip-Jia-Qi added this to the publication milestone Feb 6, 2025
@Yip-Jia-Qi Yip-Jia-Qi changed the title planning: ichigo framework overview planning: ichigo framework Technical Report Feb 11, 2025
Yip-Jia-Qi (Contributor) commented:

The original source of this way of doing things was SpeechT5:
https://github.com/microsoft/SpeechT5

SpeechT5 was ahead of its time:

  • It came out in October 2021 → https://arxiv.org/abs/2110.07205
  • Whisper only came out in Dec 2022 → https://arxiv.org/abs/2212.04356
  • SoundStream came out in Jul 2021 (https://arxiv.org/abs/2107.03312), around the same time, but codecs only really took off when EnCodec came out in October 2022 (https://arxiv.org/abs/2210.13438)

All the papers that came out after SpeechT5 are enabling us to realize its original vision. The method and architecture look very different because of all the recent advancements, but they are all along the same path.
