For the speech portion of the framework, we need to design around these 4 primitive models and the quantizer. Our research strategy will also be built around this structure, and the package will follow it as well.
This is not how things are currently designed in existing packages; Whisper, for example, is a single model.
To fit within our framework, we need to break it into:
Whisper Encoder = s2r
Whisper Decoder = r2t
In implementation, examples of how this will work are:

```python
# ASR task
transcription = r2t(s2r(audio))

# TTS
speech = r2s(t2r(text))
```
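As an illustration of the encoder/decoder split above, here is a minimal sketch using plain Python stand-ins. The `S2R`/`R2T` classes and their return values are hypothetical placeholders, not the real Whisper API; in the actual package both would be `nn.Module` subclasses.

```python
# Toy sketch of splitting a Whisper-style model into the s2r and r2t
# primitives. The internals are placeholders for the real encoder/decoder.

class S2R:
    """Speech -> representation (plays the role of the Whisper encoder)."""
    def __call__(self, audio):
        # A real implementation would return acoustic hidden states.
        return {"hidden_states": [len(audio)]}

class R2T:
    """Representation -> text (plays the role of the Whisper decoder)."""
    def __call__(self, representation):
        # A real implementation would autoregressively decode text tokens.
        return f"<transcript of {representation['hidden_states'][0]} frames>"

s2r = S2R()
r2t = R2T()

# ASR is then just composition of the two primitives:
audio = [0.0] * 16000          # one second of fake 16 kHz audio
transcription = r2t(s2r(audio))
print(transcription)           # <transcript of 16000 frames>
```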
How this will be implemented
r2t, t2r, r2s, and s2r will all inherit from nn.Module and implement forward methods that do exactly what their names say. They are generic classes; the specific implementation should be handled through a YAML config.
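One way this config-driven pattern could look is a registry that maps config names to implementations. This is a sketch under assumptions: the registry, the `identity_s2r` name, and the `build` helper are illustrative, not the package's actual API, and a plain dict stands in for a parsed YAML file.

```python
# Sketch of config-driven primitives: the concrete class behind a generic
# primitive is selected by a config entry (in the package, a YAML file).

PRIMITIVES = {}

def register(name):
    """Class decorator that records an implementation under a config key."""
    def wrap(cls):
        PRIMITIVES[name] = cls
        return cls
    return wrap

@register("identity_s2r")
class IdentityS2R:
    """Trivial s2r: passes audio through unchanged (stand-in forward())."""
    def __init__(self, **kwargs):
        self.kwargs = kwargs
    def forward(self, audio):
        return audio

def build(config):
    """Instantiate a primitive from a config entry (e.g. yaml.safe_load output)."""
    cls = PRIMITIVES[config["name"]]
    return cls(**config.get("params", {}))

# Equivalent to a YAML snippet like:
#   s2r:
#     name: identity_s2r
#     params: {sample_rate: 16000}
s2r = build({"name": "identity_s2r", "params": {"sample_rate": 16000}})
print(s2r.forward([1, 2, 3]))  # [1, 2, 3]
```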
ASR, TTS, and Ichigo are pipeline implementations built from the fundamental models, and they will also be handled via a YAML config.
There can be built-in configs to achieve one-line usage, such as for IchigoASR, but users should also be able to define their own custom config.
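The built-in versus custom config distinction can be sketched as follows. Everything here is hypothetical: the stage names, the `BUILTIN_CONFIGS` table, and the `pipeline` helper are stand-ins for whatever the package actually ships, and the primitives are toy functions.

```python
# Sketch: a pipeline is an ordered list of primitive names from a config.

def fake_s2r(audio):
    """Toy s2r: reduce audio to a frame count."""
    return len(audio)

def fake_r2t(rep):
    """Toy r2t: turn a representation into a 'transcript'."""
    return f"{rep} frames transcribed"

STAGES = {"s2r": fake_s2r, "r2t": fake_r2t}

# Built-in configs ship with the package...
BUILTIN_CONFIGS = {"IchigoASR": {"stages": ["s2r", "r2t"]}}

def pipeline(config):
    """Compose primitives left-to-right according to a (YAML-derived) config."""
    stages = [STAGES[name] for name in config["stages"]]
    def run(x):
        for stage in stages:
            x = stage(x)
        return x
    return run

# ...enabling a one-line instantiation:
asr = pipeline(BUILTIN_CONFIGS["IchigoASR"])
print(asr([0.0] * 8))  # 8 frames transcribed

# ...while a user-defined config works the same way:
custom = pipeline({"stages": ["s2r"]})
print(custom([0.0] * 8))  # 8
```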
Related work
This is fundamentally an extension of the ideas behind the framework pioneered by WhisperSpeech.
All the papers that came out after SpeechT5 are enabling us to realize SpeechT5's original vision. The method and architecture look very different because of all the recent advancements, but these are all along the same path.