training run: Ichigo Speech Tokenizer #144
Current PR (WIP): janhq/WhisperSpeech#8
Problem
Solution
Implementation
Preview table of comparison
After all experiments, we concluded that the best results came from: initializing from the duplicated 512-entry codebook checkpoint with noise added to preserve the weight values, concatenating multiple short audios up to 30 s, training on mixed datasets, and turning off the KL loss.
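The codebook initialization described above can be sketched as follows (a minimal sketch, assuming the codebook weights live in an `[n_codes, dim]` tensor; the function name, `factor`, and `noise_std` are hypothetical, not from the actual training code):

```python
import torch

def expand_codebook_with_noise(codebook_512: torch.Tensor,
                               factor: int = 2,
                               noise_std: float = 1e-3) -> torch.Tensor:
    """Tile a [512, dim] codebook `factor` times and perturb only the copies,
    so the duplicated entries can diverge during training while the original
    512 codes keep their learned values."""
    expanded = codebook_512.repeat(factor, 1)        # [512 * factor, dim]
    noise = torch.randn_like(expanded) * noise_std
    noise[: codebook_512.shape[0]] = 0               # keep the originals intact
    return expanded + noise

cb = torch.randn(512, 64)
new_cb = expand_codebook_with_noise(cb)              # [1024, 64]
```

The key design choice is that the noise is small relative to the weights, so all duplicated entries start near a useful region of the embedding space instead of random initialization.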
Phase 1: KL (distillation) loss + CE loss
Problem
Solution
Implementation
WER comparison
Preview table of comparison (ckpt phase 2)
Preview table of comparison (ckpt phase 1, better)
Errors in data sampling
Next test run:
What to validate
Only after validating the above points can we move forward with the next steps. cc @tuanlda78202 TBD today
Training model on high-quality datasets
Problem
Solution
Results
Phase 1 (with KL loss)
Training on viVoice (868k samples in jan-hq) and LibriTTS-R (112k samples), without the concatenated-30s dataset; early stopping triggers if validation accuracy does not improve for 10 epochs.

```python
# Implementation of the accuracy metric used for early stopping
def _update_validation_metrics(self, logits, output_toks):
    valid_toks = output_toks != -100  # ignore padding / masked positions
    self.val_true += (
        (logits.detach().argmax(-1)[valid_toks] == output_toks[valid_toks])
        .float()
        .sum()
    )
    self.val_total += valid_toks.float().sum()

def get_metrics(self):
    metrics = {
        "acc_0": (self.val_true / self.val_total).item(),
    }
    self.val_true[:] = 0
    self.val_total[:] = 0
    return metrics
```

Result on:
| Exp ID | Number of samples | Best epoch | Training time | Accuracy | Loss |
|---|---|---|---|---|---|
| p1-vivoice+librittsr | 10000 | 29 | 2d 12h 29m 37s | 0.89 | 14.59 |
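The early-stopping rule used here (halt when validation accuracy has not improved for 10 epochs) can be sketched as a small helper; the class name and wiring are hypothetical, not taken from the actual trainer:

```python
class EarlyStopping:
    """Stop training after `patience` consecutive epochs without
    improvement in the monitored accuracy metric."""

    def __init__(self, patience: int = 10):
        self.patience = patience
        self.best_acc = float("-inf")
        self.bad_epochs = 0

    def step(self, acc: float) -> bool:
        """Call once per epoch with the validation accuracy.
        Returns True when training should stop."""
        if acc > self.best_acc:
            self.best_acc = acc
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In practice this would be called in the training loop with the `acc_0` value returned by `get_metrics()`.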
Visualization of loss & accuracy on the validation phase (plots omitted).
Testing
Summary results
| Model Name | Phase | Epoch | Train dataset | Test dataset | Test language | Test samples | WER |
|---|---|---|---|---|---|---|---|
| Ichigo Quantizer | 1 | 10 | Bud500 + LibriTTS-R | LibriTTS-R | En | 4689 | 0.56 |
| Ichigo Quantizer | 1 | 29 | viVoice + LibriTTS-R | LibriTTS-R | En | 4689 | 0.13 |
| Ichigo Quantizer | 2 | 100 | Bud500 + LibriTTS-R | LibriTTS-R | En | 4689 | 1.90 |
| Ichigo Quantizer | 2 | 10 | Bud500 + LibriTTS-R | LibriTTS-R | En | 4689 | 0.22 |
| PhoWhisper Large | - | - | - | LibriTTS-R | En | 4689 | 0.47 |
| Whisper Medium | - | - | - | LibriTTS-R | En | 4689 | 0.12 |
| Model Name | Phase | Epoch | Train dataset | Test dataset | Test language | Test samples | WER |
|---|---|---|---|---|---|---|---|
| Ichigo Quantizer | 1 | 29 | viVoice + LibriTTS-R | viVoice | Vi | 10000 | 0.21 |
| PhoWhisper Large | - | - | - | viVoice | Vi | 10000 | 0.23 |
| Whisper Medium | - | - | - | viVoice | Vi | 10000 | 0.18 |
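For reference, the WER values in these tables can be sanity-checked with a minimal word-level edit-distance implementation (a sketch only; the thread does not show the actual evaluation pipeline or its text normalization):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by the
    number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

Note that a WER above 1.0 (as in the phase 2, 100-epoch row) is possible when the hypothesis contains more errors, e.g. insertions, than the reference has words.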
- LibriTTS-R (plots omitted)
- viVoice (plots omitted)
Phase 2 (without KL loss)
Phase 1 (29e), Phase 2 (10e)

```python
prompt = f"You are a professional transcriber, fluent in {prefix_lang}. You are listening to a recording in which a person is potentially speaking {prefix_lang}, and no other languages. They may have a strong accent. You are to transcribe utterances of {prefix_lang} accordingly"
```
Phase 1 full epochs (100e), Phase 2 10e [Ongoing]
Phase 1
Bug: Incorrect Mask Generation After Audio Padding
Description
During code review, we discovered that the mask generation after audio padding is incorrectly implemented: the current code creates masks with all 1s even over padded audio. Given the padded audio with the padding value = 0:
```python
# Current problematic code
concatenated_audio = self.pad_audio(concatenated_audio)  # audio is padded first
mask = torch.zeros(30 * 16000 // 320, dtype=torch.bool)
audio_frames = min(len(concatenated_audio), self.max_audio_length) // 320
mask[:audio_frames] = 1  # Bug: the padded length is used, so padding frames are marked valid

# Impact on training
if self.training and self.config.mask_embs and mask is not None:
    x[~mask] = project_out(self.rq.layers[0]._codebook.embed[0, self.vq_codes])
```

Impact
Next Steps
This is solved by PR janhq/WhisperSpeech#19.
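The essence of the fix is to measure the true audio length before padding, so frames covering padding stay masked out. A minimal sketch (the constants `HOP` and `MAX_SAMPLES` and the function name are assumptions for illustration; the actual change lives in the linked PR):

```python
import torch

HOP = 320                 # samples per mask frame, matching the 320 above
MAX_SAMPLES = 30 * 16000  # 30 s of 16 kHz audio

def make_mask(audio: torch.Tensor) -> torch.Tensor:
    """Build a frame-level validity mask from UNPADDED audio.
    Only frames covering real audio are set True; frames that would
    cover padding remain False."""
    true_len = min(audio.shape[-1], MAX_SAMPLES)  # length BEFORE padding
    mask = torch.zeros(MAX_SAMPLES // HOP, dtype=torch.bool)
    mask[: true_len // HOP] = True
    return mask
```

With this ordering, the `x[~mask] = ...` branch in training correctly replaces the padding positions with the mask embedding instead of leaving them as if they were speech.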
[Add PAD tokens] IchigoWhisper Phase 1 50e, Phase 2 20e
Phase 1
Phase 2 (Ongoing)
[Merge Codebooks] IchigoWhisper Phase 1 50e, Phase 2 5e
How to merge?
```
# 1. Initial State
Codebook 512: [512 codes + 1 mask token]
  [C1 C2 C3 ... C512 M]
Codebook 2048: [2048 codes + 1 mask token]
  [D1 D2 D3 ... D2048 M]

# 2. Remove Mask Token from 512
Codebook 512 (without mask):
  [C1 C2 C3 ... C512]        # 512 codes
Codebook 2048 (keeps mask):
  [D1 D2 D3 ... D2048 M]     # 2049 codes

# 3. Create New Empty Codebook
New size = 512 + 2049 = 2561 codes
  [_ _ _ ... _ _ _]          # 2561 empty slots

# 4. Merge Process
Step 1: Copy 2048 + mask first
  [D1 D2 D3 ... D2048 M | _ _ _ ... _ _ _ ]
  |----2049 codes----|   |----512 slots----|
Step 2: Copy 512 codes after
  [D1 D2 D3 ... D2048 M | C1 C2 C3 ... C512]
  |----2049 codes----|   |----512 codes----|
```

Experiments
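The merge procedure above reduces to a couple of tensor operations (a sketch, assuming each codebook is an `[n_codes, dim]` embedding tensor with the mask token as its last row; the function name is hypothetical):

```python
import torch

def merge_codebooks(cb_512: torch.Tensor, cb_2048: torch.Tensor) -> torch.Tensor:
    """cb_512: [513, dim] (512 codes + mask token),
    cb_2048: [2049, dim] (2048 codes + mask token).
    Returns [2561, dim]: the 2048-codebook (mask kept) followed by the
    512 codes with their mask token dropped."""
    codes_512 = cb_512[:-1]                         # drop the 512-codebook mask
    return torch.cat([cb_2048, codes_512], dim=0)   # 2049 + 512 = 2561 rows

merged = merge_codebooks(torch.randn(513, 64), torch.randn(2049, 64))
```

Keeping the 2048-codebook first preserves its code indices (including the mask token at index 2048), so only the appended 512 codes get new indices.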
I updated the README for this run. Some information I still need to fill in:
Ichigo Whisper (non-cut PAD, phase 1 50e, full test dataset)
viVoice (10000)
LibriTTS-R (4689)
Training Spec
Ichigo Whisper (non-cut PAD, phase 1 50e)
First 1000 test samples
viVoice (1000)
LibriTTS-R (1000)
Goal
Tune to find the best hyperparameters for the Ichigo Quantizer
Hypothesis
Task:
Training Result:
The training results are here: #146