Errors noticed after extensively testing Zipformer model #1465
Comments
Did you train this on your own data, and do you have Conformer-based baselines that don't show these issues, or at least where the issues are significantly less frequent? It would be interesting to see some kind of numerical comparison. Also, are you comparing just different types of encoder (Conformer vs Zipformer), or were there also changes in the loss function and decoding method? |
Yes. Both the Conformer and Zipformer models are trained on the same dataset (my own), which is inherently quite noisy and challenging. And yes, the Conformer does not suffer from any of the 3 mentioned categories of errors, but in spite of these errors the overall accuracy of the Zipformer is slightly better than the Conformer's. Category (a) errors are the rarest and category (c) errors happen often, but category (b) errors are the most critical and puzzling, as those segments sometimes have quite decent and audible pronunciation of the word(s) and the model simply does not decode one or more of them. A rarer occurrence is when an entire audio segment (say around 4-5 seconds long) produces no transcript at all. |
@kafan1986 Could you please also answer this question? |
@csukuangfj @kafan1986 I am experiencing the same thing, especially the b) part. In my case, yes, everything is the same. |
I also mostly see excessive deletions in both the offline and streaming models; not just words but complete phrases or parts of phrases get ignored. The streaming version very often first correctly recognizes what is being said and then deletes it. |
For the deletion errors, have you tried the fix in #1447? |
@csukuangfj I am using the sherpa-onnx online websocket server for my tests, latest master, which I think is not affected? |
If you could take examples that have problematic deletions and shift the input by 1, 2, or 4 frames, and see whether the deletions still appear, that would be interesting. I'm wondering whether the lack of complete invariance to frame-shifting could be part of the issue. |
I did a quick test by removing a second from the (longer) file with Audacity, but the same phrase was still deleted (with 32-256); the missing phrase is 9 seconds long. I then tried something else: (this was just a quick test on one sample where I know I had a problem) |
Similar here: in streaming mode I see many deletions with 64-256 zipformers; a 16-64 zipformer is much better. It also helps to delete silence chunks between phrases. My models are trained with MUSAN augmentation; it seems that the search never leaves blank. |
I also have the impression that sometimes things get worse after a period of silence (when testing live on the web demo). |
@nshmyrev these are just differences in decoding settings you are comparing, right, not different models? |
It would be good to vary the chunk size & context independently to see whether one or the other is more responsible for the differences. |
@danpovey correct, the model is the same, just different onnx exports with different chunk and context settings. The chunk size seems to have the bigger impact for me at first glance, although both make a difference; I will do some more tests a bit later. |
Right, same model, just different chunk sizes after export to onnx. I'll prepare a more detailed test a bit later; I wanted to convert LibriSpeech to a streaming test to showcase this. |
I suspect that in your training sets, some of the data had largish deletions in the transcript; and the model learned: "if the left-context seems to be wrong, continue to output nothing until we get to a silence." |
@danpovey I think that might indeed be the issue. I have found such data before and tried to filter it out, but I'm sure there will be such cases left. Unfortunately I won't be able to implement that change, but I could test it if somebody else can. |
For me yes, a 1-2 frame shift can suddenly improve or worsen WER by ~0.1-0.3% absolute on my test dataset where I have ~10.0% WER, so I'd consider this beyond random noise. This is with absolutely the same model/decoder/everything; I just pad the features with zero frames (log zero, so a -20.0 pad value in reality). I also found that you can rebalance deletions and insertions to make deletions rarer by using this blank_penalty https://github.com/k2-fsa/icefall/blob/7bdde9174c7c95a32a10d6dcbc3764ecb4873b1d/egs/librispeech/ASR/zipformer/streaming_beam_search.py#L75 that just subtracts some number from the log prob that corresponds to blank. But this introduces another hyper-parameter that you'll have to tune against your domain, and it looks more like a hack than a cure for the core reason. I wonder if anyone else has tried to tweak this blank_penalty?
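For anyone who wants to reproduce the frame-shift test, here is a minimal sketch of the padding I mean (a hypothetical helper, not something that ships with icefall; it just prepends N log-zero frames to the fbank matrix before decoding):

```python
import torch

def shift_features(features: torch.Tensor, num_frames: int, pad_value: float = -20.0) -> torch.Tensor:
    """Prepend `num_frames` 'silence' frames to a (T, num_mel_bins) log-fbank matrix.

    pad_value approximates log(0); -20.0 is the value mentioned above.
    """
    pad = torch.full((num_frames, features.size(1)), pad_value, dtype=features.dtype)
    return torch.cat([pad, features], dim=0)

# Decode shift_features(feats, 1), shift_features(feats, 2) and shift_features(feats, 4)
# and compare the hypotheses against the unshifted decode.
```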
|
I haven't tried to shift the frames, but the blank_penalty does improve the WER quite a bit. |
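(For reference, the blank_penalty linked above amounts to subtracting a constant from the blank log-prob before the next token is chosen. Below is a paraphrased sketch rather than a verbatim copy of the icefall code, assuming blank is token 0 as in the zipformer recipes:)

```python
import torch

def apply_blank_penalty(logits: torch.Tensor, blank_penalty: float, blank_id: int = 0) -> torch.Tensor:
    """logits: (batch, vocab_size) joiner output for the current frame."""
    if blank_penalty != 0.0:
        logits = logits.clone()
        # Making blank less attractive trades deletions for insertions.
        logits[:, blank_id] -= blank_penalty
    return logits
```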
We could maybe try a heuristic of increasing the blank penalty as the number of successive blanks rises. It's a bit ugly but might help address deletions after silences.
Maybe the model never sees long silence in training, so it gets confused. We could try occasionally adding segments with lots of silence in training, perhaps.
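A very rough sketch of that heuristic, written against precomputed per-frame log-probs for readability; real transducer greedy search updates the decoder state after every emitted token, so treat this as an illustration of the idea rather than working icefall code, and the growth rate is an arbitrary placeholder:

```python
import torch

def greedy_with_adaptive_blank_penalty(
    log_probs: torch.Tensor,  # (num_frames, vocab_size)
    base_penalty: float = 0.0,
    growth: float = 0.05,     # extra penalty per consecutive blank; needs tuning
    blank_id: int = 0,
) -> list:
    hyp = []
    consecutive_blanks = 0
    for t in range(log_probs.size(0)):
        frame = log_probs[t].clone()
        # The longer we have been emitting blanks, the harder we push against blank.
        frame[blank_id] -= base_penalty + growth * consecutive_blanks
        token = int(frame.argmax())
        if token == blank_id:
            consecutive_blanks += 1
        else:
            consecutive_blanks = 0
            hyp.append(token)
    return hyp
```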
|
Also it would be nice to check that the issue is not specific to Sherpa/onnx.
|
@danpovey I've tried with a dataset with long silences within the transcripts, but it didn't resolve the issue. |
I have used Efficient Conformer (https://github.com/burchim/EfficientConformer) for testing the conformer ASR model, and it gives better accuracy than the vanilla conformer implementation. The loss used for the Efficient Conformer model was RNN-T (param size ~10.7M). For the Zipformer I used the default small configuration for training and decoding (~23M-param model). |
What value of blank_penalty have you used? And what was the WER with and without blank_penalty in your testing? As you mentioned, there was a significant improvement with it. |
@danpovey The main issue I am seeing on the live web demo is this: my datasets contain a lot of silence before, after, and in the middle of transcripts; could it be that the model learnt that if one part is silence, the rest most likely is too? I intentionally trained on data with a lot of silence to discourage the model from outputting very common short words (like "yes" or "hi") during silence. |
@joazoa I'd be more concerned about whether there were chunks of speech without a corresponding transcription in your dataset. Regarding this: |
I confirm that greedy search works a lot better, thank you! @danpovey |
I have also run into these problems, in particular: 1) after a long silence, when recognition suddenly resumes, the first few words are mostly recognized incorrectly; 2) some words get dropped. However, I know clearly the cause of these problems: the original zipformer had a pooling layer. I did experiments at the time: with the pooling layer removed, these two phenomena appear, but with the pooling layer added back, they disappear. The issue is that I found the later, updated zipformer differs slightly in structure from the original zipformer; in particular, the new zipformer natively has no pooling layer. I don't know whether that is the reason in your case. |
A relevant paper from Google on high RNN-T deletion rates: "RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions". |
Another issue, from the Nvidia TDT paper (https://arxiv.org/abs/2304.06795), where WER for repeated tokens goes as high as 60%: "We notice that RNN-T models often suffer serious performance degradation when the text sequence has repetitions of the…" |
We're looking into the feasibility of switching over from RNN-T to TDT. |
I'm facing similar issues after training Zipformer2 on CommonVoice (including the "other" cuts) for Spanish. I got a WER of less than 5% after 50 epochs and in general it works pretty well, but sometimes it seems "confused" and gets no output for >30 s (not sure how to explain it, I'll try to record a video). Tried using I'm trying now to train Zipformer with a CTC attention head; any other thing that I can try? |
My guess is that, if it's not giving outputs for >30 s of input, this may be a generalization issue due to the training data having only very short utterances, i.e. nothing approaching 30 seconds. Perhaps you could try concatenating some CommonVoice utterances and their transcripts together (see the sketch below), or including some longer utterances from some other source. By the way, I think we found that TDT used too much memory to work well with our training setups, since our pruned RNN-T is quite optimized for memory usage and we use large batch sizes. |
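One possible way to build such longer examples is with lhotse's CutConcatenate transform. This is only a sketch assuming the usual K2SpeechRecognitionDataset setup; the manifest path is hypothetical, and the librispeech-style asr_datamodule in icefall exposes a similar --concatenate-cuts option if your copy still has it:

```python
from torch.utils.data import DataLoader
from lhotse import CutSet
from lhotse.dataset import CutConcatenate, K2SpeechRecognitionDataset, SimpleCutSampler

cuts = CutSet.from_file("cv-es_cuts_train.jsonl.gz")  # hypothetical CommonVoice manifest

dataset = K2SpeechRecognitionDataset(
    cut_transforms=[
        # Chain short cuts (with a small silence gap) so that each training
        # example becomes roughly as long as the longest cut in the batch.
        CutConcatenate(gap=1.0, duration_factor=1.0),
    ],
)
sampler = SimpleCutSampler(cuts, max_duration=200.0, shuffle=True)
dataloader = DataLoader(dataset, sampler=sampler, batch_size=None)
```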
@danpovey I have the opposite experience. All utterances in my training data are under 20 seconds. During inference I have used segments of up to 57 seconds, and the accuracy is quite good, if not better than for shorter segments (less than 5-6 seconds). Regarding your feedback on TDT, does it help with OOV and deletion errors? |
@kafan1986 when you say >30 s, do you mean samples over 30 s, or are you using streaming on long-form audio? |
Non-streaming variant during inference time. Entire audio segment in one go. |
I have extensively used the Zipformer model (both streaming and non-streaming variants) and I have noticed the following errors. The tests were done with greedy search as well as with higher beam-size values, but no LM. The errors below are for the non-streaming variant, which should reach the highest accuracy.
a) Sometimes incorrect predictions at the start as well as at the end of the audio segment.
b) Sometimes clearly audible words get deleted (not predicted at all). This is the most critical error, and its occurrence count is significant.
c) Two separate words sometimes get conjoined, either with their correct spelling ("good morning" => "goodmorning") or with some weird spelling ("good morning" => "goodorning").
Overall accuracy is slightly ahead of the conformer ASR model. The main advantage of Zipformer is the training speed-up: there is at least a 5x speed-up in training time compared to the conformer. Also, the Zipformer ASR model seems to be more phonetic and does a better job when predicting out-of-vocabulary words.
If the errors a), b), and c) reported above can be improved, especially b), then the Zipformer model can be state of the art for its size.