Question about partial_run #237
-
I found an interesting script for converting the whisper model to ONNX format: https://github.com/k2-fsa/sherpa-onnx/blob/master/scripts/whisper/export-onnx.py

In this script they export the encoder in a different way: instead of outputting audio features, the encoder computes the cross-attention key/value cache that would otherwise be computed inside the decoder: https://github.com/k2-fsa/sherpa-onnx/blob/master/scripts/whisper/export-onnx.py#L108

The main advantage is that it reduces the size of the weights by about 30%.
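To make that concrete, here is a rough sketch of the inference flow this export implies, written against rten since that's what this repo targets. The encoder is run once per utterance and returns per-layer cross-attention K/V tensors, which then become loop-invariant inputs to the decoder. The node names (`mel`, `n_layer_cross_k`, `n_layer_cross_v`) come from the export script; the rten calls and types are from memory and may differ slightly between versions, so treat this as an assumption-laden sketch rather than working code:

```rust
use rten::Model;
use rten_tensor::prelude::*;
use rten_tensor::NdTensor;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // File name is a placeholder for a model converted with the script above.
    let encoder = Model::load_file("whisper-encoder.rten")?;

    // Input/output names as used by the sherpa-onnx export script
    // (assumptions about this particular model).
    let mel_id = encoder.find_node("mel").ok_or("input not found")?;
    let k_id = encoder.find_node("n_layer_cross_k").ok_or("output not found")?;
    let v_id = encoder.find_node("n_layer_cross_v").ok_or("output not found")?;

    // 30s of 80-bin log-mel features (placeholder contents).
    let mel = NdTensor::<f32, 3>::zeros([1, 80, 3000]);

    // One encoder call per utterance. Instead of audio features, the
    // outputs are the cross-attention K/V for every decoder layer,
    // computed here once rather than inside the decoder on each step.
    let mut outputs = encoder.run(
        vec![(mel_id, mel.view().into())],
        &[k_id, v_id],
        None,
    )?;
    let cross_v = outputs.remove(1);
    let cross_k = outputs.remove(0);

    // `cross_k` / `cross_v` are now fixed for the whole decode loop;
    // every decoder step takes them as inputs unchanged, which is what
    // raises the question about `partial_run` below.
    let _ = (cross_k, cross_v);
    Ok(())
}
```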
---
While going through this repository I had two thoughts on how to update the whisper example in https://github.com/igor-yusupov/comparisons-rten: …
---
The `partial_run` optimization works in any situation where there are model inputs that stay the same across each iteration of the decoder loop, and there are chunks of the graph which depend only on those inputs. For encoder-decoder transformer models this should always include the cross-attention inputs. For this model I think using `partial_run` with the `n_layer_cross_k` and `n_layer_cross_v` inputs should work, but you'll have to try it.
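Roughly, I'd expect the usage to look like the sketch below. Treat it as pseudocode: it assumes `partial_run` evaluates whatever part of the graph is computable from the supplied inputs and returns the resulting intermediate values, which can then be fed into later `run` calls. Node names (`tokens`, `logits`, the cross K/V names), the start token ID, and the exact rten types/signatures are all assumptions that may differ from your version of the library:

```rust
use rten::{Model, NodeId};
use rten_tensor::NdTensor;

// Decode with the cross-attention inputs hoisted out of the loop via
// `partial_run`. `cross_k` / `cross_v` come from a single encoder run.
fn decode(
    decoder: &Model,
    cross_k: rten::Output, // type name is an assumption; newer
    cross_v: rten::Output, // versions may call this `Value`
) -> Result<Vec<i32>, Box<dyn std::error::Error>> {
    let k_id = decoder.find_node("n_layer_cross_k").ok_or("input not found")?;
    let v_id = decoder.find_node("n_layer_cross_v").ok_or("input not found")?;
    let tokens_id = decoder.find_node("tokens").ok_or("input not found")?;
    let logits_id = decoder.find_node("logits").ok_or("output not found")?;

    // One-time pre-computation: evaluate every node that is computable
    // from the loop-invariant inputs alone, and cache the results.
    let cached: Vec<(NodeId, rten::Output)> = decoder.partial_run(
        vec![(k_id, cross_k.into()), (v_id, cross_v.into())],
        &[logits_id],
        None,
    )?;

    let mut tokens = vec![50258i32]; // <|startoftranscript|>, assumed
    for _ in 0..128 {
        let tokens_in = NdTensor::from_data([1, tokens.len()], tokens.clone());

        // Per-step inputs = the new tokens + the cached intermediates.
        let mut inputs = vec![(tokens_id, tokens_in.view().into())];
        inputs.extend(cached.iter().map(|(id, v)| (*id, v.clone().into())));

        let logits = decoder.run(inputs, &[logits_id], None)?.remove(0);
        // ... argmax over `logits`, break on end-of-text, otherwise
        // push the new token (omitted for brevity) ...
        let _ = logits;
        break;
    }
    Ok(tokens)
}
```

How much this saves depends on how much of the decoder graph actually hangs only off those two inputs, which is why it's worth measuring rather than assuming.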
By the way, I recently added a new `rten-generate` crate to this repo which simplifies using transformer decoder models and applies the various key optimizations (using the KV cache, `partial_run`, etc.). See the `rten-examples/src/gpt2.rs` demo. It won't work for this…