Question about partial_run #237
-
I found an interesting script for converting the whisper model to ONNX format: https://github.com/k2-fsa/sherpa-onnx/blob/master/scripts/whisper/export-onnx.py

In this script they export the encoder in a different way: instead of outputting audio features, the encoder computes the cross-attention key/value cache that would otherwise be computed inside the decoder: https://github.com/k2-fsa/sherpa-onnx/blob/master/scripts/whisper/export-onnx.py#L108

The main advantage is that it reduces the size of the weights by about 30%.
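To make that concrete, here is a rough sketch of the inference flow this export implies, written against rten since that's what this repo targets. The encoder is run once per utterance and returns per-layer cross-attention K/V tensors, which then become loop-invariant inputs to the decoder. The node names (`mel`, `n_layer_cross_k`, `n_layer_cross_v`) come from the export script; the rten calls and types are from memory and may differ slightly between versions, so treat this as an assumption-laden sketch rather than working code:

```rust
use rten::Model;
use rten_tensor::prelude::*;
use rten_tensor::NdTensor;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // File name is a placeholder for a model converted with the script above.
    let encoder = Model::load_file("whisper-encoder.rten")?;

    // Input/output names as used by the sherpa-onnx export script
    // (assumptions about this particular model).
    let mel_id = encoder.find_node("mel").ok_or("input not found")?;
    let k_id = encoder.find_node("n_layer_cross_k").ok_or("output not found")?;
    let v_id = encoder.find_node("n_layer_cross_v").ok_or("output not found")?;

    // 30s of 80-bin log-mel features (placeholder contents).
    let mel = NdTensor::<f32, 3>::zeros([1, 80, 3000]);

    // One encoder call per utterance. Instead of audio features, the
    // outputs are the cross-attention K/V for every decoder layer,
    // computed here once rather than inside the decoder on each step.
    let mut outputs = encoder.run(
        vec![(mel_id, mel.view().into())],
        &[k_id, v_id],
        None,
    )?;
    let cross_v = outputs.remove(1);
    let cross_k = outputs.remove(0);

    // `cross_k` / `cross_v` are now fixed for the whole decode loop;
    // every decoder step takes them as inputs unchanged, which is what
    // raises the question about `partial_run` below.
    let _ = (cross_k, cross_v);
    Ok(())
}
```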
---
While going through this repository I had two thoughts on how to update the whisper example in https://github.com/igor-yusupov/comparisons-rten: …
---
The `partial_run` optimization works in any situation where there are model inputs that stay the same across each iteration of the decoder loop, and there are chunks of the graph which depend only on those inputs. For encoder-decoder transformer models this should always include the cross-attention inputs. For this model I think using `partial_run` with the `n_layer_cross_k` and `n_layer_cross_v` inputs should work, but you'll have to try it.
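Roughly, I'd expect the usage to look like the sketch below. Treat it as pseudocode: it assumes `partial_run` evaluates whatever part of the graph is computable from the supplied inputs and returns the resulting intermediate values, which can then be fed into later `run` calls. Node names (`tokens`, `logits`, the cross K/V names), the start token ID, and the exact rten types/signatures are all assumptions that may differ from your version of the library:

```rust
use rten::{Model, NodeId};
use rten_tensor::NdTensor;

// Decode with the cross-attention inputs hoisted out of the loop via
// `partial_run`. `cross_k` / `cross_v` come from a single encoder run.
fn decode(
    decoder: &Model,
    cross_k: rten::Output, // type name is an assumption; newer
    cross_v: rten::Output, // versions may call this `Value`
) -> Result<Vec<i32>, Box<dyn std::error::Error>> {
    let k_id = decoder.find_node("n_layer_cross_k").ok_or("input not found")?;
    let v_id = decoder.find_node("n_layer_cross_v").ok_or("input not found")?;
    let tokens_id = decoder.find_node("tokens").ok_or("input not found")?;
    let logits_id = decoder.find_node("logits").ok_or("output not found")?;

    // One-time pre-computation: evaluate every node that is computable
    // from the loop-invariant inputs alone, and cache the results.
    let cached: Vec<(NodeId, rten::Output)> = decoder.partial_run(
        vec![(k_id, cross_k.into()), (v_id, cross_v.into())],
        &[logits_id],
        None,
    )?;

    let mut tokens = vec![50258i32]; // <|startoftranscript|>, assumed
    for _ in 0..128 {
        let tokens_in = NdTensor::from_data([1, tokens.len()], tokens.clone());

        // Per-step inputs = the new tokens + the cached intermediates.
        let mut inputs = vec![(tokens_id, tokens_in.view().into())];
        inputs.extend(cached.iter().map(|(id, v)| (*id, v.clone().into())));

        let logits = decoder.run(inputs, &[logits_id], None)?.remove(0);
        // ... argmax over `logits`, break on end-of-text, otherwise
        // push the new token (omitted for brevity) ...
        let _ = logits;
        break;
    }
    Ok(tokens)
}
```

How much this saves depends on how much of the decoder graph actually hangs only off those two inputs, which is why it's worth measuring rather than assuming.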
By the way, I recently added a new `rten-generate` crate to this repo which simplifies using transformer decoder models and applies the various key optimizations (using the KV cache, `partial_run`, etc.). See the `rten-examples/src/gpt2.rs` demo. It won't work for this…