
Add TensorRT-LLM Whisper Tiny.en/Base.en/Small.en/Medium.en/Large-v1/Large-v2 Models #44

Open
Deep-unlearning opened this issue Nov 26, 2024 · 3 comments

Comments

@Deep-unlearning

Hello,

As the TensorRT-LLM Whisper-large-v3 and Whisper-large-v3-turbo models are ready, I wanted to add the remaining Whisper models.

To convert the checkpoints, I followed the scripts provided here: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/whisper

And to run evals, I reused the scripts in https://github.com/huggingface/open_asr_leaderboard/tree/main/tensorrtllm

The results I got are significantly worse than they should be. The numbers below are with weights in float16:

Filtering models by id: tiny.en

Results per dataset:
whisper_tiny.e/whisper_tiny.en | hf-audio-esb-datasets-test-only-sorted_ami_test: WER = 99.98 %, RTFx = 658.40
whisper_tiny.e/whisper_tiny.en | hf-audio-esb-datasets-test-only-sorted_earnings22_test: WER = 99.69 %, RTFx = 1719.86
whisper_tiny.e/whisper_tiny.en | hf-audio-esb-datasets-test-only-sorted_gigaspeech_test: WER = 100.08 %, RTFx = 1538.20
whisper_tiny.e/whisper_tiny.en | hf-audio-esb-datasets-test-only-sorted_librispeech_test.clea: WER = 99.49 %, RTFx = 1760.50
whisper_tiny.e/whisper_tiny.en | hf-audio-esb-datasets-test-only-sorted_librispeech_test.other: WER = 99.77 %, RTFx = 1570.98
whisper_tiny.e/whisper_tiny.en | hf-audio-esb-datasets-test-only-sorted_spgispeech_test: WER = 99.93 %, RTFx = 2171.81
whisper_tiny.e/whisper_tiny.en | hf-audio-esb-datasets-test-only-sorted_tedlium_test: WER = 99.71 %, RTFx = 1926.79
whisper_tiny.e/whisper_tiny.en | hf-audio-esb-datasets-test-only-sorted_voxpopuli_test: WER = 99.78 %, RTFx = 2215.32
whisper_small.e/whisper_small.en | hf-audio-esb-datasets-test-only-sorted_ami_test: WER = 21.67 %, RTFx = 308.40
whisper_small.e/whisper_small.en | hf-audio-esb-datasets-test-only-sorted_earnings22_test: WER = 19.81 %, RTFx = 689.10
whisper_small.e/whisper_small.en | hf-audio-esb-datasets-test-only-sorted_gigaspeech_test: WER = 14.89 %, RTFx = 654.89
whisper_small.e/whisper_small.en | hf-audio-esb-datasets-test-only-sorted_librispeech_test.clea: WER = 4.60 %, RTFx = 739.03
whisper_small.e/whisper_small.en | hf-audio-esb-datasets-test-only-sorted_librispeech_test.other: WER = 8.06 %, RTFx = 669.27
whisper_small.e/whisper_small.en | hf-audio-esb-datasets-test-only-sorted_spgispeech_test: WER = 11.02 %, RTFx = 866.02
whisper_small.e/whisper_small.en | hf-audio-esb-datasets-test-only-sorted_tedlium_test: WER = 9.75 %, RTFx = 795.86
whisper_small.e/whisper_small.en | hf-audio-esb-datasets-test-only-sorted_voxpopuli_test: WER = 9.61 %, RTFx = 930.41
whisper_large/v2 | hf-audio-esb-datasets-test-only-sorted_ami_test: WER = 21.54 %, RTFx = 93.02
whisper_large/v2 | hf-audio-esb-datasets-test-only-sorted_earnings22_test: WER = 20.38 %, RTFx = 199.15
whisper_large/v2 | hf-audio-esb-datasets-test-only-sorted_gigaspeech_test: WER = 12.62 %, RTFx = 176.35
whisper_large/v2 | hf-audio-esb-datasets-test-only-sorted_librispeech_test.clea: WER = 4.88 %, RTFx = 200.96
whisper_large/v2 | hf-audio-esb-datasets-test-only-sorted_librispeech_test.other: WER = 7.38 %, RTFx = 184.10
whisper_large/v2 | hf-audio-esb-datasets-test-only-sorted_spgispeech_test: WER = 9.83 %, RTFx = 230.22
whisper_large/v2 | hf-audio-esb-datasets-test-only-sorted_tedlium_test: WER = 14.30 %, RTFx = 201.43
whisper_large/v2 | hf-audio-esb-datasets-test-only-sorted_voxpopuli_test: WER = 9.20 %, RTFx = 249.95


Composite Results:
whisper_tiny.e/whisper_tiny.en: WER = 99.80 %
whisper_tiny.e/whisper_tiny.en: RTFx = 1760.43
whisper_small.e/whisper_small.en: WER = 12.43 %
whisper_small.e/whisper_small.en: RTFx = 732.23
whisper_large/v2: WER = 12.52 %
whisper_large/v2: RTFx = 198.94
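For reference, the two metrics in these logs can be sketched as follows. This is a minimal sketch, not the leaderboard's exact scoring code (text normalization and dataset weighting are omitted):

```python
# Sketch only: word-level WER via Levenshtein distance, and RTFx as
# seconds of audio processed per second of wall-clock time.

def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

def rtfx(audio_seconds: float, wall_seconds: float) -> float:
    """Real-time factor: seconds of audio transcribed per wall-clock second."""
    return audio_seconds / wall_seconds

# An empty hypothesis scores exactly 100% WER (all deletions),
# and insertions can push WER above 100%, as in the 100.08% line above.
assert wer("the cat sat", "") == 1.0
assert wer("the cat sat", "a b c d") > 1.0
```

WER pinned near 100% across every dataset, as for tiny.en here, usually means the hypotheses are empty or entirely unrelated to the references, i.e. a decoding bug rather than a model-quality gap.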


I also tried bfloat16, but it did not improve the results.

Are there specific considerations when converting/evaluating smaller models (e.g., tiny, base, small) for TensorRT-LLM?
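One thing that may be worth checking for the `.en` checkpoints: the English-only models use a different tokenizer than the multilingual ones (vocab size 51864 vs 51865), so their special token ids are shifted by one. If the conversion or eval script hard-codes the multilingual ids (as tuned for large-v3), decoding starts from the wrong prompt tokens and WER collapses to ~100%. The ids below follow the openai/whisper reference tokenizer; treat them as assumptions to verify against the converted checkpoint, not ground truth:

```python
# Special token ids as defined in the openai/whisper reference tokenizer.
# Assumed values, worth double-checking against the actual checkpoints.
MULTILINGUAL = {"sot": 50258, "eot": 50257, "no_timestamps": 50363}
ENGLISH_ONLY = {"sot": 50257, "eot": 50256, "no_timestamps": 50362}

# Every special id is off by one between the two tokenizer families,
# so reusing large-v3's ids for tiny.en decodes garbage from step one.
for key in MULTILINGUAL:
    assert MULTILINGUAL[key] == ENGLISH_ONLY[key] + 1
```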

@Deep-unlearning
Author

cc @Vaibhavs10

@yuekaizhang
Contributor

Fix #46.
