Hello,
As the TensorRT-LLM Whisper-large-v3 and Whisper-large-v3-turbo models are ready, I wanted to add the remaining Whisper models.
To convert the checkpoints, I followed the scripts provided here: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/whisper
To run the evals, I reused the scripts in https://github.com/huggingface/open_asr_leaderboard/tree/main/tensorrtllm
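For context, the conversion follows the example README: convert the openai-whisper checkpoint into TRT-LLM format, then build separate encoder and decoder engines. A rough sketch of that workflow (the `--model_name` value and directory names are illustrative, and I omit the batch-size and plugin flags the README passes; the exact interface varies across TensorRT-LLM versions):

```bash
# From TensorRT-LLM/examples/whisper: convert the openai-whisper
# checkpoint into a TRT-LLM checkpoint (names are illustrative).
python3 convert_checkpoint.py \
    --model_name tiny.en \
    --output_dir tiny_en_weights_float16

# Build the encoder and decoder engines from the converted weights.
trtllm-build --checkpoint_dir tiny_en_weights_float16/encoder \
             --output_dir tiny_en_float16/encoder
trtllm-build --checkpoint_dir tiny_en_weights_float16/decoder \
             --output_dir tiny_en_float16/decoder
```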
The results I got are significantly worse than they should be. The results below are with weights in float16:
Filtering models by id: tiny.en
Results per dataset:
whisper_tiny.e/whisper_tiny.en | hf-audio-esb-datasets-test-only-sorted_ami_test: WER = 99.98 %, RTFx = 658.40
whisper_tiny.e/whisper_tiny.en | hf-audio-esb-datasets-test-only-sorted_earnings22_test: WER = 99.69 %, RTFx = 1719.86
whisper_tiny.e/whisper_tiny.en | hf-audio-esb-datasets-test-only-sorted_gigaspeech_test: WER = 100.08 %, RTFx = 1538.20
whisper_tiny.e/whisper_tiny.en | hf-audio-esb-datasets-test-only-sorted_librispeech_test.clea: WER = 99.49 %, RTFx = 1760.50
whisper_tiny.e/whisper_tiny.en | hf-audio-esb-datasets-test-only-sorted_librispeech_test.other: WER = 99.77 %, RTFx = 1570.98
whisper_tiny.e/whisper_tiny.en | hf-audio-esb-datasets-test-only-sorted_spgispeech_test: WER = 99.93 %, RTFx = 2171.81
whisper_tiny.e/whisper_tiny.en | hf-audio-esb-datasets-test-only-sorted_tedlium_test: WER = 99.71 %, RTFx = 1926.79
whisper_tiny.e/whisper_tiny.en | hf-audio-esb-datasets-test-only-sorted_voxpopuli_test: WER = 99.78 %, RTFx = 2215.32
whisper_small.e/whisper_small.en | hf-audio-esb-datasets-test-only-sorted_ami_test: WER = 21.67 %, RTFx = 308.40
whisper_small.e/whisper_small.en | hf-audio-esb-datasets-test-only-sorted_earnings22_test: WER = 19.81 %, RTFx = 689.10
whisper_small.e/whisper_small.en | hf-audio-esb-datasets-test-only-sorted_gigaspeech_test: WER = 14.89 %, RTFx = 654.89
whisper_small.e/whisper_small.en | hf-audio-esb-datasets-test-only-sorted_librispeech_test.clea: WER = 4.60 %, RTFx = 739.03
whisper_small.e/whisper_small.en | hf-audio-esb-datasets-test-only-sorted_librispeech_test.other: WER = 8.06 %, RTFx = 669.27
whisper_small.e/whisper_small.en | hf-audio-esb-datasets-test-only-sorted_spgispeech_test: WER = 11.02 %, RTFx = 866.02
whisper_small.e/whisper_small.en | hf-audio-esb-datasets-test-only-sorted_tedlium_test: WER = 9.75 %, RTFx = 795.86
whisper_small.e/whisper_small.en | hf-audio-esb-datasets-test-only-sorted_voxpopuli_test: WER = 9.61 %, RTFx = 930.41
whisper_large/v2 | hf-audio-esb-datasets-test-only-sorted_ami_test: WER = 21.54 %, RTFx = 93.02
whisper_large/v2 | hf-audio-esb-datasets-test-only-sorted_earnings22_test: WER = 20.38 %, RTFx = 199.15
whisper_large/v2 | hf-audio-esb-datasets-test-only-sorted_gigaspeech_test: WER = 12.62 %, RTFx = 176.35
whisper_large/v2 | hf-audio-esb-datasets-test-only-sorted_librispeech_test.clea: WER = 4.88 %, RTFx = 200.96
whisper_large/v2 | hf-audio-esb-datasets-test-only-sorted_librispeech_test.other: WER = 7.38 %, RTFx = 184.10
whisper_large/v2 | hf-audio-esb-datasets-test-only-sorted_spgispeech_test: WER = 9.83 %, RTFx = 230.22
whisper_large/v2 | hf-audio-esb-datasets-test-only-sorted_tedlium_test: WER = 14.30 %, RTFx = 201.43
whisper_large/v2 | hf-audio-esb-datasets-test-only-sorted_voxpopuli_test: WER = 9.20 %, RTFx = 249.95
Composite Results:
whisper_tiny.e/whisper_tiny.en: WER = 99.80 %
whisper_tiny.e/whisper_tiny.en: RTFx = 1760.43
whisper_small.e/whisper_small.en: WER = 12.43 %
whisper_small.e/whisper_small.en: RTFx = 732.23
whisper_large/v2: WER = 12.52 %
whisper_large/v2: RTFx = 198.94
I also tried bfloat16, but it did not improve the results.
Are there specific considerations when converting/evaluating smaller models (e.g., tiny, base, small) for TensorRT-LLM?
cc @Vaibhavs10
@Deep-unlearning sorry, I hardcoded the large-v3 tokenizer here: https://github.com/huggingface/open_asr_leaderboard/blob/main/tensorrtllm/run_eval.py#L29. However, it should be selected per model, as in https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/whisper/run.py#L346-L356
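For anyone else hitting this: English-only checkpoints (tiny.en, base.en, small.en, medium.en) use the GPT-2 vocabulary rather than the multilingual one, and large-v3 supports 100 languages versus 99 for earlier multilingual checkpoints, so the tokenizer has to be chosen per model. A minimal sketch of the idea using openai-whisper's `get_tokenizer` (the leaderboard script uses its own tiktoken helper, so `tokenizer_for` is a hypothetical name, not the actual run_eval.py code):

```python
from whisper.tokenizer import get_tokenizer  # openai-whisper package

def tokenizer_for(model_name: str):
    """Pick the tokenizer matching the Whisper checkpoint instead of
    hardcoding the large-v3 one (illustrative helper)."""
    multilingual = not model_name.endswith(".en")
    # large-v3 supports 100 languages; earlier multilingual checkpoints
    # have 99. English-only models fall back to the GPT-2 vocabulary.
    num_languages = 100 if "large-v3" in model_name else 99
    return get_tokenizer(
        multilingual,
        num_languages=num_languages,
        language="en",
        task="transcribe",
    )

# tiny.en must be decoded with the English (gpt2) tokenizer; decoding
# its output ids with the multilingual large-v3 tokenizer would explain
# the ~100% WER reported above.
tokenizer = tokenizer_for("tiny.en")
```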
Fix #46.