SYSTRAN/faster-whisper

Updated benchmarks please!

Opened this issue · 11 comments

I noticed that the latest benchmark from whisper.cpp is from February 2, 2023, and presumably the others are equally outdated. Can someone please do updated benchmarks for the three backends tested? A year and a half old seems too much, and it is kind of insulting to this repository, as well as the others, given the hard work that everyone has done to improve the code base. Thanks!

I guess there is no better candidate to do this than you.

x86Gr commented

@BBC-Esq if you could do that, it would be greatly appreciated

It's been over a year since I benchmarked the original whisper implementation. Suppose I could revisit that old code...

@x86Gr I thought I had posted this on faster-whisper, or maybe it was on @MahmoudAshraf97's fork...can't remember where, but I did do this benchmarking several months ago...

[Sample benchmark image]

I could redo something like this...or if faster-whisper wanted something a little less flashy, I could limit it to 2-3 markdown tables like it currently is.

Before I dedicate time to this, let me know your thoughts.

I propose benching:

  • vanilla openai whisper
  • faster-whisper
  • transformers (which insanely-fast-whisper uses)
  • WhisperS2T
  • WhisperX
  • whisper.cpp
  • maybe tensorrt if I can get it running

Let me know what you think of the following plan:

  • bench 1-2 regular whisper and their corresponding distil-whisper versions
  • bench on cuda - float32 and float16
  • bench on cpu - float32
  • I don't think whisper.cpp has int8 support so that's why I'm shying away from int8 in this table.
  • bench beam size 1 and 5...but I'd need to verify that whisper.cpp has something like a beam size parameter
  • don't use flash attention 2
  • no batching
  • include word error rate

We could have a second table that benches int8 and the other quants just for faster-whisper...but honestly it's a little much for me to compare it with other backends. I'd have to get into bitsandbytes + transformers (8-bit and 4-bit), etc. That's why I suggest just float32 and float16 for the table comparing the various implementations.
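To make the proposed matrix concrete, here is a minimal sketch (Python; the backend and model names are just labels for illustration, not final choices) of how the run configurations could be enumerated:

```python
# Enumerate the proposed benchmark matrix; names are illustrative placeholders.
from itertools import product

backends = ["openai-whisper", "faster-whisper", "transformers", "whisper.cpp"]
models = ["large-v3", "distil-large-v3"]
device_precisions = [("cuda", "float16"), ("cuda", "float32"), ("cpu", "float32")]
beam_sizes = [1, 5]

configs = [
    {"backend": b, "model": m, "device": d, "precision": p, "beam_size": bs}
    for b, m, (d, p), bs in product(backends, models, device_precisions, beam_sizes)
]
print(f"{len(configs)} runs, first one: {configs[0]}")
```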

I think we can exclude WhisperS2T and WhisperX because they are essentially the same as faster-whisper; all three use the CT2 backend, and the differences will not be worth the hassle. The TensorRT example code works perfectly for building, and I have inference code for it; you can get in touch with me if you need any help with that.

An important aspect of benchmarking Whisper in general is that speed is tied to WER: for example, if the model misses half the tokens, it will generate twice as fast, so a speed figure without a WER figure is misleading.
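For reference, computing WER itself is straightforward with the jiwer package; the sketch below uses a deliberately simple normalizer, which is an assumption on my part, since in a real benchmark every backend's output should go through the same normalization:

```python
# Minimal WER sketch using the jiwer package (pip install jiwer).
# The normalization is deliberately simple; a real benchmark should agree on
# one normalizer (e.g. Whisper's English normalizer) for every backend.
import string
import jiwer

def normalize(text: str) -> str:
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

reference = "The quick brown fox jumps over the lazy dog."
hypothesis = "the quick brown fox jumped over the lazy dog"

error_rate = jiwer.wer(normalize(reference), normalize(hypothesis))
print(f"WER: {error_rate:.3f}")  # 1 substitution / 9 reference words = ~0.111
```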

I suggest benchmarking short-form and long-form datasets to cover all WER use cases (LibriSpeech and YouTube Commons, respectively).
On GPU, we should use beam sizes 1 and 5, because FW has an advantage here in memory usage and speed; this will demonstrate that you can get superior WER with the same memory and compute. There is no need for quantization because the largest Whisper model is already relatively small.
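Pulling the two evaluation sets could look like the sketch below, using the Hugging Face datasets library. The LibriSpeech ID is the standard one (newer datasets versions may need the parquet mirror or trust_remote_code); the YouTube Commons ID is a placeholder to be verified on the Hub:

```python
# Sketch of loading the two evaluation sets with Hugging Face `datasets`
# (pip install datasets). Depending on your datasets version, loading
# "librispeech_asr" may require trust_remote_code=True or the parquet mirror.
from datasets import load_dataset

# Short-form: LibriSpeech test-clean (utterances of a few seconds each)
librispeech = load_dataset("librispeech_asr", "clean", split="test")

# Long-form: YouTube Commons (placeholder dataset ID, verify on the Hub)
# yt_commons = load_dataset("PleIAs/YouTube-Commons", split="train", streaming=True)

sample = librispeech[0]
print(sample["text"])
print(sample["audio"]["sampling_rate"])  # audio is decoded on access
```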

I'd suggest the following:

  1. bench librispeech and YT Commons
  2. no need for float32, the weights are originally in float16 so nothing to gain here
  3. use beam size 1 and 5
  4. batching in all cases; batched inference is superior to sequential inference in almost all backends in both WER and speed. Use a different batch size for each backend to satisfy a memory condition, for example, increasing the batch size until usage reaches a maximum of 8GB; some backends will have lower memory usage, which allows larger batches and therefore more speed. Alternatively, use the same batch size in all of them and demonstrate the difference in memory usage (see the sketch after this list)
  5. Use flash attention if possible; we want to show each backend at its best. This is not a scientific benchmark where we need to unify the conditions; we want actual use cases
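For the GPU case, here is a hedged sketch (Python; the audio path, model name, and batch size are placeholders) of one batched faster-whisper run at float16 with beam size 5, sampling device VRAM through NVML, since CTranslate2 allocates memory outside torch's allocator. BatchedInferencePipeline ships with recent faster-whisper releases; adjust the call if your version differs.

```python
# One batched GPU run with faster-whisper, timing the transcription and
# reporting device VRAM in use via NVML (pip install nvidia-ml-py).
import time
import pynvml
from faster_whisper import WhisperModel, BatchedInferencePipeline

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
pipeline = BatchedInferencePipeline(model=model)

start = time.perf_counter()
segments, info = pipeline.transcribe("audio.wav", beam_size=5, batch_size=16)
text = " ".join(s.text for s in segments)  # segments is a generator; consume it
elapsed = time.perf_counter() - start

used_gb = pynvml.nvmlDeviceGetMemoryInfo(gpu).used / 1024**3
print(f"{elapsed:.1f}s for {info.duration:.1f}s of audio "
      f"(RTF {elapsed / info.duration:.3f}), device VRAM in use: {used_gb:.2f} GB")
```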

As for CPU, I guess we should use quants here, at least in whisper.cpp and FW, because the CPU use case is mostly resource constrained
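And a minimal sketch of the CPU-quantized case, assuming faster-whisper's int8 compute type (the whisper.cpp counterpart would use one of its quantized GGML/GGUF models); the audio path and thread count are placeholders:

```python
# CPU run with int8-quantized weights via faster-whisper / CTranslate2.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cpu", compute_type="int8", cpu_threads=8)
segments, info = model.transcribe("audio.wav", beam_size=1)
print(" ".join(s.text for s in segments))
```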

I'm willing to help with running and writing code, but I have an 8GB GPU, and that might be a hassle to run different setups with

I have access to an A100 80GB card, and if you want me to run the tests, I'm willing to help you run the code.

I've sort of self-branded myself as a tribune of benchmarks. I've posted (what I think are) helpful graphs on the repositories for WhisperS2T, WhisperX, whisper.cpp, etc. to help the developers and users get a clear idea of the speed. However, I've always been careful to add disclaimers out of respect for the hard work people put into their respective libraries...for example.

"These graphs represent speed in tokens per second and vram usage, which are but two factors to consider when deciding whether to use a particular library. For example, although ctranslate2/faster-whisper has a faster implementation of Whisper models, you may very well decide to use the transformers implementation due to the "relatively" large ecosystem and community. In contrast, ctranslate2/faster-whisper has fase lower vram usage when using higher beam sizes, which, in turn, reduces the word error rate. Thus if WER is of paramount importance you may want to use it instead of the broader transformers community."

This is just one example of how I've tried to couch my benchmarks. Overall, as long as the benchmarks are done tastefully and accurately, I'd love to participate, and I wouldn't object to omitting WhisperX or WhisperS2T, although I'd still continue to do my own comparisons of all the implementations that I feel are true competitors. What I absolutely DO NOT want to participate in is what the "insanely-fast-whisper" repository did, which is put up annoying bullet train pictures and misleading comparisons to make itself look good - "I'm the fastest, look at me look at me!" If you want a full history of my gripes regarding that, you can sift through the "issues" section on that repo for my posts...which, in hindsight, were probably somewhat overly emotional, and I'm surprised they didn't get removed entirely.

With that being said,

Regarding @MahmoudAshraf97 's comments...

"TensorRT example code..."

  • Awesome, as long as you don't mind walking me through some of it I'd love to learn!

"...for example, if the model misses half the tokens, it will generate twice as fast, so a speed figure without WER figure is misleading."

  • Same; help me understand how to test this as well, as I've only tested VRAM and speed thus far!

"...short form and long form..."

  • Agreed!

"...beam size 1 and 5..."

  • Agreed.

"...the weights are originally in float16..."

  • In response to my specific request on huggingface, @sanchit-gandhi added float32 versions, which, technically, is the original precision I believe. You can see an example of an original float32 that I converted here: https://huggingface.co/ctranslate2-4you/whisper-large-v3-ct2-float32
  • However, if there's only a .01% WER difference, then I'm okay with only testing float16, for example. If it's 1-5% or more (you get the picture), then I think float32 is worth testing. I haven't tested WER, so perhaps the difference with float16 is so minuscule that it doesn't matter...let me know your expert opinion as someone more familiar with it. (A conversion sketch for producing both precisions follows.)
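If we do want to compare the two precisions head to head, something like the following sketch would produce matched float32 and float16 CTranslate2 conversions of the same checkpoint (output directory names are placeholders):

```python
# Convert the same Hugging Face checkpoint to float32 and float16 CTranslate2
# models so the two precisions can be WER-compared under identical settings.
from ctranslate2.converters import TransformersConverter

converter = TransformersConverter("openai/whisper-large-v3")
converter.convert("whisper-large-v3-ct2-float32", quantization="float32")
converter.convert("whisper-large-v3-ct2-float16", quantization="float16")
```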

"...batching in all cases"

  • Yes to batching, but would like to include non-batching as well. Reasons: (1) Wouldn't the CPU tests all be without batching? (2) Might some users not want to use batching? (3) It could serve as a baseline?

..."until the usage reaches a maximum of 8GB"

  • Like this idea. At a certain point larger batches do not increase tokens/second while VRAM continues to increase dramatically. My testing has determined that this is the point where "core" usage is maxed.
  • I'd suggest taking it a step further and including a chart like the one below that I created for WhisperS2T. This allows users to scan across the 8GB line, for example, to see which batch size is best for them based on the size of the Whisper model (a data-collection sketch follows the chart).
[Sample chart: batch size vs. VRAM usage for each Whisper model size (WhisperS2T)]
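The data behind a chart like that can be collected with a simple sweep; this is a rough sketch (model list, audio file, and batch sizes are placeholders) that records throughput as seconds of audio processed per wall-clock second, plus VRAM in use, for each model/batch-size pair:

```python
# Sweep batch sizes per model and record throughput and VRAM in use.
# Assumes the batched faster-whisper pipeline and NVML sampling from the
# earlier sketch; stop growing the batch once throughput flattens.
import time
import pynvml
from faster_whisper import WhisperModel, BatchedInferencePipeline

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

results = []
for model_name in ["small", "medium", "large-v3"]:
    model = WhisperModel(model_name, device="cuda", compute_type="float16")
    pipeline = BatchedInferencePipeline(model=model)
    for batch_size in [1, 2, 4, 8, 16, 32]:
        start = time.perf_counter()
        segments, info = pipeline.transcribe("audio.wav", batch_size=batch_size)
        _ = [s for s in segments]  # consume the generator
        elapsed = time.perf_counter() - start
        vram_gb = pynvml.nvmlDeviceGetMemoryInfo(gpu).used / 1024**3
        results.append((model_name, batch_size, info.duration / elapsed, vram_gb))
        print(results[-1])
    del model, pipeline  # release the model before loading the next size
```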

"...Use flash attention if possible..."

  • Yes...but I've had trouble getting Flash Attention 2 working on Windows while using the transformers library, so I'd need some help there. Also, are we testing on Windows or Linux? This is the only wheel I've found that works on Windows...
  • Either way, we should test with the same versions of torch, FA2, and the other libraries if possible (a quick environment-check sketch is below).
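A quick environment check along these lines could be run on every machine before benchmarking so results stay comparable. The is_flash_attn_2_available helper exists in recent transformers versions; the import is guarded in case of older releases:

```python
# Print the library versions and whether transformers can see Flash Attention 2.
import torch
import transformers

print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("transformers:", transformers.__version__)

try:
    from transformers.utils import is_flash_attn_2_available
    print("Flash Attention 2 available:", is_flash_attn_2_available())
except ImportError:
    print("This transformers version has no is_flash_attn_2_available() helper.")
```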

"...we want actual use cases..."

  • Agreed. The latest Steam hardware survey shows the RTX 3060 as the most used card. I only have an RTX 4090...
  • Respectfully, I'd like to test on a consumer GPU and not an A100/H100 because that's what most users would find helpful...
  • As mentioned above, the bottleneck is now CUDA core saturation rather than VRAM, due to significant improvements to faster-whisper and the other implementations across the board. Because of this, the optimal batch size for a 3060 would be far smaller than for a 4090. See the chart below comparing CUDA core counts. We'd just need to ensure all testing was done on the same GPU:
[Sample chart: CUDA core count comparison across GPUs]

..."as for CPU, I guess we should use quants here, at least in whisper.cpp and FW..."

  • Someone would have to walk me through testing on CPU, but no objection.

I'm willing to contribute an RTX 4090 and a 13900K, share scripts, and work with you all. Thanks.

For visibility: https://github.com/huggingface/open_asr_leaderboard

Perhaps use the same benchmarking scripts/data that they do there?

I just skimmed the repo, and it actually looks pretty cool; if we wanted to use that or some variation of it, I'm down.

EDIT - on second thought, I just re-read the repo and this is what they say to use:

"Note: All evaluations were run using an NVIDIA A100-SXM4-80GB GPU, with NVIDIA driver 560.28.03, CUDA 12.6, and PyTorch 2.4.0."

I think it'd be better to come up with a similar approach but not strictly use this repo...and then test a realistic graphics card.

My advice would be to use it as much as possible if there is not a good reason for it. If you want to extend this with other GPUs or even CPUs, open an issue and talk to the authors; it seems more like a configuration issue. I actually contributed recently to this repo to fix a few CTranslate2 topics with some PRs, and they're very welcoming.

"My advice would be to use it as much as possible if there is not a good reason for it. If you want to extend this with other GPUs or even CPUs, open an issue and talk to the authors; it seems more like a configuration issue. I actually contributed recently to this repo to fix a few CTranslate2 topics with some PRs, and they're very welcoming."

Did you mean to say "...unless there is no good reason for it"?

@MahmoudAshraf97 thoughts on my unreasonably long-ass message and/or what @jordimas is suggesting? The original idea was to update the benchmarks in faster-whisper's readme file...I personally don't feel any allegiance to using @sanchit-gandhi's benchmarking rubric to do that...although I did spend a few hours yesterday understanding it...and it does seem like something I might participate in in my free time.

I have a different opinion: the HF benchmark is useful for comparing WER across multiple models, but we are trying to benchmark the same model on different engines, so we can write much simpler code that focuses on what we are trying to measure. I'll open a PR in a couple of hours in your private repo with my progress so far.
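For what it's worth, the "much simpler code" could look roughly like this sketch: each engine is wrapped in a transcribe(audio_path) -> text callable, and the harness reports wall-clock time, real-time factor, and WER against a reference transcript. The wrapper shown and the audio/reference inputs are placeholders; model loading is included in the timing here and would probably be excluded in the real thing.

```python
# Tiny per-engine harness: time one transcription, report RTF and WER.
import time
import jiwer

def bench(name, transcribe, audio_path, audio_seconds, reference):
    start = time.perf_counter()
    hypothesis = transcribe(audio_path)
    elapsed = time.perf_counter() - start
    print(f"{name:>16}: {elapsed:6.1f}s  RTF {elapsed / audio_seconds:.3f}  "
          f"WER {jiwer.wer(reference.lower(), hypothesis.lower()):.3f}")

def faster_whisper_transcribe(audio_path):
    # Example wrapper; one of these would be written per backend.
    from faster_whisper import WhisperModel
    model = WhisperModel("large-v3", device="cuda", compute_type="float16")
    segments, _ = model.transcribe(audio_path, beam_size=5)
    return " ".join(s.text for s in segments)

# bench("faster-whisper", faster_whisper_transcribe, "audio.wav", 600.0, reference_text)
```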