tazz4843/whisper-rs

Parallel Processing and Performance

chriskyndrid opened this issue · 4 comments

Admittedly, this is more of a question than an issue per se...

It appears from reviewing issues like this and this that some work was done to create a separation between the context and the state, specifically motivated by an inquiry about parallel processing. Reviewing this commit, it appears functions like whisper_init_no_state, whisper_init_state, etc. were added to accommodate this separation. It also appears your bindings are using these as well.

From experimenting with this crate, I don't find any appreciable (only marginal) benefit from running multiple States in parallel against the same context. In my environment I have the cuda feature flag enabled, and I'm using rayon to make parallel calls that each create a new State. Before this I'm working with streamed audio (read from videos by hooking gstreamer as it transcodes). I'm specifically calculating a desired byte size to get as close as possible to a target number of seconds before I initialize the speech-to-text recognition, as I've found that the larger the chunk (say, close to 30 seconds), the faster the recognition generally is, rather than many smaller chunks of audio. Coupled with some fixes for deduplication (due to overlapping audio chunks), etc., this is working very well.
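As a rough illustration of that byte-size calculation (assuming 16 kHz mono f32 samples and a hypothetical helper name, not my actual pipeline code):

// Hypothetical helper: compute how many bytes of f32 PCM correspond to a
// target duration, so accumulated chunks land near ~30 seconds.
const SAMPLE_RATE_HZ: usize = 16_000;                       // 16 kHz mono
const BYTES_PER_SAMPLE: usize = std::mem::size_of::<f32>(); // 4 bytes

fn target_chunk_bytes(target_secs: usize) -> usize {
    SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * target_secs
}

// e.g. accumulate streamed audio until the buffer reaches this size,
// then hand the chunk off for recognition:
// let threshold = target_chunk_bytes(30); // ~1.92 MB of f32 PCM at 16 kHz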

What I have found is that creating (at the expense of memory) additional contexts like this:

use once_cell::sync::Lazy;
use whisper_rs::WhisperContext;

// Pool of preloaded contexts, shared across threads.
pub static WHISPER_CONTEXT: Lazy<Vec<WhisperContext>> = Lazy::new(|| {
    let model_path = "./include/ai/whisper/model/ggml-large.bin";
    let num_instances = 2; // Change this to how many instances you want

    (0..num_instances)
        .map(|_| WhisperContext::new(model_path).expect("Failed to load whisper model"))
        .collect::<Vec<WhisperContext>>()
});

and then fetching one in the thread that is performing the recognition like this:

// WHISPER_CONTEXT_INDEX is an AtomicUsize; round-robin over the pool.
let index = WHISPER_CONTEXT_INDEX.fetch_add(1, AOrdering::SeqCst) % WHISPER_CONTEXT.len();
let whisper_context = &WHISPER_CONTEXT[index];

Results in improving the performance by nearly double when going from a single context to two. In my case, fiddling with num_threads, block size, etc. makes little difference (beyond, say, num_threads=4), while creating multiple contexts makes all the difference.
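For reference, each worker then runs inference on the borrowed context roughly like the sketch below. This is a simplified outline rather than my actual code; it assumes the usual whisper-rs calls (create_state, FullParams, full) plus an AtomicUsize counter for the round-robin index:

use std::sync::atomic::{AtomicUsize, Ordering as AOrdering};
use whisper_rs::{FullParams, SamplingStrategy};

// Assumed definition of the round-robin counter referenced above.
static WHISPER_CONTEXT_INDEX: AtomicUsize = AtomicUsize::new(0);

// `audio` is 16 kHz mono f32 PCM for one ~30 second chunk.
fn transcribe_chunk(audio: &[f32]) -> Result<String, whisper_rs::WhisperError> {
    let index = WHISPER_CONTEXT_INDEX.fetch_add(1, AOrdering::SeqCst) % WHISPER_CONTEXT.len();
    let ctx = &WHISPER_CONTEXT[index];

    // Each thread gets its own state; the context itself is shared.
    let mut state = ctx.create_state()?;
    let mut params = FullParams::new(SamplingStrategy::Greedy { best_of: 1 });
    params.set_n_threads(4);

    state.full(params, audio)?;

    let mut text = String::new();
    for i in 0..state.full_n_segments()? {
        text.push_str(&state.full_get_segment_text(i)?);
    }
    Ok(text)
}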

Have you done any experimentation on this? Am I misinterpreting the commits in whisper.cpp regarding sharing a single context across multiple threads?

Thanks for any info you might have!

As far as I know, the original reason states were created was to save memory: you don't have to load a new model for every user who wants to use it at the same time. I have not done any experiments like this.

I'm specifically calculating a desired byte size to get as close as possible to a target number of seconds before I initialize the speech-to-text recognition, as I've found that the larger the chunk (say, close to 30 seconds), the faster the recognition generally is, rather than many smaller chunks of audio.

whisper.cpp (and the original whisper implementation) both pad audio to 30 seconds, which is a limitation of the underlying model if I remember correctly.

What I'm wondering here is how multiple contexts improve performance so drastically for you. It could be some sort of hardware limitation that adding contexts bypasses, but I haven't even thought of possibly using multiple contexts over states, as most of the places I use whisper are memory bound. If you could write up some sort of minimal example, I could dive into this a bit deeper.

Thanks, good to know on the padding side of things. I'll do some more experimentation on it; I was surprised it was so much faster (on my machine) to create a second context, as that is something I wanted to avoid. If it continues to be reliably reproducible, I'll cook up a sample program and post it so you can see if you get the same results.

Any updates here?

@tazz4843, sorry for the delay. I haven't had a chance to work up a separate example that mimics the workflow in my main application. At least in my case, I DO continue to see a performance improvement by running parallel contexts. My dev environment:

  1. Dev machine: CUDA enabled with whisper, running on an RTX 4090, with a 24-core CPU and 128 GB of RAM.
  2. Producer/consumer threads reading and processing audio segments pulled from a gstreamer fakesink over a crossbeam channel, running denoise, VAD, etc. operations prior to submitting chunks for inference. The engine supports VOSK or Whisper depending on config.
  3. The inference threads run inference in parallel across however many contexts have been defined (rough sketch below).
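
As a very rough sketch of that layout (placeholder channel capacity, chunk type, and worker count; not my actual code):

use crossbeam_channel::bounded;
use std::thread;

fn main() {
    // Audio chunks (16 kHz mono f32) flow from the gstreamer/denoise/VAD
    // stages into a bounded channel.
    let (tx, rx) = bounded::<Vec<f32>>(8);

    let producer = thread::spawn(move || {
        // Placeholder: the real producer pulls from a gstreamer fakesink.
        for _ in 0..4 {
            tx.send(vec![0.0f32; 16_000 * 30]).unwrap();
        }
        // Dropping tx here closes the channel and lets the workers exit.
    });

    // One inference worker per preloaded context in WHISPER_CONTEXT.
    let workers: Vec<_> = (0..2)
        .map(|_| {
            let rx = rx.clone();
            thread::spawn(move || {
                while let Ok(chunk) = rx.recv() {
                    // Borrow a context round-robin and transcribe, as in the
                    // transcribe_chunk sketch earlier in this thread.
                    let _ = chunk.len();
                }
            })
        })
        .collect();

    producer.join().unwrap();
    for w in workers {
        w.join().unwrap();
    }
}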

I'll go ahead and close the ticket for now. If I get a chance to create an isolated example later I'll post it.