bug: transcribing with medium model
b0xtch opened this issue · 4 comments
OS: macOS Ventura
Transcription works with the tiny model, but with the medium model you get a buffer size error. Perhaps we could do chunking.
Running `target/release/whisper audio.wav medium`
thread 'main' panicked at 'wgpu error: Validation Error
Caused by:
In Device::create_bind_group
Buffer binding 0 range 212439040 exceeds `max_*_buffer_binding_size` limit 134217728
', /Users/botch/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.17.0/src/backend/direct.rs:3056:5
stack backtrace:
0: rust_begin_unwind
at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/panicking.rs:578:5
1: core::panicking::panic_fmt
at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/panicking.rs:67:14
2: core::ops::function::Fn::call
3: <wgpu::backend::direct::Context as wgpu::context::Context>::device_create_bind_group
4: <T as wgpu::context::DynContext>::device_create_bind_group
5: wgpu::Device::create_bind_group
6: burn_wgpu::context::base::Context::execute
7: burn_wgpu::kernel::index::select::select
8: burn_tensor::tensor::ops::modules::base::ModuleOps::embedding
9: whisper::model::Whisper<B>::forward_decoder
10: whisper::main
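For context on the numbers in the panic: 134,217,728 bytes is wgpu's default `max_storage_buffer_binding_size` limit (128 MiB), and 212,439,040 bytes is exactly 51,865 × 1,024 × 4, i.e. the medium model's full token-embedding table (multilingual vocab × model width × `f32`), which the `embedding`/`select` kernel in the backtrace binds as a single buffer. This is an inference from the sizes, not confirmed from the kernel source, but the arithmetic checks out:

```rust
fn main() {
    // Whisper medium: multilingual vocab of 51,865 tokens, model width 1,024.
    let vocab_size: u64 = 51_865;
    let d_model: u64 = 1_024;
    let bytes_per_f32: u64 = 4;

    // Size of the embedding table if bound as one storage buffer.
    let binding_size = vocab_size * d_model * bytes_per_f32;
    assert_eq!(binding_size, 212_439_040); // matches the panic message

    // wgpu's default max_storage_buffer_binding_size is 128 MiB.
    let limit: u64 = 128 * 1024 * 1024;
    assert_eq!(limit, 134_217_728); // matches the limit in the panic message
    assert!(binding_size > limit);
}
```

So the medium model trips the limit on the very first embedding lookup, while the tiny model's much smaller table fits under 128 MiB.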
Update
Using a six-minute audio file with the tiny model produces the same issue.
Chunking is the next planned feature. Right now it clips audio to around the first 30 seconds for the encoder, but the decoder sequence length isn't limited, so it will overflow if it doesn't detect the end of the audio by the 30-second mark.
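The fixed-window chunking described above can be sketched roughly like this (a minimal stdlib-only illustration; `chunk_audio` and the constants are illustrative names, not the repo's actual API):

```rust
// Whisper expects 16 kHz mono audio, and its encoder context is 30 seconds.
const SAMPLE_RATE: usize = 16_000;
const CHUNK_SECS: usize = 30;

/// Split a mono PCM buffer into 30-second windows; the last window may be shorter.
fn chunk_audio(samples: &[f32]) -> Vec<&[f32]> {
    samples.chunks(SAMPLE_RATE * CHUNK_SECS).collect()
}

fn main() {
    // A six-minute file, like the one in the update above.
    let six_minutes = vec![0.0f32; SAMPLE_RATE * 360];
    let chunks = chunk_audio(&six_minutes);
    assert_eq!(chunks.len(), 12); // 360 s / 30 s per chunk
    assert_eq!(chunks[0].len(), SAMPLE_RATE * CHUNK_SECS);
}
```

Each window is then encoded and decoded independently, which bounds the decoder sequence length per chunk instead of letting it grow with the whole file.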
Rudimentary chunking is now implemented, so your long audio files should work, although there is some minor transcription inaccuracy around the chunk edges. I tried feeding the last few tokens of the previous chunk into Whisper to remedy the edge issues, but Whisper then repeated itself severely and stopped predicting the end of chunks, so I had to revert that change. Any ideas why Whisper is so finicky when exposed to tokens from the previous chunk?
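For reference, the reference OpenAI implementation conditions on the previous chunk by prepending it behind a dedicated `<|startofprev|>` token and capping the prompt at half the decoder context, rather than splicing raw tokens before `<|startoftranscript|>`. A hedged sketch of that prompt layout (the token ids shown are the multilingual tokenizer's, quoted from memory; verify against the actual tokenizer):

```rust
// Special token ids from the multilingual Whisper tokenizer (verify these).
const SOT_PREV: u32 = 50_361; // <|startofprev|>
const SOT: u32 = 50_258;      // <|startoftranscript|>

/// Build a decoder prompt for a chunk, optionally conditioning on the tail
/// of the previous chunk's tokens. `max_prompt` caps the carried-over tail
/// (OpenAI uses half the decoder context, 224 tokens).
fn build_prompt(prev_tokens: &[u32], max_prompt: usize) -> Vec<u32> {
    let mut prompt = Vec::new();
    if !prev_tokens.is_empty() {
        prompt.push(SOT_PREV);
        let start = prev_tokens.len().saturating_sub(max_prompt);
        prompt.extend_from_slice(&prev_tokens[start..]);
    }
    prompt.push(SOT);
    prompt
}

fn main() {
    // Previous-chunk tokens 1, 2, 3 with a cap of 2: only the tail survives.
    assert_eq!(build_prompt(&[1, 2, 3], 2), vec![SOT_PREV, 2, 3, SOT]);
    assert_eq!(build_prompt(&[], 2), vec![SOT]);
}
```

If the model was trained to see prior context only behind `<|startofprev|>`, injecting previous tokens after `<|startoftranscript|>` would look like text it must continue verbatim, which could plausibly explain the looping/repetition you saw.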
Mint
> whisper is so finicky when exposed to tokens from the previous chunk
Sounds like Whisper hallucination; it happens in other implementations as well. I would have to dig into this one...