RuntimeError when processing VAD on short audio
gaspardpetit opened this issue · 7 comments
Describe the bug
When processing a short audio clip with fewer than `sample_rate * large_chunk_size` samples (e.g. 480000 for 16 kHz with `large_chunk_size=30`), VAD tries to access samples out of bounds and raises a `RuntimeError: Failed to decode audio.` exception.
Expected behaviour
Processing VAD on a short audio clip should not throw an exception.
To Reproduce
https://colab.research.google.com/drive/1eHvZPpIdMJNzlDQIkFgrkQZSjVyhkHPU#scrollTo=fDQ0rwDUGYXK
Environment Details
Tested against 0.5.16
Relevant Log Output
Exception has occurred: RuntimeError (note: full exception trace is shown but execution is paused at: _run_module_as_main)
Failed to decode audio.
File "/dev/.venv/lib/python3.10/site-packages/torch/_ops.py", line 692, in __call__
return self._op(*args, **kwargs or {})
File "/dev/.venv/lib/python3.10/site-packages/torchaudio/_backend/ffmpeg.py", line 100, in load_audio
return torch.ops.torchaudio.compat_load(src, format, filter, channels_first)
File "/dev/.venv/lib/python3.10/site-packages/torchaudio/_backend/ffmpeg.py", line 336, in load
return load_audio(os.path.normpath(uri), frame_offset, num_frames, normalize, channels_first, format)
File "/dev/.venv/lib/python3.10/site-packages/torchaudio/_backend/utils.py", line 204, in load
return backend.load(uri, frame_offset, num_frames, normalize, channels_first, format, buffer_size)
File "/dev/.venv/lib/python3.10/site-packages/speechbrain/pretrained/interfaces.py", line 1315, in get_speech_prob_file
large_chunk, fs = torchaudio.load(
File "/dev/.venv/lib/python3.10/site-packages/speechbrain/pretrained/interfaces.py", line 2121, in get_speech_segments
prob_chunks = self.get_speech_prob_file(
File "/dev/verbatim/speaker_diarization/diarize_speakers_speechbrain.py", line 43, in diarize_on_silences
boundaries = vad_model.get_speech_segments(audio_file)
File "/dev/verbatim/speech_transcription/transcribe_speech.py", line 271, in execute_for_speaker_and_language
segments = DiarizeSpeakersSpeechBrain().diarize_on_silences(f"out/{speaker}-{language}.wav")
File "/dev/verbatim/speech_transcription/transcribe_speech.py", line 324, in execute
transcription: Transcription = self.execute_for_speaker_and_language(
File "/dev/verbatim/pipeline.py", line 64, in execute
transcript = self.transcript_speech.execute(
File "/dev/tests/test_pipeline.py", line 23, in <module>
Pipeline(context).execute()
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main (Current frame)
return _run_code(code, main_globals, None,
RuntimeError: Failed to decode audio.
Additional Context
No response
This issue had been reported in pyannote (pyannote/pyannote-audio#1515) by someone else, but I did not find it here.
There was also an "off by one" logic issue that I fixed: an audio clip of exactly `sample_rate * large_chunk_size` samples (i.e. exactly 30 s) would also cause the exception to be raised.
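To make the off-by-one concrete, here is a hypothetical simplification (not the literal SpeechBrain code) of a chunked-read loop that uses an exclusive `>` end-of-file test. With it, a file of exactly `sample_rate * large_chunk_size` samples still triggers a second read request starting past the end of the file:

```python
# Hypothetical sketch of the failure mode: the VAD reads the file in
# fixed-size "large chunks", and a `>` (instead of `>=`) end-of-file
# test issues one read request too many at the exact chunk boundary.

SAMPLE_RATE = 16000
LARGE_CHUNK_SIZE = 30  # seconds
CHUNK_LEN = SAMPLE_RATE * LARGE_CHUNK_SIZE  # 480000 samples

def buggy_read_requests(total_samples):
    """Return the (frame_offset, num_frames) pairs the buggy loop requests."""
    requests = []
    offset = 0
    while True:
        requests.append((offset, CHUNK_LEN))
        offset += CHUNK_LEN
        if offset > total_samples:  # `>` instead of `>=`: off by one
            break
    return requests

# A file of exactly 30 s (480000 samples) still yields a second request
# starting at offset 480000 -- one sample past the end of the file.
print(buggy_read_requests(480000))  # [(0, 480000), (480000, 480000)]
```

Whether that out-of-range request raises or returns an empty tensor then depends on the `torchaudio` backend, which would explain the differing behavior observed below.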
Hello @gaspardpetit,
Thanks for opening this issue.
I did run your colab (thanks for the code!) and indeed, we get a `RuntimeError: Failed to decode audio.` However, when I reproduce your issue on my compute cluster, I do not get an error; the output is `tensor([])`.
I will investigate more but I do suspect that the issue is related to google colab... I'll keep you updated.
Considering the error comes from `torchaudio` itself, I suspect that different torchaudio versions/backends exhibit different behavior in these edge cases, rather than the issue stemming from a misconfiguration or an upstream bug per se. In particular, I suspect that unusual `frame_offset`/`num_frames` values passed to `torchaudio.load` could cause something like that.
Interesting.
@gaspardpetit, could you please share with us your pip configuration and your ffmpeg version?
Note: It would be great to have in SpeechBrain a script (e.g. get_config.sh), that automatically fetches all the relevant information for us SB devs. What do you think @asumagic ?
Thanks for looking into this. I doubt this is related to ffmpeg; if you look at the sample on https://colab.research.google.com/drive/1eHvZPpIdMJNzlDQIkFgrkQZSjVyhkHPU#scrollTo=fDQ0rwDUGYXK, it uses raw audio and doesn't seem to depend on ffmpeg.
Additionally, the fix in https://github.com/speechbrain/speechbrain/pull/2335/files consists of checking whether this is the last chunk *before* processing it rather than after. The way it is currently done, the loop always runs twice even when the first chunk would have been the last. There was also an off-by-one from using `>` rather than `>=`. I am more puzzled about why it would work on some versions of torchaudio, since to me the error is clearly in `speechbrain/inference/VAD.py`.