speechbrain/speechbrain

RuntimeError when processing VAD on short audio

gaspardpetit opened this issue · 7 comments

Describe the bug

When processing a short audio clip with fewer than sample_rate * large_chunk_size samples (e.g. 480000 samples for 16 kHz audio with large_chunk_size=30), the VAD tries to access samples out of bounds and raises a RuntimeError: Failed to decode audio. exception.
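For anyone hitting this before a fix lands, a possible workaround sketch is to pad short clips with trailing silence up to one large chunk before handing them to the VAD. Note that pad_waveform is a hypothetical helper written for this issue, not part of SpeechBrain or torchaudio:

```python
import torch

def pad_waveform(wav: torch.Tensor, fs: int, large_chunk_size: float = 30.0) -> torch.Tensor:
    """Pad a (channels, samples) waveform with trailing zeros so it is at
    least fs * large_chunk_size samples long (hypothetical workaround,
    assuming the VAD reads whole large chunks of that size)."""
    min_len = int(fs * large_chunk_size)
    if wav.shape[-1] < min_len:
        # Pad only on the right of the last (time) dimension with silence.
        wav = torch.nn.functional.pad(wav, (0, min_len - wav.shape[-1]))
    return wav
```

The padded tensor can then be saved with torchaudio.save and passed to get_speech_segments as usual; clips already longer than one chunk are returned unchanged.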

Expected behaviour

Processing VAD on a short audio clip should not throw an exception.

To Reproduce

https://colab.research.google.com/drive/1eHvZPpIdMJNzlDQIkFgrkQZSjVyhkHPU#scrollTo=fDQ0rwDUGYXK

Environment Details

Tested against 0.5.16

Relevant Log Output

Exception has occurred: RuntimeError       (note: full exception trace is shown but execution is paused at: _run_module_as_main)
Failed to decode audio.
  File "/dev/.venv/lib/python3.10/site-packages/torch/_ops.py", line 692, in __call__
    return self._op(*args, **kwargs or {})
  File "/dev/.venv/lib/python3.10/site-packages/torchaudio/_backend/ffmpeg.py", line 100, in load_audio
    return torch.ops.torchaudio.compat_load(src, format, filter, channels_first)
  File "/dev/.venv/lib/python3.10/site-packages/torchaudio/_backend/ffmpeg.py", line 336, in load
    return load_audio(os.path.normpath(uri), frame_offset, num_frames, normalize, channels_first, format)
  File "/dev/.venv/lib/python3.10/site-packages/torchaudio/_backend/utils.py", line 204, in load
    return backend.load(uri, frame_offset, num_frames, normalize, channels_first, format, buffer_size)
  File "/dev/.venv/lib/python3.10/site-packages/speechbrain/pretrained/interfaces.py", line 1315, in get_speech_prob_file
    large_chunk, fs = torchaudio.load(
  File "/dev/.venv/lib/python3.10/site-packages/speechbrain/pretrained/interfaces.py", line 2121, in get_speech_segments
    prob_chunks = self.get_speech_prob_file(
  File "/dev/verbatim/speaker_diarization/diarize_speakers_speechbrain.py", line 43, in diarize_on_silences
    boundaries = vad_model.get_speech_segments(audio_file)
  File "/dev/verbatim/speech_transcription/transcribe_speech.py", line 271, in execute_for_speaker_and_language
    segments = DiarizeSpeakersSpeechBrain().diarize_on_silences(f"out/{speaker}-{language}.wav")
  File "/dev/verbatim/speech_transcription/transcribe_speech.py", line 324, in execute
    transcription: Transcription = self.execute_for_speaker_and_language(
  File "/dev/verbatim/pipeline.py", line 64, in execute
    transcript = self.transcript_speech.execute(
  File "/dev/tests/test_pipeline.py", line 23, in <module>
    Pipeline(context).execute()
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main (Current frame)
    return _run_code(code, main_globals, None,
RuntimeError: Failed to decode audio.

Additional Context

No response

This issue was reported against pyannote (pyannote/pyannote-audio#1515) by someone else, but I did not find it reported here.

There was also an "off by one" logic issue that I fixed: an audio clip of exactly sample_rate * large_chunk_size samples (i.e. exactly 30 s) would also cause the exception to be raised.

Hello @gaspardpetit,

Thanks for opening this issue.

I ran your colab (thanks for the code!) and indeed I get the RuntimeError: Failed to decode audio. error. However, when I try to reproduce your issue on my compute cluster, I do not get an error; the output is tensor([]).

I will investigate further, but I do suspect that the issue is related to Google Colab... I'll keep you updated.

Considering the error comes from torchaudio itself, I suspect that different torchaudio versions/backends may behave differently in edge cases, rather than the issue stemming from a misconfiguration or an upstream bug per se.
In particular, I suspect that unusual frame_offset/num_frames values passed to torchaudio.load could cause something like this.

Interesting.

@gaspardpetit, could you please share with us your pip configuration and your ffmpeg version?

Note: it would be great to have a script in SpeechBrain (e.g. get_config.sh) that automatically fetches all the relevant information for us SB devs. What do you think @asumagic?

Thanks for looking into this. I doubt this is related to ffmpeg; if you look at the sample at https://colab.research.google.com/drive/1eHvZPpIdMJNzlDQIkFgrkQZSjVyhkHPU#scrollTo=fDQ0rwDUGYXK, it uses raw audio and does not seem to depend on ffmpeg.

Additionally, the fix in https://github.com/speechbrain/speechbrain/pull/2335/files consists of checking whether the current chunk is the last one before processing it, rather than after. As currently written, the loop always runs twice even when the first chunk would have been the last. There was also an off-by-one error from using > rather than >=. I am more puzzled about why it works on some versions of torchaudio, since to me the error is clearly in speechbrain/inference/VAD.py.
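To illustrate the two fixes described above, here is a minimal sketch of the corrected chunking logic. iter_chunk_bounds is a hypothetical helper written for this issue, not the actual SpeechBrain code: the last-chunk test happens before the chunk is read, and the comparison treats a clip of exactly one chunk as a single, final chunk:

```python
def iter_chunk_bounds(total_samples, chunk_samples):
    """Yield (frame_offset, num_frames, is_last) for each large chunk.

    Hypothetical sketch of the fixed loop: decide whether a chunk is the
    last one *before* reading it, and use 'remaining <= chunk_samples'
    (the >= fix) so audio of exactly chunk_samples is one final chunk.
    """
    offset = 0
    while offset < total_samples:
        remaining = total_samples - offset
        # Checked before processing, so the loop never issues a read
        # past the end of the file.
        is_last = remaining <= chunk_samples
        num_frames = min(chunk_samples, remaining)
        yield offset, num_frames, is_last
        if is_last:
            break
        offset += num_frames
```

With this ordering, a 30 s clip at 16 kHz (exactly 480000 samples with large_chunk_size=30) yields one chunk instead of triggering a second, out-of-bounds read.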