marianne-m/brouhaha-vad

Sliding window at inference time

Closed this issue · 7 comments

I'm afraid the sliding window at inference time will make things a bit confusing for users, as the beginning and end frames are 'missing'.
It would be great if inference could align the output to the input by repeating the first and last frames N times, such that audio_duration_in_ms / nb_output_frames = 16 ms (the brouhaha frame duration).

I don't think this is a thing right now, right?
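
Something like this is what I have in mind (a rough numpy sketch with made-up numbers, not actual brouhaha code):

import numpy as np

preds = np.random.rand(229)                   # pretend model output for a 3.85s file
audio_duration_ms = 3850
target_len = round(audio_duration_ms / 16)    # 16 ms frames -> 241 frames
missing = target_len - len(preds)             # frames 'missing' because of the sliding window
# repeat the first and last predictions to cover the missing edges
aligned = np.pad(preds, (missing // 2, missing - missing // 2), mode='edge')
print(len(aligned))                           # 241, so audio_duration_ms / nb_frames is ~16 ms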

PS: it would be great to add a short description of the output in the README too!

Can you clarify what is missing exactly? Sounds more like a pyannote.audio bug than a Brouhaha bug.

Oh I've just checked and there's no such problem!
We do have (audio_duration_in_ms) / (nb_output_frames) = 16 ms!
I guess audio is padded by default at inference time :) Sorry!

Closing

Hello! I have a similar question and would like to know if I missed a step (which is very likely because I am not used to manipulating audio files).

I ran the command to apply the model (python main.py apply --data_dir path/to/data --out_dir path/to/predictions --model_path models/best/checkpoints/best.ckpt --ext "wav") and checked the results as follows:

import librosa
import numpy as np

out_dir = '/outdir'
audio_dir = '/audiodir'
file = 'xxx'

# duration of the audio file, in seconds
d = librosa.get_duration(filename=f'{audio_dir}/{file}.wav')
# number of frames in the SNR predictions
n = len(np.load(f'{out_dir}/detailed_snr_labels/{file}.npy'))

print(f'duration: {d}s.')
print(f'number of frames: {n}')
print(f'frame duration: {d/n}s')

I get the following output on six different audio files:

duration: 3.85s.
number of frames: 229
frame duration: 0.016812227074235808s

duration: 8.64s.
number of frames: 513
frame duration: 0.016842105263157894s

duration: 4.52s.
number of frames: 269
frame duration: 0.016802973977695167s

duration: 11.52s.
number of frames: 684
frame duration: 0.016842105263157894s

duration: 16.44s.
number of frames: 975
frame duration: 0.016861538461538463s

duration: 6.34s.
number of frames: 377
frame duration: 0.016816976127320953s

I use brouhaha v.0.9.0 and pyannote-audio v.2.1.1. Should I open an issue in the pyannote repo?
Thanks in advance!

Hi Leonie,

Thanks for your message!
Everything looks fine to me: you do obtain the expected frame duration of ~16.8 ms.
For a file of duration 3.85 s, you obtain 3850 / 16.8 ≈ 229 frames.
For a file of duration 6.34 s, you obtain 6340 / 16.8 ≈ 377 frames.

Since you mention you're not used to manipulating audio files, let me try to be a bit clearer.
For each piece of audio (16.8 ms), the model returns whether there is speech, an SNR estimate, and a C50 estimate.
Hence, for a 3.85 s-long audio file, you end up with 229 predictions: the first prediction aligns with your first 16.8 ms-long chunk of audio, the second prediction aligns with your second 16.8 ms-long chunk, and so on.
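
If it helps, here is how you could recover the time span covered by each prediction (a small sketch using the d and n values from your first file):

import numpy as np

d, n = 3.85, 229                      # duration in seconds, number of frames
frame_dur = d / n                     # ~0.0168 s per frame
starts = np.arange(n) * frame_dur     # start time of each prediction
# prediction i covers [starts[i], starts[i] + frame_dur)
print(f'prediction 0 covers 0.000s to {frame_dur:.3f}s')
print(f'prediction 1 covers {starts[1]:.3f}s to {2 * frame_dur:.3f}s')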

Maybe the confusion arises from the fact that our frame duration is approximately 16.8 ms (it is determined by the SincNet architecture), whereas more standard durations are 10 or 20 ms. But it works basically the same way!

Let me know if there's anything that remains unclear!

EDIT: @LeonieBorne I would love to know what kind of data/problem you're trying to tackle with Brouhaha!

Hi Marvin,

Thank you very much for these detailed explanations, it is already much clearer!

If I understood correctly, SincNet analyzes audio chunks of (exactly?) 16.8 ms one after the other, and the final chunk is not analyzed if it is shorter than 16.8 ms, which explains why (duration / number of frames) does not give exactly 16.8 ms. In that case, do you know why duration - (number of frames × 16.8 ms) is not always less than 16.8 ms? For example: 16.44 s - 975 × 16.8 ms = 60.0 ms.
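
Here is the check I ran, using the durations and frame counts from my output above:

durations = [3.85, 8.64, 4.52, 11.52, 16.44, 6.34]
n_frames = [229, 513, 269, 684, 975, 377]
for d, n in zip(durations, n_frames):
    # leftover audio not covered by n frames of 16.8 ms
    print(f'{d}s: {(d - n * 0.0168) * 1000:.1f}ms left over')

The leftover goes up to 60 ms, i.e. more than three frames.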

I ask this question because I am helping to organize a challenge (Unsupervised Adaptation for Speech Enhancement) on CHiME-5 data, and I ran Brouhaha to check the manual transcriptions. To provide the results to the participants of the challenge, I would like to deliver the SNR and VAD results at the same frame rate, and I am not sure how to match the VAD (in the rttm files) to the SNR outputs.
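
Concretely, this is the kind of matching I have in mind (a rough sketch; the rttm path is a guess, and I assume the standard RTTM layout where the 4th and 5th fields are the segment onset and duration in seconds):

import librosa
import numpy as np

file = 'xxx'
snr = np.load(f'/outdir/detailed_snr_labels/{file}.npy')
d = librosa.get_duration(filename=f'/audiodir/{file}.wav')
frame_dur = d / len(snr)                     # ~16.8 ms per frame

# frame-level VAD aligned with the SNR predictions
vad = np.zeros(len(snr), dtype=bool)
with open(f'/outdir/{file}.rttm') as f:      # guessed location of the rttm file
    for line in f:
        fields = line.split()
        onset, dur = float(fields[3]), float(fields[4])
        vad[int(onset / frame_dur):int((onset + dur) / frame_dur) + 1] = True
# vad[i] is True when frame i overlaps a speech segment, matching snr[i]

Is this the right way to go about it?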

@LeonieBorne why don't you drop me an email so we can schedule a quick Zoom call? That would make things easier to solve... Also say hi to @mpariente ;-)

I just did it, thanks!!