marianne-m/brouhaha-vad

about the output of the apply

dalaolili opened this issue · 2 comments

I got "SPEAKER 2b533492-bfa4-49f3-b01a-32546f6044bf_2 1 3.473 4.134 A " by running the command "python brouhaha/main.py apply", but I don't understand the meaning of the columns. Could you explain them?
What's more, how can I read the .npy file?
Sincerely waiting for your reply.

I will try to give some leads, although I did not participate in the development so take my comments with a grain of salt.

For your first question, I suggest you look at the description of the RTTM file format (Annex A, page 12) if you need the exact meaning of each column. As an overview, it should be:

  • SPEAKER : the model identified speech from somebody (all lines here will be SPEAKER)
  • 2b533492-bfa4-49f3-b01a-32546f6044bf_2 : name of the audio file where speech was found
  • 1 : channel (here it should always be 1, as the model works on mono audio)
  • 3.473 : onset of the detected speech, in seconds from the start of the file
  • 4.134 : duration of the detected speech, in seconds
  • A : label of the speaker; it should always be A here, as I think the model is not trained to differentiate between speakers

So your line tells you that speech was detected in file 2b533492-bfa4-49f3-b01a-32546f6044bf_2.wav from 3.473 s to 7.607 s (3.473 + 4.134). The rest is not really relevant.
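As a rough illustration of the column breakdown above, here is a minimal sketch that splits one output line on whitespace and computes the end time. The function name `parse_rttm_line` and the dictionary keys are hypothetical, not part of brouhaha; only the first five fields are read, since those are the ones described above.

```python
def parse_rttm_line(line):
    """Parse the first five whitespace-separated fields of an RTTM line.

    Hypothetical helper for illustration; not part of brouhaha.
    """
    fields = line.split()
    onset = float(fields[3])      # start of the speech segment, in seconds
    duration = float(fields[4])   # length of the speech segment, in seconds
    return {
        "type": fields[0],        # always SPEAKER here
        "file_id": fields[1],     # audio file name without extension
        "channel": int(fields[2]),
        "onset": onset,
        "duration": duration,
        "end": round(onset + duration, 3),  # end time = onset + duration
    }


segment = parse_rttm_line(
    "SPEAKER 2b533492-bfa4-49f3-b01a-32546f6044bf_2 1 3.473 4.134 A "
)
print(segment["file_id"], segment["onset"], segment["end"])
```

To process a whole .rttm file, the same function could be applied to each non-empty line.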

You can read .npy files with numpy in Python:

import numpy as np
snr = np.load('detailed_snr_labels/2b533492-bfa4-49f3-b01a-32546f6044bf_2.npy')

The content should be SNR values for each frame. Frames have a duration of 16.875 ms (see #14 (comment)).
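If you want to relate the frames to the segments in the RTTM output, something like the sketch below might help. It assumes that frame i covers the interval from i × 16.875 ms to (i + 1) × 16.875 ms, with no offset at the start of the file; I have not checked this against the code, so treat it as an assumption. The function name `segment_to_frame_range` is hypothetical.

```python
import math

FRAME_DURATION_S = 0.016875  # 16.875 ms per frame, as stated above


def segment_to_frame_range(onset_s, duration_s, frame_s=FRAME_DURATION_S):
    """Return (first, last_exclusive) frame indices overlapping a segment.

    Assumes frame i spans [i * frame_s, (i + 1) * frame_s), which is an
    assumption for illustration, not something confirmed by the project.
    """
    first = math.floor(onset_s / frame_s)
    last_exclusive = math.ceil((onset_s + duration_s) / frame_s)
    return first, last_exclusive


# Segment from the RTTM line discussed in this thread.
first, last = segment_to_frame_range(3.473, 4.134)
print(first, last)

# With the array loaded by np.load, the frames of that segment would be
# snr[first:last], and snr[first:last].mean() their average value.
```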