about the output of the apply

Question

dalaolili opened this issue 4 months ago · 2 comments

I got the line "SPEAKER 2b533492-bfa4-49f3-b01a-32546f6044bf_2 1 3.473 4.134 A " by running the command "python brouhaha/main.py apply", but I don't understand what the columns mean. Could you explain them?
Also, how can I read the .npy file?
Looking forward to your reply.

Answer 1 · 2024-08-06T13:17:19.000Z

I will try to give some leads, although I did not participate in the development so take my comments with a grain of salt.

For your first question, I suggest looking at the description of the RTTM file format (Annex A, page 12) if you need the exact meaning of each column. As an overview, it should be:

SPEAKER : the model identified speech from somebody (all lines here will be SPEAKER)
2b533492-bfa4-49f3-b01a-32546f6044bf_2 : name of the audio file where speech was found
1 : channel (here it should always be 1, as the model works with mono audio)
3.473 : onset of the detected speech segment, in seconds
4.134 : duration of the speech segment, in seconds
A : label of the speaker; it should always be A here, as I think the model is not trained to differentiate between speakers

So your line tells you speech was detected in file 2b533492-bfa4-49f3-b01a-32546f6044bf_2.wav from time 3.473s to time 7.607s (3.473 + 4.134), the rest is not really relevant.
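
As a minimal sketch of the arithmetic above, the line can be split on whitespace and the end time computed as onset plus duration. The function name parse_rttm_line is made up for this illustration and is not part of brouhaha. The sketch also assumes that any "<NA>" placeholder columns, which RTTM files normally contain, can simply be skipped.

```python
def parse_rttm_line(line):
    """Parse one RTTM SPEAKER line into a dict (illustrative helper, not a brouhaha API)."""
    # Drop "<NA>" placeholders so the full RTTM form and the shortened
    # form quoted in the question are handled the same way.
    fields = [f for f in line.split() if f != "<NA>"]
    kind, file_id, channel, onset, duration, label = fields[:6]
    onset = float(onset)
    duration = float(duration)
    return {
        "type": kind,
        "file": file_id,
        "channel": int(channel),
        "onset": onset,
        "duration": duration,
        # End time is onset + duration, rounded to the 3 decimals used in the file.
        "end": round(onset + duration, 3),
        "label": label,
    }


segment = parse_rttm_line(
    "SPEAKER 2b533492-bfa4-49f3-b01a-32546f6044bf_2 1 3.473 4.134 A "
)
print(segment["onset"], segment["end"])
```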

You can read .npy files with numpy in Python:

import numpy as np
snr = np.load('detailed_snr_labels/2b533492-bfa4-49f3-b01a-32546f6044bf_2.npy')

The array should contain one SNR value per frame. Each frame has a duration of 16.875 ms (see the comment on issue #14).
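
If it helps, here is a rough sketch of how the per-frame values could be mapped back to time and averaged over a detected speech segment. It rests on assumptions I have not verified: that the array is one-dimensional, that the first frame starts at 0 s, and that frames follow each other without overlap. The helper names and the synthetic array are only for illustration; replace the array with the result of np.load on your own file.

```python
import numpy as np

FRAME_DURATION_S = 0.016875  # 16.875 ms per frame, as mentioned above


def frame_times(n_frames, frame_duration=FRAME_DURATION_S):
    """Start time in seconds of each frame, assuming frame 0 starts at 0 s."""
    return np.arange(n_frames) * frame_duration


def mean_snr_in_segment(snr, onset, duration, frame_duration=FRAME_DURATION_S):
    """Average the SNR values of the frames that start inside [onset, onset + duration)."""
    times = frame_times(len(snr), frame_duration)
    mask = (times >= onset) & (times < onset + duration)
    # Return NaN when no frame falls inside the segment.
    return float(snr[mask].mean()) if mask.any() else float("nan")


# Synthetic stand-in for the loaded array (600 frames is about 10 s of audio).
snr = np.linspace(0.0, 30.0, 600)
print(mean_snr_in_segment(snr, onset=3.473, duration=4.134))
```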

Answer 2 · 2024-08-08T06:08:53.000Z

Got it! Thanks for your reply!
