BUTSpeechFIT/VBx

AMI inconsistent RTTM

Hi,

I am trying to follow your pipeline for AMI diarization. However, I noticed that there are different versions of the RTTM reference labels, and some of them are very different from each other (>50% DER). Could you tell me how they differ, please?

For example, data/AMI_Mix-Headset/rttms/test/EN2002a.rttm, data/AMI_beamformed/rttms/test/EN2002a.rttm, AMI-diarization-setup/only_words/rttms/test/EN2002a.rttm, and AMI-diarization-setup/word_and_vocalsounds/rttms/test/EN2002a.rttm are all diarization labels for the same meeting (denoted ref1, ref2, ref3, and ref4). I expected them to be very similar. However, I found that ref1 and ref2 are similar (in my opinion, they should be identical), and ref3 and ref4 are similar (which is understandable), but ref1 and ref3 are very different (>30% DER).

Do you have any idea why ref1 and ref3 are so different? And if I am following AMI_run.sh (Mix-Headset), which RTTM file should I use as the ground-truth label?

Your help is greatly appreciated!

Best,
Zili

Hi Zili,
All the RTTMs in the data folder (https://github.com/BUTSpeechFIT/VBx/tree/master/data) are outputs of the models, so they are not reference RTTMs.
The reference RTTMs are in https://github.com/BUTSpeechFIT/AMI-diarization-setup and, as you pointed out, there are two flavors: "only_words" and "word_and_vocalsounds". These two differ in the criterion used to define the reference: either take only words, or also include segments marked as vocal sounds. They are similar, as you said. For reporting results in our publications we always used the only_words version.

The differences between the references and the files in the data folder are simply the errors of the system. Both AMI_Mix-Headset and AMI_beamformed are outputs of the models, and the error on some files can be high. For the file EN2002a, the error is 35.68 DER in AMI_beamformed (where the total test error is 20.84 DER) and 34.60 DER in AMI_Mix-Headset (where the total test error is 18.99 DER). That is the file with the second-highest error, while some other files have an error below 10 DER. I have not analyzed that file in particular, but it makes sense that some recordings are more difficult than others and that this method will have more problems with some recordings than with others.
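
To make the per-file numbers above concrete, here is a minimal sketch of how a DER between a reference RTTM and a system-output RTTM could be computed. It is a simplified frame-based version (10 ms frames, no forgiveness collar, brute-force speaker mapping) and is not the scoring tool used for the numbers in this thread, so its values will not match them exactly. The function names `parse_rttm`, `frame_labels`, and `simple_der`, as well as the toy RTTM content, are made up for illustration.

```python
from itertools import permutations


def parse_rttm(text):
    """Return a list of (start, end, speaker) tuples from RTTM-formatted text."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        # RTTM SPEAKER lines: type, file, channel, onset, duration, ..., speaker at index 7
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        start, dur = float(fields[3]), float(fields[4])
        segments.append((start, start + dur, fields[7]))
    return segments


def frame_labels(segments, n_frames, step):
    """Return, for every frame, the set of speakers active in it."""
    frames = [set() for _ in range(n_frames)]
    for start, end, spk in segments:
        first = int(round(start / step))
        last = min(int(round(end / step)), n_frames)
        for i in range(first, last):
            frames[i].add(spk)
    return frames


def simple_der(ref_segments, hyp_segments, step=0.01):
    """Simplified DER: (missed + false alarm + confusion) / reference speech time."""
    end = max(e for _, e, _ in ref_segments + hyp_segments)
    n_frames = int(round(end / step))
    ref = frame_labels(ref_segments, n_frames, step)
    hyp = frame_labels(hyp_segments, n_frames, step)

    total_ref = sum(len(r) for r in ref)
    if total_ref == 0:
        raise ValueError("reference contains no speech")

    # Count frames in which a reference speaker and a hypothesis speaker overlap.
    cooc = {}
    for r, h in zip(ref, hyp):
        for rs in r:
            for hs in h:
                cooc[(rs, hs)] = cooc.get((rs, hs), 0) + 1

    # Brute-force one-to-one speaker mapping that maximizes correctly
    # attributed frames (only feasible for a handful of speakers).
    ref_spk = sorted({s for _, _, s in ref_segments})
    hyp_spk = sorted({s for _, _, s in hyp_segments})
    padded = hyp_spk + [None] * max(0, len(ref_spk) - len(hyp_spk))
    best_correct = max(
        sum(cooc.get((r, h), 0) for r, h in zip(ref_spk, perm))
        for perm in permutations(padded, len(ref_spk))
    )

    # Per frame the error is max(#ref, #hyp) minus the correctly mapped speakers.
    errors = sum(max(len(r), len(h)) for r, h in zip(ref, hyp)) - best_correct
    return errors / total_ref


# Toy example (invented data): the system misses the last 5 s of speaker B.
REF = """SPEAKER toy 1 0.00 10.00 <NA> <NA> A <NA> <NA>
SPEAKER toy 1 10.00 10.00 <NA> <NA> B <NA> <NA>"""
HYP = """SPEAKER toy 1 0.00 10.00 <NA> <NA> spk1 <NA> <NA>
SPEAKER toy 1 10.00 5.00 <NA> <NA> spk2 <NA> <NA>"""

der = simple_der(parse_rttm(REF), parse_rttm(HYP))
print(f"{100 * der:.2f}")  # 25.00
```

Scoring a system output against the only_words reference in this way, rather than comparing two system outputs with each other, is what the per-file figures refer to.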

I hope this helps
Federico

Hi Federico,

Thanks for your answer! I mistakenly took the files in https://github.com/BUTSpeechFIT/VBx/tree/master/data to be ground-truth labels. Given that they are system outputs, there is no problem. I really appreciate your help!

Best,
Zili