BUTSpeechFIT/VBx

What do ref.rttm and sys.rttm mean?

DTDwind opened this issue · 5 comments

Hello~

At first, I thought that ref.rttm was the reference RTTM (ground truth) and sys.rttm was the diarization system output.

But recently I found that something may be wrong.

What do they really mean?

Yes, they mean that. What problem did you find?

I ran md-eval.pl -r ami_exp/out_dir_AHC+VB/ref.rttm -s ami_exp/out_dir_AHC+VB/sys.rttm and got a DER of 0.19.

Then I used the ground-truth RTTM from Kaldi and ran md-eval.pl -r kaldi.rttm -s sys.rttm. The DER was 19.07.

Finally, I ran md-eval.pl -r kaldi.rttm -s ref.rttm. The DER was 18.99, the same as in the paper!

Hello, I am not really sure what your question is; some information is missing to understand what you find to be wrong.

First of all, I don't know whether those RTTM files contain information for a single audio file or for a collection of them. I will assume they correspond to a single file.

Secondly, I assume that your "ground truth from kaldi" corresponds to some Kaldi system output that you tried to use as a reference. It would be strange for real ground-truth references (ours vs Kaldi's) to differ so much.

So I am guessing that your question is why you get a DER of 0.19 when comparing our ref vs sys, while the difference between kaldi vs ref and kaldi vs sys is only 0.08 (19.07 - 18.99). If that is not your question, please clarify what it is.

To answer that, the first thing to consider is whether you are using a collar in the evaluation. If a collar is used, it is applied around the speaker change points of the reference file and, therefore, different reference files result in different segments being discarded from evaluation, which would lead to differences like the ones you are showing, or even larger ones. Still, md-eval.pl does not use a collar by default, and you did not specify one in the command lines you shared, so (unless you modified the md-eval.pl script) that would not be the reason for such a difference in this case.
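To illustrate why a collar makes the scored region depend on the reference, here is a minimal sketch. It is not md-eval.pl's code: it assumes the collar is applied symmetrically around every segment boundary of the reference, and the function name, the segment tuples, and the 0.25 s collar value are all hypothetical.

```python
def no_score_zones(ref_segments, collar):
    """Return merged (start, end) intervals excluded from scoring.

    Assumption: a zone of +/- collar seconds is placed around every
    segment boundary of the reference. ref_segments is a list of
    (start, end, speaker) tuples in seconds.
    """
    boundaries = sorted({t for start, end, _ in ref_segments for t in (start, end)})
    zones = []
    for t in boundaries:
        lo, hi = t - collar, t + collar
        if zones and lo <= zones[-1][1]:
            # Overlapping or touching zones are merged into one.
            zones[-1] = (zones[-1][0], max(zones[-1][1], hi))
        else:
            zones.append((lo, hi))
    return zones


# Two made-up references for the same 20 s recording.
ref_a = [(0.0, 10.0, "A"), (10.0, 20.0, "B")]
ref_b = [(0.0, 9.5, "1"), (9.5, 10.5, "2"), (10.5, 20.0, "3")]

print(no_score_zones(ref_a, 0.25))
print(no_score_zones(ref_b, 0.25))
```

The two references exclude different stretches of audio, so the same system output is scored on different regions depending on which file is passed as the reference.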

Since no collar is being used, such differences can result from the use of the Hungarian algorithm. DER relies on it to map speaker labels between the system and reference files. Depending on the reference and system files, it can produce different mappings, as it always finds the one that minimizes the DER. Therefore, comparisons that use different files as references can lead to such numbers.

As a simplistic example, consider the following made-up RTTM files (where ref2 could correspond to your "kaldi.rttm"):

ref.rttm
SPEAKER FILE 1 0.00 10.00 A
SPEAKER FILE 1 10.00 10.00 B

sys.rttm
SPEAKER FILE 1 0.00 12.00 1
SPEAKER FILE 1 12.00 8.00 2

ref2.rttm
SPEAKER FILE 1 0.00 9.50 1
SPEAKER FILE 1 9.50 1.00 2
SPEAKER FILE 1 10.50 9.50 3

If you cross-evaluate them:

-r ref.rttm -s sys.rttm results in 10% DER

-r ref2.rttm -s ref.rttm results in 5% DER

-r ref2.rttm -s sys.rttm results in 12.5% DER

After a little analysis of how the mapping is made, those numbers are easy to understand.
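The three numbers above can be reproduced with a short script. This is only a sketch under stated assumptions, not md-eval.pl's implementation: it parses the simplified six-field lines used in this example (not the full RTTM format), assumes no overlapping speech within a file and no collar, and finds the best one-to-one speaker mapping by brute force over permutations instead of the Hungarian algorithm (both give the same optimum on such tiny inputs). All function names are hypothetical.

```python
from itertools import permutations


def parse(text):
    """Parse the simplified lines 'SPEAKER FILE 1 <start> <dur> <spk>'."""
    segments = []
    for line in text.strip().splitlines():
        _, _, _, start, dur, spk = line.split()
        segments.append((float(start), float(start) + float(dur), spk))
    return segments


def overlap(a, b):
    """Duration shared by two (start, end, spk) segments."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def der(ref, hyp):
    """DER as a fraction, assuming no overlapped speech and no collar."""
    ref_spk = sorted({s[2] for s in ref})
    hyp_spk = sorted({s[2] for s in hyp})
    # Time each (reference speaker, system speaker) pair is active together.
    cooc = {(r, h): 0.0 for r in ref_spk for h in hyp_spk}
    for r in ref:
        for h in hyp:
            cooc[(r[2], h[2])] += overlap(r, h)
    # Best one-to-one mapping: maximizing matched time minimizes DER.
    if len(ref_spk) <= len(hyp_spk):
        matched = max(sum(cooc[(r, h)] for r, h in zip(ref_spk, perm))
                      for perm in permutations(hyp_spk, len(ref_spk)))
    else:
        matched = max(sum(cooc[(r, h)] for r, h in zip(perm, hyp_spk))
                      for perm in permutations(ref_spk, len(hyp_spk)))
    total_ref = sum(end - start for start, end, _ in ref)
    total_hyp = sum(end - start for start, end, _ in hyp)
    both = sum(cooc.values())
    missed = total_ref - both          # reference speech with no system speech
    false_alarm = total_hyp - both     # system speech with no reference speech
    confusion = both - matched         # speech attributed to the wrong speaker
    return (missed + false_alarm + confusion) / total_ref


ref = parse("""
SPEAKER FILE 1 0.00 10.00 A
SPEAKER FILE 1 10.00 10.00 B
""")
sys_out = parse("""
SPEAKER FILE 1 0.00 12.00 1
SPEAKER FILE 1 12.00 8.00 2
""")
ref2 = parse("""
SPEAKER FILE 1 0.00 9.50 1
SPEAKER FILE 1 9.50 1.00 2
SPEAKER FILE 1 10.50 9.50 3
""")

for name, r, h in [("ref vs sys", ref, sys_out),
                   ("ref2 vs ref", ref2, ref),
                   ("ref2 vs sys", ref2, sys_out)]:
    print(f"{name}: {100 * der(r, h):.1f}% DER")
```

In the ref2 cases, ref2's short speaker 2 has no system speaker left to map to, so its 1 s segment always counts as an error. For ref2 vs sys, the 1.5 s between 10.5 and 12.0, where sys still outputs speaker 1, is added on top of that.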

I hope this helps, let us know otherwise.

Here are my RTTM files: kaldi.rttm, ref.rttm, sys.rttm

You can download them to see more details.

kaldi.rttm is the ground-truth reference produced by the Kaldi system.

Both ref.rttm and sys.rttm are VBx system outputs in the experiment directory (out_dir_AHC+VB).

It is strange for real ground-truth references to differ so much (18.99% DER).

And your paper reports an AMI VBx DER of 18.99%. It's an amazing coincidence.

So I assume that ref.rttm is actually the VBx system diarization output, not a ground-truth reference.

Or I misunderstood something...

I know what happened......

I set RTTM_DIR=data/AMI_Mix-Headset/rttms/test

But those files are not ground truth; they are the results from the paper!

Sorry for taking your time, and thank you very much!