Error of fa calculation
hangtingchen opened this issue · 5 comments
The falarm value is wrong when the false alarm occurs at the beginning of the RTTM file. The following is an example:
"""
(enh)# cat ref.rttm
SPEAKER bkwns 1 10.00000 35.320000 spk01
SPEAKER bkwns 1 48.160000 0.800000 spk00
(enh)# cat sys.rttm
SPEAKER bkwns 1 0.00 45.12 spk01
SPEAKER bkwns 1 45.12 0.4 spk00
(enh)]# ./md-eval-22.pl -r ref.rttm -s sys.rttm
command line (run on 2022 Oct 21 at 17:17:48) Version: 22 ./md-eval-22.pl -r ref.rttm -s sys.rttm
Time-based metadata alignment
Metadata evaluation parameters:
time-optimized metadata mapping
max gap between matching metadata events = 1 sec
max extent to match for SU's = 0.5 sec
Speaker Diarization evaluation parameters:
The max time to extend no-score zones for NON-LEX exclusions is 0.5 sec
The no-score collar at SPEAKER boundaries is 0 sec
Exclusion zones for evaluation and scoring are:
-----MetaData----- -----SpkrData-----
exclusion set name: DEFAULT DEFAULT DEFAULT DEFAULT
token type/subtype no-eval no-score no-eval no-score
(UEM) X X
LEXEME/un-lex X
NON-LEX/breath X
NON-LEX/cough X
NON-LEX/laugh X
NON-LEX/lipsmack X
NON-LEX/other X
NON-LEX/sneeze X
NOSCORE/ X X X X
NO_RT_METADATA/ X
SU/unannotated X
*** Performance analysis for Speaker Diarization for ALL ***
SCORED SPEAKER TIME =36.120000 secs
MISSED SPEAKER TIME =0.800000 secs
FALARM SPEAKER TIME =0.200000 secs
SPEAKER ERROR TIME =0.200000 secs
OVERALL SPEAKER DIARIZATION ERROR = 3.32 percent of scored speaker time `(ALL)
---------------------------------------------
Speaker type confusion matrix -- speaker weighted
REF\SYS (count) unknown MISS
unknown 1 / 50.0% 1 / 50.0%
FALSE ALARM 1 / 50.0%
---------------------------------------------
Speaker type confusion matrix -- time weighted
REF\SYS (seconds) unknown MISS
unknown 35.32 / 97.8% 0.80 / 2.2%
FALSE ALARM 0.20 / 0.6%
---------------------------------------------
"""
The falarm time is 0.2s. However, sys.rttm contains spk01 from 0-10s, which is not included in the falarm time. The correct falarm time should be 10.2s.
This is an issue in the underlying implementation in NIST's tool. For consistency with how NIST has scored things, I won't touch this. However, if you want a more correct implementation, I'd recommend using pyannote's implementation.
But it seems like dscore implements it correctly here, or am I missing something?
Looks like I replied in haste originally. There is not in fact a bug in that antiquated Perl script. Since no UEM was supplied, the tool defaulted to [min_ref_onset, max_ref_offset] (as pointed out by @desh2608 in the linked code snippet), in this case [10, 48.96]. That is why the initial 10 seconds of speech from the system RTTM is not scored. To get the desired behavior, supply a UEM that marks the entire recording for scoring.
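To make the effect of the induced UEM concrete, here is a minimal sketch that recounts missed and false alarm time for the two RTTM files above under both scoring regions. It is not the md-eval-22.pl or dscore code: the names `induce_uem` and `score` are hypothetical, there is no collar and no speaker mapping, and the 0.2 s speaker error is copied from the tool output rather than recomputed.

```python
# Turns from the example above as (onset, duration) in seconds.
REF = [(10.0, 35.32), (48.16, 0.8)]
HYP = [(0.0, 45.12), (45.12, 0.4)]
SPEAKER_ERROR = 0.2  # copied from the md-eval-22.pl output, not recomputed


def induce_uem(*turn_lists):
    """Hypothetical helper: smallest (start, end) span covering all given turns."""
    turns = [turn for turn_list in turn_lists for turn in turn_list]
    return (min(on for on, _ in turns), max(on + dur for on, dur in turns))


def score(ref, hyp, uem):
    """Return (scored, missed, falarm) speaker time inside uem."""
    lo, hi = uem
    # Cut the scoring region at every turn boundary that falls inside it.
    cuts = {lo, hi}
    for on, dur in ref + hyp:
        cuts.update(t for t in (on, on + dur) if lo < t < hi)
    cuts = sorted(cuts)
    scored = missed = falarm = 0.0
    for a, b in zip(cuts, cuts[1:]):
        mid = (a + b) / 2
        n_ref = sum(on <= mid < on + dur for on, dur in ref)
        n_hyp = sum(on <= mid < on + dur for on, dur in hyp)
        scored += n_ref * (b - a)
        missed += max(0, n_ref - n_hyp) * (b - a)
        falarm += max(0, n_hyp - n_ref) * (b - a)
    return scored, missed, falarm


for name, uem in [("ref only", induce_uem(REF)), ("ref + sys", induce_uem(REF, HYP))]:
    scored, missed, falarm = score(REF, HYP, uem)
    der = 100 * (missed + falarm + SPEAKER_ERROR) / scored
    print(f"{name}: UEM={uem} falarm={falarm:.2f}s DER={der:.2f}%")
```

With the reference-only region the first 10 seconds of system speech fall outside the scored span, so the false alarm time stays at 0.2 s; once the region starts at 0 it becomes 10.2 s and the DER moves from 3.32 to 31.01 percent.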
I think the antiquated Perl script is indeed wrong. It gets UEM only from the reference turns, which would underestimate false alarms as pointed out by the OP. Your tool, however, would compute the DER correctly since it estimates UEM from both the ref and hyp turns.
Argh. Reminder to self to not comment when exhausted. For some reason, I was assuming that the original poster was commenting on the behaviour of dscore and not md-eval-22.pl itself.
When called directly on these RTTM files with no UEM, md-eval-22.pl will induce one from REF turns ONLY, resulting in an optimistic assessment:
*** Performance analysis for Speaker Diarization for ALL ***
SCORED SPEAKER TIME =36.120000 secs
MISSED SPEAKER TIME =0.800000 secs
FALARM SPEAKER TIME =0.200000 secs
SPEAKER ERROR TIME =0.200000 secs
OVERALL SPEAKER DIARIZATION ERROR = 3.32 percent of scored speaker time `(ALL)
dscore, however, will induce a UEM spanning the ENTIRE recording, which when passed to md-eval-22.pl yields the expected result:
*** Performance analysis for Speaker Diarization for ALL ***
SCORED SPEAKER TIME =36.120000 secs
MISSED SPEAKER TIME =0.800000 secs
FALARM SPEAKER TIME =10.200000 secs
SPEAKER ERROR TIME =0.200000 secs
OVERALL SPEAKER DIARIZATION ERROR = 31.01 percent of scored speaker time `(ALL)
I assume @hangtingchen was calling md-eval-22.pl directly to access the DER sufficient statistics and/or confusion matrices (which we really should have output for DIHARD, but didn't because the tables were already getting busy). In which case the solution is to not rely on the UEM induced by the NIST script.
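One way to avoid the induced UEM when calling the script directly is to write an explicit one that covers both the reference and system turns. The sketch below is an assumption-laden illustration, not dscore's code: `uem_from_rttm` is a hypothetical helper, it reads only the file ID, channel, onset, and duration fields of each SPEAKER line, and it emits one `file channel onset offset` line per recording.

```python
REF_RTTM = """\
SPEAKER bkwns 1 10.00000 35.320000 spk01
SPEAKER bkwns 1 48.160000 0.800000 spk00
"""
SYS_RTTM = """\
SPEAKER bkwns 1 0.00 45.12 spk01
SPEAKER bkwns 1 45.12 0.4 spk00
"""


def uem_from_rttm(*rttm_texts):
    """Hypothetical helper: one UEM line per (file, channel) spanning all turns."""
    spans = {}
    for text in rttm_texts:
        for line in text.splitlines():
            fields = line.split()
            if len(fields) < 5 or fields[0] != "SPEAKER":
                continue  # skip blank lines and non-SPEAKER records
            key = (fields[1], fields[2])  # (file ID, channel)
            onset = float(fields[3])
            offset = onset + float(fields[4])
            lo, hi = spans.get(key, (onset, offset))
            spans[key] = (min(lo, onset), max(hi, offset))
    return [
        f"{file_id} {channel} {lo:.3f} {hi:.3f}"
        for (file_id, channel), (lo, hi) in sorted(spans.items())
    ]


if __name__ == "__main__":
    print("\n".join(uem_from_rttm(REF_RTTM, SYS_RTTM)))
```

Saving these lines to a file (say `all.uem`, a name chosen here for illustration) and adding `-u all.uem` to the md-eval-22.pl command line should then score the first 10 seconds as well. If the true recording length is known, using it as the end time would be closer to "the entire recording" than the last turn offset used here.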