Inconsistent evaluation results using different output formats for DOA estimation
thomeou opened this issue · 1 comment
Hi Sharah,
Thank you very much for uploading the baseline.
I would like to ask about the evaluation metrics. It seems that the code produces different scores depending on the DOA output format and on how often the metric is updated.
Here are the results produced by your baseline system on my computer for test fold 1. I converted the file format accordingly. Could you please verify this on your side? There also seems to be a bug in cls_new_metric.update_seld_scores. Which way will the final metrics be computed for the competition? (A rough sketch of the two update patterns I used is given after the results below.)
Results on test split:
1. Using cls_new_metric.update_seld_scores_xyz, updated ONE time as in the baseline
DCASE2020 Scores:
Class-aware localization scores: DOA Error: 26.7575, F-score: 60.53
Location-aware detection scores: Error rate: 0.8034, F-score: 26.14
SELD (early stopping metric): 0.5213
2. Using cls_new_metric.update_seld_scores_xyz, updated for each output file (100 updates in total for fold 1)
DCASE2020 Scores:
Class-aware localization scores: DOA Error: 26.3260, F-score: 60.53
Location-aware detection scores: Error rate: 0.8007, F-score: 26.55
SELD (early stopping metric): 0.5190
3. Using cls_new_metric.update_seld_scores, updated ONE time
DCASE2020 Scores (polar, one update):
Class-aware localization scores: DOA Error: 24.3700, F-score: 60.53
Location-aware detection scores: Error rate: 0.7298, F-score: 36.09
SELD (early stopping metric): 0.4747
4. Using cls_new_metric.update_seld_scores, updated for each output file (100 updates in total for fold 1)
DCASE2020 Scores:
Class-aware localization scores: DOA Error: 23.6153, F-score: 60.53
Location-aware detection scores: Error rate: 0.7269, F-score: 36.55
SELD (early stopping metric): 0.4718
The same inconsistency occurs for the DCASE 2019 scores:
1. Using evaluation_metrics.compute_doa_scores_regr_xyz, as in the baseline code
DCASE2019 Scores (regr_xyz):
Localization-only scores: DOA Error: 23.3281, Frame recall: 65.53
Detection-only scores: Error rate: 0.5435, F-score: 60.58
2. Using evaluation_metrics.compute_doa_scores_regr
DCASE2019 Scores:
Localization-only scores: DOA Error: 20.2878, Frame recall: 65.53
Detection-only scores: Error rate: 0.5435, F-score: 60.58
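For clarity, this is roughly what I mean by one update versus per-file updates. It is a pseudocode-level sketch only: load_output_dict, load_reference_dict, test_files and frames_per_file are placeholders, not functions or variables from your repo.

```python
# Placeholder helpers and variables, simplified frame bookkeeping.

# Case 1: merge all files into one prediction/reference dict and update ONCE
# (case 3 is the same, but with cls_new_metric.update_seld_scores).
all_pred, all_gt, offset = {}, {}, 0
for fname in test_files:
    pred, gt = load_output_dict(fname), load_reference_dict(fname)
    all_pred.update({frame + offset: events for frame, events in pred.items()})
    all_gt.update({frame + offset: events for frame, events in gt.items()})
    offset += frames_per_file  # assuming a fixed number of label frames per file
cls_new_metric.update_seld_scores_xyz(all_pred, all_gt)

# Case 2: call the update once PER output file, i.e. 100 updates for fold 1
# (case 4 analogous with the polar variant).
for fname in test_files:
    cls_new_metric.update_seld_scores_xyz(load_output_dict(fname),
                                          load_reference_dict(fname))
```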
Thank you very much.
Hi @thomeou, thanks for your questions. You were right, there was a bug in the metric code: we had missed normalizing the Cartesian vectors before computing the angular distance. This is now fixed in the recent commit, so the problem you had with different scores for polar and Cartesian outputs should be solved. The bug fix also changes the baseline scores, which we have updated in the repo and on the website.
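For reference, the angular distance between Cartesian DOA vectors has to be computed on unit-normalized vectors. A minimal sketch of the corrected computation (illustrative only, not the exact code in the repo):

```python
import numpy as np

def angular_distance_deg(doa_ref, doa_est):
    """Angle in degrees between two Cartesian DOA vectors.

    Both vectors are normalized to unit length before the dot product;
    skipping this step (the bug) gives a wrong arccos argument whenever
    the regressed vectors are not exactly unit length."""
    doa_ref = np.asarray(doa_ref, dtype=float)
    doa_est = np.asarray(doa_est, dtype=float)
    doa_ref = doa_ref / (np.linalg.norm(doa_ref) + 1e-10)
    doa_est = doa_est / (np.linalg.norm(doa_est) + 1e-10)
    cos_angle = np.clip(np.dot(doa_ref, doa_est), -1.0, 1.0)
    return np.degrees(np.arccos(cos_angle))
```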
Regarding the minor mismatch between a) the scores computed from the 100 DCASE output files and b) the scores printed during training, which are computed in one shot: for the final evaluation of the participating methods we will use approach a), which is the correct approach.
The disparity in b) arises from the way the data generator data_gen_test is created. In order to have only the features/labels of one file in each batch of data, we zero-pad the batch. However, when computing the final metrics, instead of removing these zero pads first, we compute the metrics on the zero-padded values as well, which can result in bad segment framing and hence a score slightly different from the ideal one.
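In other words, dropping the zero-padded frames before the metric sees them would remove this disparity. A hypothetical helper (not part of the repo, and it assumes the true number of label frames per file is known) would look like:

```python
def trim_zero_padding(pred_labels, gt_labels, n_valid_frames):
    # Drop the zero-padded frames appended by data_gen_test so that the
    # segment framing inside the metric matches the true file length.
    # n_valid_frames is assumed to be the number of real label frames.
    return pred_labels[:n_valid_frames], gt_labels[:n_valid_frames]
```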
I have now pushed a new file, calculate_dev_results_from_dcase_output.py, that computes the score the same way we will compute it on your submitted files. You can use this script to check your test scores. The script also has a flag to switch between the polar and Cartesian ways of computing; ideally, you should get exactly the same results irrespective of the format.
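For completeness, the two output formats are related by the usual azimuth/elevation convention, so once the metric operates on unit vectors the scores should coincide. A small conversion sketch, assuming azimuth and elevation in degrees (please double-check the axis convention against the repo before relying on it):

```python
import numpy as np

def polar_to_cartesian(azi_deg, ele_deg):
    # Unit DOA vector from azimuth/elevation given in degrees.
    azi, ele = np.radians(azi_deg), np.radians(ele_deg)
    return np.array([np.cos(ele) * np.cos(azi),
                     np.cos(ele) * np.sin(azi),
                     np.sin(ele)])

def cartesian_to_polar(x, y, z):
    # Azimuth/elevation in degrees from a (not necessarily unit) DOA vector.
    azi = np.degrees(np.arctan2(y, x))
    ele = np.degrees(np.arctan2(z, np.sqrt(x * x + y * y)))
    return azi, ele
```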