How to calculate Macro F1 QALD Score
Hello,
we (@silvanknecht and I) are trying to recreate some of the QALD-9 results and are wondering how the Macro F1 QALD score is calculated. We know that the precision is 1 (instead of 0) if the system gives no answer, but that should not matter for the macro F1, as it is calculated for every question.
Is the F1 calculated from the global macro precision and recall? Or are we missing some cases where it matters that the precision is 1 even though the recall is 0, and therefore the F1 for that question is 0 as well?
Thanks and kind regards
Hi,
thanks for using GERBIL QA. We describe it in a bit more detail in this paper: http://www.semantic-web-journal.net/system/files/swj1838.pdf
Maybe @TortugaAttack can point you to the code which implements that later.
A bit above the text you sent there is this: "For the macro metric, we calculate the precision, recall and F-measure per question and average these metrics individually at the end." Is that different for Macro F1 QALD?
Hi
Regarding the code:
QALD F1 Macro calculation
and the respective single precision, recall and f1 measure calculations:
According to the code, it is different: for the QALD macro F1, the macro precision and macro recall are calculated first, and the F1 measure is then computed from those two averaged values.
I would like to emphasize that this was not our idea. It came from earlier QALD challenges, where a script was used for the evaluation. We implemented it only for backwards compatibility. (You can follow the complete discussion starting from #211 (comment).)
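To make the difference concrete, here is a minimal Python sketch of the two calculations as described in this thread. It is not the GERBIL code; the function names (`per_question_prf`, `macro_f1_standard`, `macro_f1_qald`) are made up for illustration, and the handling of an empty gold answer set is an assumption. Only the "precision is 1 if the system gives no answer" rule is taken from the discussion above.

```python
from typing import List, Set, Tuple

Question = Tuple[Set[str], Set[str]]  # (gold answers, system answers)


def per_question_prf(gold: Set[str], system: Set[str]) -> Tuple[float, float, float]:
    """Precision, recall and F1 for a single question."""
    correct = len(gold & system)
    # Rule mentioned in this thread: no system answer counts as precision 1.
    precision = correct / len(system) if system else 1.0
    # Assumption (not confirmed in this thread): empty gold set counts as recall 1.
    recall = correct / len(gold) if gold else 1.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


def macro_f1_standard(questions: List[Question]) -> float:
    """Average of the per-question F1 values."""
    scores = [per_question_prf(g, s) for g, s in questions]
    return sum(f1 for _, _, f1 in scores) / len(scores)


def macro_f1_qald(questions: List[Question]) -> float:
    """F1 computed from the averaged (macro) precision and recall."""
    scores = [per_question_prf(g, s) for g, s in questions]
    macro_p = sum(p for p, _, _ in scores) / len(scores)
    macro_r = sum(r for _, r, _ in scores) / len(scores)
    denom = macro_p + macro_r
    return 2 * macro_p * macro_r / denom if denom > 0 else 0.0


# Toy data: one fully correct answer and one question left unanswered.
example = [
    ({"a"}, {"a"}),  # P = 1, R = 1, F1 = 1
    ({"b"}, set()),  # P = 1 (by the rule above), R = 0, F1 = 0
]

print(macro_f1_standard(example))  # 0.5
print(macro_f1_qald(example))      # about 0.667
```

In this toy example the unanswered question contributes an F1 of 0 to the standard macro F1, but its precision of 1 raises the macro precision to 1.0, so the QALD variant ends up higher (2 * 1.0 * 0.5 / 1.5). This is the case asked about in the original question: with the QALD calculation, the precision-is-1 rule does affect the final score.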
I see, thanks for the fast reply. We have read the issue but were unsure how the changes were implemented. Now we know and can refer to this issue in our work. Thank you.