Question regarding F1 evaluation metric
atreyasha commented
I would like to ask a question regarding the F1 evaluation metric used in your paper (similar to #3). The paper mentions that the "average of the maximum F1 from each n−1 subset" is used to calculate the F1 metric. I am slightly unsure how this works, but I think it could mean the following:
- For each classification output, compare the predicted label against the labels from the annotators. Compute the maximum F1 per sample (which should be the same as accuracy), as shown in the example below:

  | Sample | Predicted Label | Ann1       | Ann2       | Ann3       | Maximum F1 |
  |--------|-----------------|------------|------------|------------|------------|
  | 1      | Relevant        | Irrelevant | None       | Irrelevant | 0          |
  | 2      | Relevant        | Relevant   | Relevant   | Relevant   | 1          |
  | 3      | Irrelevant      | None       | Irrelevant | Relevant   | 1          |

- Take the average of all maximum F1 scores: (0 + 1 + 1) / 3 = 2/3 ≈ 0.67
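To make sure I am stating my interpretation precisely, here is a minimal sketch of the computation I have in mind. This is not your code; the function name and the sample data are made up purely for illustration:

```python
def max_f1_per_sample(predicted, annotator_labels):
    # With a single prediction per sample, F1 against any one annotator
    # reduces to 0/1 agreement, so the per-sample maximum F1 is 1 if the
    # prediction matches at least one annotator label and 0 otherwise.
    return 1.0 if any(predicted == label for label in annotator_labels) else 0.0

# The three samples from the table above.
samples = [
    ("Relevant",   ["Irrelevant", "None", "Irrelevant"]),  # sample 1 -> 0
    ("Relevant",   ["Relevant", "Relevant", "Relevant"]),  # sample 2 -> 1
    ("Irrelevant", ["None", "Irrelevant", "Relevant"]),    # sample 3 -> 1
]

scores = [max_f1_per_sample(pred, anns) for pred, anns in samples]
print(sum(scores) / len(scores))  # (0 + 1 + 1) / 3 ≈ 0.67
```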
Is my understanding of the evaluation metric correct?
Thank you for your time.