Evaluate NER
shizhediao opened this issue · 1 comment
I was wondering how to produce macro average F1 from the code?
Thanks!
Thanks to @kyleclo, this question has been answered and the issue closed.
PICO is unusual because we don't want to count one of the classes when doing macro F1. That is, take a raw average of the reported per-class metrics F1-I-INT, F1-I-OUT, and F1-I-PAR, which are the relevant PICO tags.

As for the other tasks, F1-measure-overall and average-F1 are the two metrics to look at. F1-measure-overall is a span-level F1 measure, so it is used for all the sequence tagging tasks (e.g. NER). average-F1 is used for the text/relation classification tasks. The only exception is Chemprot, which is typically reported using micro-F1; for single-label classification, micro-F1 is computationally equivalent to accuracy, which is one of the AllenNLP metrics.
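For concreteness, here is a minimal sketch of that macro-F1 computation, assuming the evaluation code exposes its per-class scores as a plain dict keyed by the metric names from this thread (the `F1-O` key and the numeric scores below are made up for illustration):

```python
# Raw (unweighted) average of the per-class F1s for the relevant PICO tags,
# deliberately excluding any other classes (e.g. the hypothetical F1-O below).
PICO_TAGS = ["F1-I-INT", "F1-I-OUT", "F1-I-PAR"]

def pico_macro_f1(metrics: dict) -> float:
    """Macro F1 over the PICO tags only: a plain mean of their per-class F1s."""
    return sum(metrics[tag] for tag in PICO_TAGS) / len(PICO_TAGS)

# Example usage with made-up scores:
metrics = {"F1-I-INT": 0.78, "F1-I-OUT": 0.71, "F1-I-PAR": 0.83, "F1-O": 0.97}
print(f"PICO macro-F1: {pico_macro_f1(metrics):.4f}")  # F1-O is not counted
```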