AbhilashaRavichander/PrivacyQA_EMNLP

Question regarding F1 evaluation metric

Opened this issue · 0 comments

Hi @AbhilashaRavichander,

I would like to ask a question regarding the F1 evaluation metric used in your paper (similar to #3). The paper mentions that the "average of the maximum F1 from each n−1 subset" is used to compute the F1 metric. I am not entirely sure how this works, but I think it could mean the following:

  1. For each classification output, compare the predicted label against the labels from the annotators. Compute the maximum F1 per sample (which, for a single sample, should reduce to 0/1 accuracy), as shown in the example below:

     | Sample | Predicted Label | Ann1 | Ann2 | Ann3 | Maximum F1 |
     |--------|-----------------|------|------|------|------------|
     | 1 | Relevant | Irrelevant | None | Irrelevant | 0 |
     | 2 | Relevant | Relevant | Relevant | Relevant | 1 |
     | 3 | Irrelevant | None | Irrelevant | Relevant | 1 |
  2. Take the average of all maximum F1 scores: (0 + 1 + 1)/3 = 2/3 ≈ 0.67 (a small code sketch of this computation follows the list).

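To make my interpretation concrete, here is a minimal Python sketch of steps 1–2 above, using the toy data from the table. The variable names and structure are just my own illustration, not your evaluation code, and it assumes per-sample F1 reduces to a 0/1 match against the closest annotator:

```python
# Hypothetical sketch of the interpretation described above
# (not the repository's evaluation script).

predictions = ["Relevant", "Relevant", "Irrelevant"]
annotations = [
    ["Irrelevant", "None", "Irrelevant"],  # sample 1
    ["Relevant", "Relevant", "Relevant"],  # sample 2
    ["None", "Irrelevant", "Relevant"],    # sample 3
]

# Per-sample "maximum F1": 1 if the prediction matches any annotator, else 0.
max_f1_per_sample = [
    max(1.0 if pred == ann else 0.0 for ann in anns)
    for pred, anns in zip(predictions, annotations)
]

# Average over all samples.
overall_f1 = sum(max_f1_per_sample) / len(max_f1_per_sample)
print(max_f1_per_sample, overall_f1)  # [0.0, 1.0, 1.0] 0.666...
```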
Is my understanding of the evaluation metric correct?

Thank you for your time.