Regarding Score Computation
Hi @markus-eberts!
I have some questions regarding the _score method from evaluator.py:
for (sample_gt, sample_pred) in zip(gt, pred):
    # collect every triple that appears in either the ground truth or the predictions
    union = set()
    union.update(sample_gt)
    union.update(sample_pred)

    for s in union:
        # ground-truth side: the relation type index, or 0 if the triple was not annotated
        if s in sample_gt:
            t = s[2]
            gt_flat.append(t.index)
            types.add(t)
        else:
            gt_flat.append(0)

        # prediction side: the relation type index, or 0 if the triple was not predicted
        if s in sample_pred:
            t = s[2]
            pred_flat.append(t.index)
            types.add(t)
        else:
            pred_flat.append(0)
Why exactly do you append to the flat arrays twice when a relation isn't classified correctly? Since you end up with arrays larger than the actual number of relations in the evaluated dataset, wouldn't this penalize the computed score? You would have one correct element for every hit, but two wrong elements for every miss.
Thanks!
Hi,
in our scenario it is indeed possible that the number of predicted relation triples (head, relation, tail) is larger than the actual number of triples, since the model could predict multiple false positive triples. For example, suppose we have the sentence
Douglas Adams was the author of The Hitchhiker's Guide to the Galaxy and The Meaning of Liff.
with the ground truth triples (Douglas Adams, author, The Hitchhiker's Guide to the Galaxy) and (Douglas Adams, author, The Meaning of Liff). Now, suppose the model correctly predicts (Douglas Adams, author, The Hitchhiker's Guide to the Galaxy) but falsely outputs (Douglas Adams, country, The Meaning of Liff). In this case we end up with a true positive, a false positive and a false negative prediction. The same would happen if the false prediction is (was, author, The Hitchhiker's Guide to the Galaxy) instead of (Douglas Adams, country, The Meaning of Liff).
In both cases, we end up with a precision of 0.5 (1 out of 2 predictions correct) and a recall of 0.5 (1 out of 2 ground truth triples predicted correctly). But it is also possible that the model only predicts (Douglas Adams, author, The Hitchhiker's Guide to the Galaxy) and misses (Douglas Adams, author, The Meaning of Liff). In this case, we have a true positive and a false negative prediction, so a precision of 1 and a recall of 0.5.
So, missing a correct triple and falsely predicting multiple other triples is penalized more than just missing the triple. I think this is the desired behavior in joint entity and relation extraction, since results are measured in F1 score on a triple level. Please correct me if I'm wrong or misunderstood your question.
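To make the two scenarios concrete, here is a minimal sketch (not the code from this repository) of triple-level scoring, where precision and recall are computed over exact (head, relation, tail) matches:

def triple_prf(gt, pred):
    gt, pred = set(gt), set(pred)
    tp = len(gt & pred)                                   # exactly matching triples
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gt = [("Douglas Adams", "author", "The Hitchhiker's Guide to the Galaxy"),
      ("Douglas Adams", "author", "The Meaning of Liff")]

# one correct triple plus one false prediction -> precision 0.5, recall 0.5
pred_a = [("Douglas Adams", "author", "The Hitchhiker's Guide to the Galaxy"),
          ("Douglas Adams", "country", "The Meaning of Liff")]

# one correct triple, the second one simply missed -> precision 1.0, recall 0.5
pred_b = [("Douglas Adams", "author", "The Hitchhiker's Guide to the Galaxy")]

print(triple_prf(gt, pred_a))  # (0.5, 0.5, 0.5)
print(triple_prf(gt, pred_b))  # (1.0, 0.5, 0.666...)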
Hi @markus-eberts , thanks for the reply!
First, let me emphasize that my main concern here is just classifying relations (I'm not worried about joint entity prediction or classification). Does this invalidate your analysis?
I think I understand your explanation, but I'm still having trouble understanding how it fits your code. Here is what I gathered while debugging it:
I think that you aligned ground truth and predictions in the gt and pred arrays (I noticed that some elements in the pred array were empty, which would mean that the model didn't detect those entities, right?). Considering that both arrays are aligned, and that both have 288 elements, then you would have one element for each triple in the ground truth. In your case, you would have 2 elements for (Douglas Adams, author, The Hitchhiker's Guide to the Galaxy) and (Douglas Adams, author, The Meaning of Liff), I suppose.
So, when you process each element in these arrays, in case a triple has an incorrect relation prediction, you append it twice to the flat arrays, and it would be counted as a false negative (from the perspective of the actual class) and as a false positive (from the perspective of the predicted class), right? This would explain the size of the flat array.
Do I get it right now?
Thanks!
First, let me emphasize that my main concern here is just classifying relations (I'm not worried about joint entity prediction or classification). Does this invalidate your analysis?
In standard relation classification (e.g. SemEval-2010 Task 8) the entity mention pair is given. Whether the method I'm using suits your needs depends on your task requirements.
I think that you aligned ground truth and predictions in the gt and pred arrays (I noticed that some elements in the pred array were empty, which would mean that the model didn't detect those entities, right?). Considering that both arrays are aligned, and that both have 288 elements, then you would have one element for each triple in the ground truth. In your case, you would have 2 elements for (Douglas Adams, author, The Hitchhiker's Guide to the Galaxy) and (Douglas Adams, author, The Meaning of Liff), I suppose.
No, the gt and pred lists contain all ground truth / predicted triples per document (sentence), i.e. a nested list for each document. Both lists contain 288 entries, i.e. one for each document of your test set (CoNLL04 I suppose). For example, you get the ground truth triples and predicted triples for the first document with gt[0] and pred[0].
(I noticed that some elements in the pred array were empty, which would mean that the model didn't detect those entities, right?)
Then the model detected no triples for this document.
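To illustrate the structure, here is a simplified, hypothetical example: one inner list of triples per test document, which may be empty when nothing was predicted. The triples are written here as (head, tail, relation type), matching s[2] being the relation type in the snippet above; in the actual code the tuple elements are entity spans and a relation type object rather than plain strings.

gt = [
    [("Douglas Adams", "The Hitchhiker's Guide to the Galaxy", "author"),
     ("Douglas Adams", "The Meaning of Liff", "author")],  # ground-truth triples of document 0
    [],                                                    # document 1 has no annotated relations
]
pred = [
    [("Douglas Adams", "The Hitchhiker's Guide to the Galaxy", "author")],  # predictions for document 0
    [],                                                                     # document 1: no triples predicted
]

print(gt[0])    # all ground-truth triples of the first document
print(pred[1])  # [] -> the model detected no triples for this document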
So, when you process each element in these arrays, in case a triple has an incorrect relation prediction, you throw it twice to the flat array, and it would be computed as a false negative (from the perspective of the actual class) and as a false positive (from the perspective of the predicted class), right? This would explain the size of the flat array.
Yes. Assume the correct relation class for a specific entity mention pair is "author" (id=1) but the model predicted "country" (id=2). Then the flat lists look like:
gt_flat = [1, 0]
pred_flat = [0, 2]
I'm using "precision_recall_fscore_support" from sklearn to compute our measurements. Since the 0 label is ignored (I'm not including it in the labels list), the method counts a false negative and a false positive. This is equivalent to passing
gt_flat = [1]
pred_flat = [2]
instead. So one false negative for the actual class and one false positive for the predicted class.
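As a quick sanity check, here is a minimal sketch (assuming a reasonably recent scikit-learn is installed; the type indices 1 and 2 are just the example values from above):

from sklearn.metrics import precision_recall_fscore_support

gt_flat = [1, 0]    # actual class "author" (1); 0 padding for the extra false prediction
pred_flat = [0, 2]  # the model missed "author" and falsely predicted "country" (2)

# label 0 is excluded via `labels`, so the padding is never counted as a class of its own
p, r, f, _ = precision_recall_fscore_support(gt_flat, pred_flat, labels=[1, 2],
                                             average='micro', zero_division=0)
print(p, r, f)  # 0.0 0.0 0.0 -> one false negative (label 1), one false positive (label 2)

# equivalent to dropping the padded zeros entirely
p, r, f, _ = precision_recall_fscore_support([1], [2], labels=[1, 2],
                                             average='micro', zero_division=0)
print(p, r, f)  # 0.0 0.0 0.0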
OK, great. Makes sense to me now, thanks!