Consideration of <PAD> token while evaluating NER tasks

Question

NeeharikaGupta opened this issue 3 years ago · 4 comments

When preparing data for a model, we generally pad sequences with a <PAD> token to get a fixed-length structure. This is natural when implementing a model, but I would like to know what to do about it during evaluation. A <PAD> tag may end up between the B and I tags of an entity, and it is not clear how the evaluation criteria should treat that case. The CoNLL evaluation script does not consider padded tokens, and I found this library that does a similar computation. How should padded tokens, which may occur anywhere in the text, be handled in the evaluation phase?

Answer 1 · 2022-03-13T22:23:59.000Z

If <PAD> is included in the predicted sequences, seqeval raises "UserWarning: <PAD> seems not to be NE tag." So you need some post-processing.

Answer 2 · 2022-03-14T05:10:36.000Z

seqeval raises this warning, but the CoNLL script raises no warnings or errors when processing such a file.
I would like to know what people usually do in such cases. Do they convert all padded tokens (except the true positives) to the O (Other) tag of the IOB scheme, or is some other post-processing generally used?

Answer 3 · 2022-03-15T06:30:36.000Z

One way is to convert the pad token to the O tag. Another way is to remove it from the evaluation. Or replace it with a B/I tag (e.g. B-Type1 <PAD> I-Type1 -> B-Type1 I-Type1 I-Type1).
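
A minimal sketch of these three options in plain Python, run on tag sequences before they are passed to a scorer. The function names, the PAD constant, and the rule that an unbridgeable <PAD> falls back to O are all assumptions made for illustration. None of them come from seqeval or the CoNLL script.

```python
from typing import List, Tuple

PAD = "<PAD>"  # assumed spelling of the padding tag


def pad_to_outside(tags: List[str], pad: str = PAD) -> List[str]:
    """Option 1: map every pad tag to the O tag."""
    return ["O" if t == pad else t for t in tags]


def drop_pad_positions(
    gold: List[str], pred: List[str], pad: str = PAD
) -> Tuple[List[str], List[str]]:
    """Option 2: drop positions whose gold tag is pad from both sequences.

    Assumes gold and pred are aligned and of equal length. A pad tag that
    the model predicts at a real token position is left in place.
    """
    kept = [(g, p) for g, p in zip(gold, pred) if g != pad]
    return [g for g, _ in kept], [p for _, p in kept]


def bridge_pad(tags: List[str], pad: str = PAD) -> List[str]:
    """Option 3: turn a pad inside an entity into the matching I- tag.

    A pad is bridged only when the previous tag is B-X or I-X and the next
    non-pad tag is I-X. Any other pad becomes O (an assumed fallback).
    """
    out = list(tags)
    for i, tag in enumerate(out):
        if tag != pad:
            continue
        prev = out[i - 1] if i > 0 else "O"
        nxt = next((t for t in tags[i + 1:] if t != pad), "O")
        if prev[:2] in ("B-", "I-") and nxt == "I-" + prev[2:]:
            out[i] = "I-" + prev[2:]
        else:
            out[i] = "O"
    return out


print(bridge_pad(["B-Type1", "<PAD>", "I-Type1"]))
```

Option 2 only removes positions that are padding in the gold labels, so it can be combined with option 1 or 3 to clean up any <PAD> the model still predicts at real token positions.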

Answer 4 · 2022-03-15T06:54:51.000Z

Thanks a lot. This answers everything.