hipe-eval/HIPE-scorer

Evaluation measures: Slot error rates

simon-clematide opened this issue · 3 comments

@maud What would be the expected benefits of this evaluation measure if we

  • already do entity-level evaluations (not IOB-token-level evaluations)?
  • allow fuzzy matches?

paper: https://pdfs.semanticscholar.org/451b/61b390b86ae5629a21461d4c619ea34046e0.pdf

@simon-clematide +pinging @mromanello and @aflueckiger

I think we should consider 2 questions/points:

(1) capacity to provide fine-grained evaluation reports to participants.

  • this is not so useful if everybody uses deep learning, since one cannot really tune features.
  • this information is, however, already at hand in the eval script, and the report could be produced. I will share an example on Slack.

(2) SER, i.e. the capacity to weight different types of mistakes differently (penalizing type errors more heavily than boundary errors, and penalizing entities with both a type and a boundary error even more).
If the fine-grained report is done, then SER is just another measure combining things a bit differently.
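
For concreteness, here is a minimal sketch of how such a weighted SER could be computed once the error types have been counted on the entity level. The function name and the weight values are purely illustrative assumptions, not taken from the linked paper or from the eval script.

```python
# Minimal sketch of a weighted Slot Error Rate (SER), assuming the error
# counts have already been extracted from an entity-level comparison of the
# system output against the gold standard. Weight values are illustrative.
def slot_error_rate(deletions, insertions, type_errors, boundary_errors,
                    type_and_boundary_errors, n_reference_entities,
                    w_type=1.0, w_boundary=0.5, w_both=1.5):
    """Weighted SER: substitutions are penalized differently depending on
    whether the type, the boundaries, or both are wrong."""
    substitutions = (w_type * type_errors
                     + w_boundary * boundary_errors
                     + w_both * type_and_boundary_errors)
    return (deletions + insertions + substitutions) / n_reference_entities


# Example: 10 gold entities, 2 missed, 1 spurious, 1 type error, 1 boundary error
print(slot_error_rate(2, 1, 1, 1, 0, 10))  # -> 0.45
```

With `w_boundary < w_type < w_both`, boundary slips cost less than type confusions, and entities that are wrong on both counts cost the most, i.e. exactly the weighting described above.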

What can we gain from SER? I think a slightly better understanding of why systems are wrong, but not dramatically more information (a rough classification sketch follows the list below):

  • deletion (false negatives): I think this will be the majority of mistakes for all systems, and this aspect is already captured by recall (R)
  • insertion (false positives): already broadly captured by precision (P)
  • type substitution: already broadly captured by P; would give an indication of type confusability
  • boundary substitution: will already be broadly captured by fuzzy vs. exact matching
  • type and boundary substitution: super wrong predictions
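
To make the mapping concrete, here is a rough sketch of how a single prediction could be assigned to one of these categories. The (start, end, type) representation and the function classify_prediction are assumptions for illustration, not the actual HIPE-scorer logic.

```python
# Rough sketch of classifying one predicted entity against the gold
# annotations. Entities are assumed to be (start, end, type) tuples with
# token offsets and an exclusive end.
def classify_prediction(pred, gold_entities):
    start, end, etype = pred
    same_span = [g for g in gold_entities if (g[0], g[1]) == (start, end)]
    overlapping = [g for g in gold_entities
                   if g[0] < end and start < g[1] and (g[0], g[1]) != (start, end)]

    if any(g[2] == etype for g in same_span):
        return "correct"
    if same_span:
        return "type substitution"               # right boundaries, wrong type
    if any(g[2] == etype for g in overlapping):
        return "boundary substitution"           # right type, overlap only
    if overlapping:
        return "type and boundary substitution"  # both wrong
    return "insertion"                           # false positive, no gold counterpart


gold = [(0, 2, "PERS"), (5, 7, "LOC")]
print(classify_prediction((0, 2, "LOC"), gold))    # type substitution
print(classify_prediction((4, 7, "LOC"), gold))    # boundary substitution
print(classify_prediction((10, 11, "ORG"), gold))  # insertion
```

Gold entities that are never matched by any prediction are the deletions (false negatives).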

Overall we could leave out SER, but a detailed eval report could be useful.

With the eval script we can get very nuanced error reports. It has a very agnostic basis and aggregates numbers at different levels (including type confusion, which we won't use for the official ranking).

Thus I suggest dropping SER.

can be closed