hipe-eval/HIPE-2022-data

Missing Entities in TopRes19th Dataset

stefan-it opened this issue · 2 comments

Hi,

during the review of adding the HIPE-2022 dataset to Flair, we found that some of the listed entity types do not exist in the actual dataset.

These entities are: ALIEN, OTHER, FICTION.

Could you please clarify what happened to these entities? Will they be added later, or will they appear in the final test dataset?

Many thanks,

Stefan

Hi Stefan,
you are right, these entities were removed. We chose to do so because they are scarce in the training data. We first thought about keeping OTHER, but then decided to keep things simple for HIPE 2022, given that all these datasets are already a bit nightmarish. Here are the stats that one can generate from the published data in WebAnno TSV format:

annotated_tsv $ cat *.tsv | grep -vP '^#' | cut -f 5 | grep -Po '[A-Z]+' | sort | uniq -c | sort -rn
   3470 LOC
    891 BUILDING
    406 STREET
      5 OTHER
      1 FICTION

For the currently private data, the distribution is similar, and we will not add any further entity types.
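
For anyone who wants to reproduce these counts without the shell pipeline, here is a rough Python sketch of the same idea. It assumes, as the command above does, that the entity label sits in the fifth tab-separated column and that lines starting with `#` are metadata. The function name `count_entity_types` and the `column` parameter are made up for illustration and are not part of any HIPE or Flair tooling. Like the one-liner, it counts label occurrences per line, so the numbers are not necessarily counts of distinct entity mentions.

```python
import re
from collections import Counter
from typing import Iterable

# Same pattern as grep -Po '[A-Z]+': runs of uppercase ASCII letters.
LABEL_RE = re.compile(r"[A-Z]+")


def count_entity_types(lines: Iterable[str], column: int = 4) -> Counter:
    """Count uppercase labels found in one tab-separated column.

    `column` is zero-based, so the default 4 corresponds to `cut -f 5`.
    This is a hypothetical helper, not an official HIPE-2022 script.
    """
    counts: Counter = Counter()
    for line in lines:
        # Skip metadata/comment lines, like grep -vP '^#'.
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Lines without a fifth column (e.g. blank separators) are ignored.
        if len(fields) <= column:
            continue
        # A cell such as "BUILDING[1]|STREET[2]" contributes both labels.
        counts.update(LABEL_RE.findall(fields[column]))
    return counts


if __name__ == "__main__":
    import glob

    totals: Counter = Counter()
    # Assumes the script is run from the directory holding the .tsv files.
    for path in sorted(glob.glob("*.tsv")):
        with open(path, encoding="utf-8") as handle:
            totals.update(count_entity_types(handle))
    for label, n in totals.most_common():
        print(f"{n:7d} {label}")
```

One small difference from the shell version: `cut` passes through lines that contain no tab at all, whereas this sketch skips any line with fewer than five fields, so stray uppercase text outside the label column is not counted.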

Hi @simon-clematide ,

thanks for the explanation! c673e29 mentions it, so I'm closing here 🤗