hipe-eval/HIPE-2022-data

Minor issues in AJMC v2.0


Hi,

I've just written some test cases for reading the v2.0 version of the corpus in Flair, and it seems that there are some issues with AJMC:

  • HIPE-2022-v2.0-ajmc-train-de.tsv: In line 16.537 the token ἄνδοα starts with a leading whitespace (very minor issue). Leading spaces also appear in other AJMC splits.
  • HIPE-2022-v2.0-ajmc-train-en.tsv: Two "empty" tokens are in the dataset at lines 5.157 and 5.645. Those tokens should be removed or replaced with a non-whitespace character.
  • HIPE-2022-v2.0-ajmc-dev-en.tsv: Line 5.660 unfortunately has two tokens (separated with whitespace): περάνας sa

Would be awesome if this could be fixed in the next release(s). I'm going to catch these issues in Flair for now :)
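
For reference, here is a minimal sketch of a check that surfaces the three cases listed above. It is not the code used in Flair. It assumes that the files are tab-separated with the token in the first column and that metadata lines start with "# "; the function name `find_token_issues` is made up for illustration.

```python
def find_token_issues(lines):
    """Yield (line_number, kind, raw_token) for suspicious token fields.

    Assumptions (not taken from the HIPE documentation): columns are
    tab-separated, the token is the first column, and lines that are
    empty or start with "# " carry no token.
    """
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line or line.startswith("# "):
            continue
        token = line.split("\t", 1)[0]
        if token.strip() == "":
            # Token field is empty or consists only of whitespace.
            yield lineno, "empty", token
        elif token != token.strip():
            # Token has leading or trailing whitespace.
            yield lineno, "padded", token
        elif len(token.split()) > 1:
            # Token field contains several whitespace-separated tokens.
            yield lineno, "multiple", token
```

It can be run over a file with something like `for issue in find_token_issues(open(path, encoding="utf-8")): print(issue)`, where `path` points to one of the TSV files.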

Hello @stefan-it, many thanks for flagging these issues (and for catching them in Flair for the time being!). We will fix them and release a new minor version of the HIPE-2022-data soon.

(For the record: the issue with the two empty tokens is in file HIPE-2022-v2.0-ajmc-dev-en.tsv, not train).