Minor issues in AJMC v2.0
Closed this issue · 2 comments
stefan-it commented
Hi,
I've just written some test cases for reading the v2.0 version of the corpus in Flair, and it seems that there are some issues with AJMC:
- HIPE-2022-v2.0-ajmc-train-de.tsv: In line 16.537 the token " ἄνδοα" starts with a leading whitespace (very minor issue). Leading spaces also appear in other AJMC splits.
- HIPE-2022-v2.0-ajmc-train-en.tsv: Two "empty" tokens are in the dataset, at lines 5.157 and 5.645. Those tokens should be removed or replaced with a non-whitespace character.
- HIPE-2022-v2.0-ajmc-dev-en.tsv: Line 5.660 unfortunately has two tokens (separated with whitespace):
περάνας sa
Would be awesome if this could be fixed in the next release(s), I'm going to catch these issues in Flair for now :)
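For anyone who wants to check other files for the same three problems (leading whitespace, empty tokens, whitespace inside a token), here is a minimal sketch. It is not the code used in Flair. It assumes the token is the first tab-separated column, that blank lines are separators, and that lines starting with "#" are metadata; the function name `find_token_issues` and the sample rows are made up for illustration.

```python
from typing import Iterable, List, Tuple


def find_token_issues(lines: Iterable[str]) -> List[Tuple[int, str, str]]:
    """Return (line_number, problem, raw_token) for each suspicious token."""
    issues = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        # Assumption: blank lines separate sentences/documents and lines
        # starting with "#" carry metadata, so neither holds a token.
        if not line.strip() or line.startswith("#"):
            continue
        # Assumption: the token is the first tab-separated column.
        token = line.split("\t", 1)[0]
        if not token.strip():
            issues.append((line_no, "empty token", token))
        elif token != token.strip():
            issues.append((line_no, "leading/trailing whitespace", token))
        elif len(token.split()) > 1:
            issues.append((line_no, "whitespace inside token", token))
    return issues


# Hypothetical sample rows that mimic the three reported problems.
sample = [
    "TOKEN\tNE-COARSE-LIT\n",
    "# document_id = example\n",
    " ἄνδοα\tO\n",
    "\tO\n",
    "περάνας sa\tO\n",
    "Aias\tO\n",
]

for issue in find_token_issues(sample):
    print(issue)
```

To run it on a real split, pass an open file handle instead of `sample`, e.g. `open(path, encoding="utf-8")`. Each row is reported under at most one problem, so a token with both a leading space and an inner space is only listed as "leading/trailing whitespace".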
mromanello commented
Hello @stefan-it, many thanks for flagging these issues (and for catching them in Flair for the time being!). We will fix them and release a new minor version of the HIPE-2022-data soon.
mromanello commented
(For the record: the issue with the two empty tokens is in file HIPE-2022-v2.0-ajmc-dev-en.tsv, not train.)