New conllu format not parsable by Spacy
olavski opened this issue · 5 comments
First of all, thanks to everyone involved for making this dataset available! I've been waiting for a long time for something like this!
I'm trying to train Spacy models using this dataset and noticed that the conllu format changed recently: entities are now labelled as "name=B-PER" or "SpaceAfter=No name=O", which Spacy is unable to parse.
Until recently the format was like this:
1 Dommer dommer NOUN _ Definite=Ind|Gender=Masc|Number=Sing 2 appos _ O
2 Finn Finn PROPN _ Gender=Masc 4 nsubj _ B-PER
3 Eilertsen Eilertsen PROPN _ _ 2 name _ I-PER
4 avstår avstå VERB _ Mood=Ind|Tense=Pres|VerbForm=Fin 0 root _ O
Now this has changed to:
1 Dommer dommer NOUN _ Definite=Ind|Gender=Masc|Number=Sing 2 nmod _ name=O
2 Finn Finn PROPN _ Gender=Masc 4 nsubj _ name=B-PER
3 Eilertsen Eilertsen PROPN _ _ 2 flat:name _ name=I-PER
4 avstår avstå VERB _ Mood=Ind|Tense=Pres|VerbForm=Fin 0 root _ SpaceAfter=No name=O
When converting to Spacy's json format, entities with the prefix 'name=' are not parsed correctly. Some NER columns also contain "SpaceAfter=No", which Spacy does not parse correctly either.
Are these new changes intentional or is the current format incorrect?
At the moment it is not possible to directly convert the .conllu files to Spacy's json format so for now I'm using this quick fix to make the .conllu files parsable by Spacy:
sed -i 's/SpaceAfter=No name=//g' norne/ud/nob/no_bokmaal-ud-dev.conllu
sed -i 's/name=//g' norne/ud/nob/no_bokmaal-ud-dev.conllu
sed -i 's/SpaceAfter=No name=//g' norne/ud/nob/no_bokmaal-ud-train.conllu
sed -i 's/name=//g' norne/ud/nob/no_bokmaal-ud-train.conllu
Are these new changes intentional or is the current format incorrect?
Yes, the current format is intentional, and it follows the specification of the CoNLL-U format's MISC column, which is a sort of key-value format: key1=value1 key2=value2.
Apologies for first publishing the data in an invalid format, and then fixing it.
I suggest we just add a script to the repo to convert the data to spacy format, e.g. conllu2spacy, and add a note to the README. Please feel free to open a PR if you have time.
Thanks for the reply!
Ok, I see. I'll write a script to convert the files to a Spacy-parsable format (where the 10th field contains only the NER tag, like I-PER).
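A minimal sketch of what such a conversion could look like, not the script that ended up in the repo. It assumes tab-separated CoNLL-U token lines with 10 fields, and the function name strip_misc_to_ner is made up for illustration. Falling back to "O" when no name= key is present is also an assumption.

```python
import re

# Matches the NER tag stored under the "name" key of the MISC column.
# Either "|" or a space is accepted as the separator in front of the key.
NAME_RE = re.compile(r"(?:^|[| ])name=([^| ]+)")


def strip_misc_to_ner(line: str) -> str:
    """Return a CoNLL-U token line whose 10th field holds only the NER tag.

    Hypothetical helper: comment lines, blank lines and anything that does
    not have exactly 10 tab-separated fields are returned unchanged.
    """
    if not line.strip() or line.startswith("#"):
        return line
    newline = "\n" if line.endswith("\n") else ""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        return line
    match = NAME_RE.search(cols[9])
    # Assumption: tokens without a name= key are treated as outside any entity.
    cols[9] = match.group(1) if match else "O"
    return "\t".join(cols) + newline
```

Applied line by line to a .conllu file, this keeps every other column as it is and only rewrites the last one.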
To conform with the CoNLL-U format, shouldn't the MISC values be separated by |? So it should be SpaceAfter=No|name=O and not SpaceAfter=No name=O.
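To illustrate the difference, here is a rough sketch of reading the MISC field into key/value pairs. The function name parse_misc is hypothetical. Splitting on "|" matches the CoNLL-U convention; also accepting a space is only an assumption, so that files using the space-separated variant can still be read.

```python
import re


def parse_misc(misc: str) -> dict:
    """Parse a CoNLL-U MISC field into a dict of key/value pairs."""
    if misc == "_":  # "_" marks an empty field in CoNLL-U
        return {}
    pairs = {}
    # "|" is the standard separator; a space is tolerated as a fallback.
    for item in re.split(r"[| ]", misc):
        if not item:
            continue
        key, _, value = item.partition("=")
        pairs[key] = value
    return pairs
```

With this, parse_misc("SpaceAfter=No|name=O") and parse_misc("SpaceAfter=No name=O") give the same result, and the NER tag can be looked up under the "name" key.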
You are correct! I'll fix this today, with correct separator and a script to convert to spacy. Thanks for noticing and notifying!
@olavski I've added the script to convert to Spacy format, please test it when you have time.
Closing this for now, feel free to re-open if something is still incorrect for usage with Spacy.
@fredrijo Great! That seems to work fine!
In case you are interested, here are the steps I use to train Norwegian Spacy models: https://github.com/web64/spacy-norwegian
Thanks!