ltgoslo/norne

New conllu format not parsable by Spacy

olavski opened this issue · 5 comments

First of all, thanks to everyone involved for making this dataset available! I've been waiting for a long time for something like this!

I'm trying to train Spacy models using this dataset, and I noticed that the conllu format changed recently: entities are now labelled as "name=B-PER" or "SpaceAfter=No name=O", which Spacy is unable to parse.

Until recently the format was like this:

1	Dommer	dommer	NOUN	_	Definite=Ind|Gender=Masc|Number=Sing	2	appos	_	O
2	Finn	Finn	PROPN	_	Gender=Masc	4	nsubj	_	B-PER
3	Eilertsen	Eilertsen	PROPN	_	_	2	name	_	I-PER
4	avstår	avstå	VERB	_	Mood=Ind|Tense=Pres|VerbForm=Fin	0	root	_	O

Now this has changed to:

1	Dommer	dommer	NOUN	_	Definite=Ind|Gender=Masc|Number=Sing	2	nmod	_	name=O
2	Finn	Finn	PROPN	_	Gender=Masc	4	nsubj	_	name=B-PER
3	Eilertsen	Eilertsen	PROPN	_	_	2	flat:name	_	name=I-PER
4	avstår	avstå	VERB	_	Mood=Ind|Tense=Pres|VerbForm=Fin	0	root	_	SpaceAfter=No name=O

When converting to Spacy's json format, entities with the prefix 'name=' are not parsed correctly. Some NER columns also contain "SpaceAfter=No", which Spacy does not parse correctly either.

Are these new changes intentional or is the current format incorrect?

At the moment it is not possible to convert the .conllu files directly to Spacy's json format, so for now I'm using this quick fix to make the .conllu files parsable by Spacy:

sed -i 's/SpaceAfter=No name=//g' norne/ud/nob/no_bokmaal-ud-dev.conllu
sed -i 's/name=//g' norne/ud/nob/no_bokmaal-ud-dev.conllu

sed -i 's/SpaceAfter=No name=//g' norne/ud/nob/no_bokmaal-ud-train.conllu
sed -i 's/name=//g' norne/ud/nob/no_bokmaal-ud-train.conllu
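
The same quick fix can be sketched in Python, touching only the 10th column so that the string "name=" is never removed from other fields. This is an illustration only: the function name simplify_misc is hypothetical, it is not the script later added to the repo, and it assumes the two MISC shapes shown above ("name=B-PER" and "SpaceAfter=No name=O").

```python
def simplify_misc(line: str) -> str:
    """Rewrite the 10th column of a CoNLL-U token line to the bare NER tag.

    Hypothetical helper: assumes the tag is stored as a space-separated
    "name=..." item in the last column, as in the examples in this issue.
    """
    # Blank sentence separators and comment lines pass through unchanged.
    if not line.strip() or line.startswith("#"):
        return line
    newline = "\n" if line.endswith("\n") else ""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        return line  # not a standard token line; leave it alone
    for item in cols[9].split():
        if item.startswith("name="):
            # Keep only the tag; other items such as SpaceAfter=No are dropped,
            # which matches what the sed commands above do.
            cols[9] = item[len("name="):]
            break
    return "\t".join(cols) + newline


if __name__ == "__main__":
    sample = "2\tFinn\tFinn\tPROPN\t_\tGender=Masc\t4\tnsubj\t_\tname=B-PER\n"
    print(simplify_misc(sample), end="")
```

Applied to every line of a .conllu file, this should give the same 10th-column values as the sed commands, while leaving the other nine columns untouched.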

Are these new changes intentional or is the current format incorrect?

Yes, the current format is intentional. It follows the specification of the CoNLL-U format's MISC column, which is a sort of key-value format: key1=value1 key2=value2.

Apologies for first publishing the data in an invalid format, and then fixing it.

I suggest we just add a script to the repo to convert the data to spacy format, e.g. conllu2spacy, and add a note to the README. Please feel free to open a PR if you have time.

Thanks for the reply!

Ok, I see. I'll write a script to convert the files to a Spacy parsable format (where the 10th field contains just the NER tag, like I-PER).

To conform with the CoNLL-U format, shouldn't the MISC values be separated by |?
So it should be SpaceAfter=No|name=O and not SpaceAfter=No name=O.
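
With the | separator, reading the MISC field becomes a simple split. A minimal Python sketch, assuming every item has the form key=value and that "_" marks an empty field; parse_misc is a hypothetical name used only for illustration:

```python
def parse_misc(misc: str) -> dict:
    """Parse a pipe-separated CoNLL-U MISC field into a dict of attributes."""
    # "_" is the CoNLL-U placeholder for an empty field.
    if misc == "_":
        return {}
    attrs = {}
    for item in misc.split("|"):
        # partition splits on the first "=" only, so values may contain "=".
        key, _, value = item.partition("=")
        attrs[key] = value
    return attrs


if __name__ == "__main__":
    print(parse_misc("SpaceAfter=No|name=O"))  # {'SpaceAfter': 'No', 'name': 'O'}
    print(parse_misc("name=B-PER")["name"])    # B-PER
```

The NER tag is then just the value stored under the name key.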

You are correct! I'll fix this today, with correct separator and a script to convert to spacy. Thanks for noticing and notifying!

@olavski I've added the script to convert to Spacy format, please test it when you have time.

Closing this for now, feel free to re-open if something is still incorrect for usage with Spacy.

@fredrijo Great! That seems to work fine!

In case you are interested, here are the steps I use to train Norwegian Spacy models: https://github.com/web64/spacy-norwegian

Thanks!