nlp-uoregon/trankit

Parse error of Italian

Issue closed · 1 comment

I used the Italian model to predict the dependency tree and obtained the following result:

1	Il	il	DET	RD	Definite=Def|Gender=Masc|Number=Sing|PronType=Art	2	det	_
2	termine	termine	NOUN	S	Gender=Masc|Number=Sing	8	nsubj:pass	_	_
3	"	"	PUNCT	FB	_	4	punct	_	_
4	Tathāgata	Tathāgata	PROPN	SP	_	2	nmod	_	_
5	"	"	PUNCT	FB	_	4	punct	_	_
6	può	potere	AUX	VM	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	8	aux	_
7	essere	essere	AUX	VA	VerbForm=Inf	8	aux:pass	_	_
8	letto	leggere	VERB	V	Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part	0	root	_
9	come	come	ADP	E	_	11	case	_	_
10	"	"	PUNCT	FB	_	11	punct	_	_
11	tathā-gata	tathā-gata	NOUN	S	Gender=Fem|Number=Sing	8	obl	_	_
12	"	"	PUNCT	FB	_	11	punct	_	_
13	o	o	CCONJ	CC	_	16	cc	_	_
14	come	come	ADP	E	_	16	case	_	_
15	"	"	PUNCT	FB	_	16	punct	_	_
16	Tathā-āgata	Tathā-āgata	PROPN	SP	_	11	conj	_	_
17	"	"	PUNCT	FB	_	16	punct	_	_
18	,	,	PUNCT	FF	_	16	punct	_	_
19	dove	dove	ADV	B	_	22	advmod	_	_
20	il	il	DET	RD	Definite=Def|Gender=Masc|Number=Sing|PronType=Art	21	det	_
21	primo	primo	ADJ	NO	Gender=Masc|Number=Sing|NumType=Ord	22	nsubj	_	_
22	significa	significare	VERB	V	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	16	acl:relcl	_	_
23	"	"	PUNCT	FB	_	25	punct	_	_
24	così	così	ADV	B	_	25	advmod	_	_
25	andato	andare	VERB	V	Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part	22	xcomp	_
26	"	"	PUNCT	FB	_	25	punct	_	_
27	mentre	mentre	CCONJ	CC	_	30	cc	_	_
28	il	il	DET	RD	Definite=Def|Gender=Masc|Number=Sing|PronType=Art	29	det	_
29	secondo	secondo	ADJ	NO	Gender=Masc|Number=Sing|NumType=Ord	30	nsubj	_	_
30	significa	significare	VERB	V	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	22	conj	_	_
31	"	"	PUNCT	FB	_	32	punct	_	_
32	così venuto	così venuto	ADV	B	_	30	advmod	_	_
33	"	"	PUNCT	FB	_	32	punct	_	_
34	.	.	PUNCT	FS	_	8	punct	_	_

I think line 32 (token ID 32) is invalid because it contains a space within a single token.

Curiously, in another sentence containing 'così venuto', these two words are treated as separate tokens:

1	Così	così	ADV	B	_	2	advmod	_	_
2	venuto	venire	VERB	V	Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part	0	root	_	_
3	/	/	PUNCT	FF	_	2	punct	_	_
4	Così	così	ADV	B	_	5	advmod	_	_
5	andato	andare	VERB	V	Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part	2	conj	_	_
6	.	.	PUNCT	FS	_	2	punct	_	_

Is this a bug? I'd appreciate it if you could investigate this issue.

Hi @gifdog97,
Thanks for reporting the issue.
Our tokenizer is a neural model trained on a text corpus, so the tokenization of the same piece of text may vary depending on its context.
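
Since the tokenization can vary with context, it may help to scan the output for tokens like the one reported above. The following is a minimal sketch, not part of trankit: the helper name `find_tokens_with_spaces` is hypothetical, and it assumes the output is tab-separated CoNLL-U text with the token ID in the first column and the word form in the second. It only flags candidates for review; whether a space inside a token is actually invalid depends on the conventions of the treebank.

```python
def find_tokens_with_spaces(conllu_text):
    """Return (token_id, form) pairs whose FORM column contains whitespace.

    Hypothetical helper for illustration; assumes tab-separated CoNLL-U rows.
    """
    flagged = []
    for line in conllu_text.splitlines():
        # Skip blank lines (sentence boundaries) and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) < 2:
            continue
        token_id, form = columns[0], columns[1]
        # Flag any form that contains a whitespace character.
        if any(ch.isspace() for ch in form):
            flagged.append((token_id, form))
    return flagged


# Two rows taken from the parse reported in this issue.
sample = (
    '31\t"\t"\tPUNCT\tFB\t_\t32\tpunct\t_\t_\n'
    "32\tcosì venuto\tcosì venuto\tADV\tB\t_\t30\tadvmod\t_\t_\n"
)
print(find_tokens_with_spaces(sample))  # [('32', 'così venuto')]
```

Flagged tokens could then be inspected by hand or split in a post-processing step, depending on what the downstream consumer expects.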