JohnSnowLabs/spark-nlp

DependencyParserApproach throws "IllegalArgumentException: For input string: "_"" when training with CONLLU dataset

Arierref46 opened this issue · 5 comments

Is there an existing issue for this?

  • I have searched the existing issues and did not find a match.

Who can help?

No response

What are you working on?

I have been trying to train a DependencyParserApproach(), but training fails with this error (IllegalArgumentException: For input string: "_") and I don't know why. I am training on the public Bosque treebank, using the file "pt_bosque-ud-train.conllu" (https://github.com/UniversalDependencies/UD_Portuguese-Bosque).

Current Behavior

The DependencyParserApproach() throws the error when the .fit() function is called.

Expected Behavior

The DependencyParserApproach() should train normally.

Steps To Reproduce

https://colab.research.google.com/drive/1wyyJfdNSfm0C-r7-h2ri0Xrmzyw6nzgT?usp=sharing

Spark NLP version and Apache Spark

spark 3.3.1
spark-nlp 5.3.2

Type of Spark Application

Python Application

Java Version

openjdk version "1.8.0_402"

Java Home Directory

No response

Setup and installation

No response

Operating System and Version

Windows 11

Link to your project (if available)

No response

Additional Information

No response

I share some links here just in case

I am not sure about that data type, but I just tested a file that looks like this:

# sent_id = weblog-juancole.com_juancole_20030911085700_ENG_20030911_085700-0022
# text = It should continue to be defanged.
1	It	it	PRON	PRP	Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs	3	nsubj	3:nsubj|6:nsubj:xsubj	_
2	should	should	AUX	MD	VerbForm=Fin	3	aux	3:aux	_
3	continue	continue	VERB	VB	VerbForm=Inf	0	root	0:root	_
4	to	to	PART	TO	_	6	mark	6:mark	_
5	be	be	AUX	VB	VerbForm=Inf	6	aux:pass	6:aux:pass	_
6	defanged	defange	VERB	VBN	Tense=Past|VerbForm=Part|Voice=Pass	3	xcomp	3:xcomp	SpaceAfter=No
7	.	.	PUNCT	.	_	3	punct	3:punct	_

# sent_id = weblog-blogspot.com_healingiraq_20040409053012_ENG_20040409_053012-0015
# text = So what happened?
1	So	so	ADV	RB	_	3	advmod	3:advmod	_
2	what	what	PRON	WP	PronType=Int	3	nsubj	3:nsubj	_
3	happened	happen	VERB	VBD	Mood=Ind|Tense=Past|VerbForm=Fin	0	root	0:root	SpaceAfter=No
4	?	?	PUNCT	.	_	3	punct	3:punct	_

# sent_id = weblog-typepad.com_ripples_20040407125600_ENG_20040407_125600-0055
# text = That too was stopped.
1	That	that	PRON	DT	Number=Sing|PronType=Dem	4	nsubj:pass	4:nsubj:pass	_
2	too	too	ADV	RB	_	4	advmod	4:advmod	_
3	was	be	AUX	VBD	Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin	4	aux:pass	4:aux:pass	_
4	stopped	stop	VERB	VBN	Tense=Past|VerbForm=Part|Voice=Pass	0	root	0:root	SpaceAfter=No
5	.	.	PUNCT	.	_	4	punct	4:punct	_

Can you try with more data? For some reason, I can run it with 3 examples from the Bosque dataset, but when I add more examples it crashes. Also, could it be related to the data being written in Portuguese?
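To narrow down which rows differ between the small sample and the full file, a scan like the one below may help. It is a minimal sketch, not a confirmed diagnosis: it assumes the exception comes from a HEAD value that is not an integer, which the thread does not verify. The CoNLL-U format does allow "_" in the HEAD column for multiword-token rows (ID written as a range such as 1-2) and for empty-node rows (ID such as 8.1). The function names are hypothetical helpers, not part of Spark NLP.

```python
from typing import Iterable, List, Tuple

# 0-based position of HEAD among the 10 tab-separated CoNLL-U columns.
HEAD_COLUMN = 6


def find_non_integer_heads(lines: Iterable[str]) -> List[Tuple[int, str, str]]:
    """Return (line_number, ID, HEAD) for token rows whose HEAD is not an integer."""
    hits = []
    for line_number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        # Skip blank sentence separators and comment lines such as "# text = ...".
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 10:
            continue  # not a well-formed token row; outside the scope of this sketch
        head = columns[HEAD_COLUMN]
        if not (head.isascii() and head.isdigit()):
            hits.append((line_number, columns[0], head))
    return hits


def scan_file(path: str) -> List[Tuple[int, str, str]]:
    """Hypothetical convenience wrapper: scan a CoNLL-U file on disk."""
    with open(path, encoding="utf-8") as handle:
        return find_non_integer_heads(handle)
```

Calling scan_file("pt_bosque-ud-train.conllu") and printing the result lists every row whose HEAD cannot be read as a number. If the list is empty, the cause is probably elsewhere.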


That's interesting! This might be a bug. There is probably a character or a token it doesn't like. In my opinion it shouldn't crash; it should just skip that row/sentence.
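Until the parser skips such rows itself, one possible workaround is to drop them before training. The sketch below is written under an assumption the thread does not confirm: that the offending rows are the ones whose ID is not a plain integer (multiword-token ranges like 1-2 and empty nodes like 8.1), which carry "_" as HEAD in CoNLL-U. keep_plain_words and write_filtered_copy are hypothetical helper names, not Spark NLP functions.

```python
import re
from typing import Iterable, Iterator

# Plain word IDs are integers; "3-4" (multiword token) and "8.1" (empty node) are not.
PLAIN_ID = re.compile(r"[0-9]+")


def keep_plain_words(lines: Iterable[str]) -> Iterator[str]:
    """Yield comments, blank lines, and token rows whose ID is a plain integer."""
    for line in lines:
        is_token_row = bool(line.strip()) and not line.startswith("#")
        if is_token_row:
            token_id = line.split("\t", 1)[0].strip()
            if not PLAIN_ID.fullmatch(token_id):
                continue  # drop range and empty-node rows
        yield line


def write_filtered_copy(source_path: str, target_path: str) -> None:
    """Write a copy of a CoNLL-U file without range or empty-node rows."""
    with open(source_path, encoding="utf-8") as src, \
            open(target_path, "w", encoding="utf-8") as dst:
        dst.writelines(keep_plain_words(src))
```

The syntactic words that a multiword token expands to stay in the file with their own integer IDs and heads, so the basic dependency tree of each sentence is untouched. Only the surface-form row is removed.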

Will assign this for further inspection.

This seems like great news! How can I install this fix?


@Arierref46 you just need to update to the latest version of spark-nlp==5.3.3
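After upgrading (for a pip-based setup, typically `pip install --upgrade spark-nlp==5.3.3`), a quick check like the following can confirm that the Python environment picked up the new version. It is a small sketch that assumes a purely numeric version string; has_fix and installed_spark_nlp_version are hypothetical helper names.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple

FIXED_IN = "5.3.3"  # version named in the reply above


def parse_release(text: str) -> Tuple[int, ...]:
    """Turn '5.3.3' into (5, 3, 3); assumes a purely numeric dotted version."""
    return tuple(int(part) for part in text.split("."))


def has_fix(installed: Optional[str], required: str = FIXED_IN) -> bool:
    """True when the installed version is at least the required one."""
    return installed is not None and parse_release(installed) >= parse_release(required)


def installed_spark_nlp_version() -> Optional[str]:
    """Return the installed spark-nlp version, or None if it is not installed."""
    try:
        return version("spark-nlp")
    except PackageNotFoundError:
        return None
```

Running has_fix(installed_spark_nlp_version()) in the same interpreter that runs the notebook shows whether the upgrade took effect there. In Colab, the runtime usually needs a restart before a newly installed version is the one that gets imported.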