DependencyParserApproach throws "IllegalArgumentException: For input string: "_"" when training with CONLLU dataset
Arierref46 opened this issue · 5 comments
Is there an existing issue for this?
- I have searched the existing issues and did not find a match.
Who can help?
No response
What are you working on?
I have been trying to train a DependencyParserApproach(), but it fails with this error (IllegalArgumentException: For input string: "_") and I don't know why. I am training on a public dataset called Bosque, using the file "pt_bosque-ud-train.conllu" (https://github.com/UniversalDependencies/UD_Portuguese-Bosque).
Current Behavior
The DependencyParserApproach() throws the error when the .fit() function is called.
Expected Behavior
The DependencyParserApproach() should train normally.
Steps To Reproduce
https://colab.research.google.com/drive/1wyyJfdNSfm0C-r7-h2ri0Xrmzyw6nzgT?usp=sharing
Spark NLP version and Apache Spark
spark 3.3.1
spark-nlp 5.3.2
Type of Spark Application
Python Application
Java Version
openjdk version "1.8.0_402"
Java Home Directory
No response
Setup and installation
No response
Operating System and Version
Windows 11
Link to your project (if available)
No response
Additional Information
No response
I'm sharing some links below, just in case.
I am not sure about that data type, but I just tested a file that looks like this:
# sent_id = weblog-juancole.com_juancole_20030911085700_ENG_20030911_085700-0022
# text = It should continue to be defanged.
1 It it PRON PRP Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs 3 nsubj 3:nsubj|6:nsubj:xsubj _
2 should should AUX MD VerbForm=Fin 3 aux 3:aux _
3 continue continue VERB VB VerbForm=Inf 0 root 0:root _
4 to to PART TO _ 6 mark 6:mark _
5 be be AUX VB VerbForm=Inf 6 aux:pass 6:aux:pass _
6 defanged defange VERB VBN Tense=Past|VerbForm=Part|Voice=Pass 3 xcomp 3:xcomp SpaceAfter=No
7 . . PUNCT . _ 3 punct 3:punct _
# sent_id = weblog-blogspot.com_healingiraq_20040409053012_ENG_20040409_053012-0015
# text = So what happened?
1 So so ADV RB _ 3 advmod 3:advmod _
2 what what PRON WP PronType=Int 3 nsubj 3:nsubj _
3 happened happen VERB VBD Mood=Ind|Tense=Past|VerbForm=Fin 0 root 0:root SpaceAfter=No
4 ? ? PUNCT . _ 3 punct 3:punct _
# sent_id = weblog-typepad.com_ripples_20040407125600_ENG_20040407_125600-0055
# text = That too was stopped.
1 That that PRON DT Number=Sing|PronType=Dem 4 nsubj:pass 4:nsubj:pass _
2 too too ADV RB _ 4 advmod 4:advmod _
3 was be AUX VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 4 aux:pass 4:aux:pass _
4 stopped stop VERB VBN Tense=Past|VerbForm=Part|Voice=Pass 0 root 0:root SpaceAfter=No
5 . . PUNCT . _ 4 punct 4:punct _
- General example: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/61cb48470ad75c7f33cb771a6a711a253ace62ee/Spark_NLP_Udemy_MOOC/Open_Source/12.01.DependencyParser_TypedDependencyParser.ipynb
- Docs for Dep parser: https://sparknlp.org/api/python/reference/autosummary/sparknlp/annotator/dependency/dependency_parser/index.html#sparknlp.annotator.dependency.dependency_parser.DependencyParserApproach.setConllU
Can you try with more data? For some reason, I can run it with 3 examples from the Bosque dataset, but when I add more examples it crashes. Also, could this be related to the data being written in Portuguese?
That's interesting! This might be a bug. There is probably a character or a token it doesn't like. In my opinion it shouldn't crash; it should just skip that row/sentence.
I will assign this for further inspection.
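For anyone who wants to look for the offending rows in the meantime, here is a minimal sketch in plain Python (no Spark needed). It is based on an assumption that is not confirmed in this thread: that the "_" in the exception comes from the HEAD column (the 7th tab-separated field), which the CoNLL-U format leaves as "_" on multiword-token rows such as "1-2". The function name find_non_integer_heads and the sample sentence are made up for illustration.

```python
def find_non_integer_heads(lines):
    """Return (line_number, token_id, head) for rows whose HEAD is not an integer.

    Assumes CoNLL-U layout: 10 tab-separated columns, HEAD in column 7.
    """
    problems = []
    for line_number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        # Skip sentence separators and comment lines such as "# text = ...".
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.split("\t")
        # Rows with fewer than 7 columns have no HEAD field at all.
        head = columns[6] if len(columns) > 6 else None
        if head is None or not head.isdigit():
            problems.append((line_number, columns[0], head))
    return problems


# Hypothetical sample: a Portuguese contraction written as a multiword token.
sample = [
    "# text = do livro",
    "1-2\tdo\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tde\tde\tADP\t_\t_\t3\tcase\t_\t_",
    "2\to\to\tDET\t_\t_\t3\tdet\t_\t_",
    "3\tlivro\tlivro\tNOUN\t_\t_\t0\troot\t_\t_",
]

for line_number, token_id, head in find_non_integer_heads(sample):
    print(f"line {line_number}: token {token_id} has HEAD {head!r}")
```

To check a real file, pass an open file handle instead of the sample list, for example open("pt_bosque-ud-train.conllu", encoding="utf-8"). If rows are reported, that only shows where a non-integer HEAD occurs; it does not prove that this is what the parser trips on.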
This seems like great news! How can I install this fix?
@Arierref46 you just need to update to the latest version: spark-nlp==5.3.3