JohnSnowLabs/spark-nlp-workshop

1.SparkNLP_Basics.ipynb breaks, PretrainedPipeline() java error

moseswmwong opened this issue · 6 comments

I opened "1.SparkNLP_Basics.ipynb" via its "Open in Colab" button and ran it on Colab. It fails at Code Cell 9, "pipeline = PretrainedPipeline('explain_document_ml', lang='en')", when a Java call raises an error with the message - Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize.
: java.lang.NoClassDefFoundError: org/json4s/package$MappingException

Description

The error should be very easy to reproduce; see the steps below.

Steps to Reproduce

  1. Go to
    https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings/Public
  2. Click on "1.SparkNLP_Basics.ipynb"
  3. Right-click on "Open in Colab" to open the notebook in a new Colab browser tab
  4. Click "Copy to Drive" to create a copy in my own Google Colab account
  5. Set the runtime to use a GPU
  6. Run all

Here is the error at Cell 9, "pipeline = PretrainedPipeline('explain_document_ml', lang='en')":

Error Messages

explain_document_ml download started this may take some time.

Py4JJavaError Traceback (most recent call last)
in ()
----> 1 pipeline = PretrainedPipeline('explain_document_ml', lang='en')

8 frames
/usr/local/lib/python3.7/dist-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(

Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize.
: java.lang.NoClassDefFoundError: org/json4s/package$MappingException
at org.json4s.ext.EnumNameSerializer.deserialize(EnumSerializer.scala:53)
at org.json4s.Formats$$anonfun$customDeserializer$1.applyOrElse(Formats.scala:66)
at org.json4s.Formats$$anonfun$customDeserializer$1.applyOrElse(Formats.scala:66)
at scala.collection.TraversableOnce.collectFirst(TraversableOnce.scala:180)
at scala.collection.TraversableOnce.collectFirst$(TraversableOnce.scala:167)
at scala.collection.AbstractTraversable.collectFirst(Traversable.scala:108)
at org.json4s.Formats$.customDeserializer(Formats.scala:66)
at org.json4s.Extraction$.customOrElse(Extraction.scala:775)
at org.json4s.Extraction$.extract(Extraction.scala:454)
at org.json4s.Extraction$.extract(Extraction.scala:56)
at org.json4s.ExtractableJsonAstNode.extract(ExtractableJsonAstNode.scala:22)
at com.johnsnowlabs.util.JsonParser$.parseObject(JsonParser.scala:28)
at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.parseJson(ResourceMetadata.scala:79)
at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$$anonfun$readResources$1.applyOrElse(ResourceMetadata.scala:107)
at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$$anonfun$readResources$1.applyOrElse(ResourceMetadata.scala:106)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
at scala.collection.Iterator$$anon$13.next(Iterator.scala:593)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:184)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:47)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
at scala.collection.AbstractIterator.to(Iterator.scala:1431)
at scala.collection.TraversableOnce.toList(TraversableOnce.scala:350)
at scala.collection.TraversableOnce.toList$(TraversableOnce.scala:350)
at scala.collection.AbstractIterator.toList(Iterator.scala:1431)
at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.readResources(ResourceMetadata.scala:106)
at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.readResources(ResourceMetadata.scala:101)
at com.johnsnowlabs.client.aws.AWSGateway.getMetadata(AWSGateway.scala:78)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.downloadMetadataIfNeed(S3ResourceDownloader.scala:62)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.resolveLink(S3ResourceDownloader.scala:68)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.getDownloadSize(S3ResourceDownloader.scala:145)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.getDownloadSize(ResourceDownloader.scala:444)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.getDownloadSize(ResourceDownloader.scala:572)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize(ResourceDownloader.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.ClassNotFoundException: org.json4s.package$MappingException
at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
... 51 more

Furthermore

I then tried Ubuntu 18.04 with Java (OpenJDK) 8, Spark NLP 3.3.1, and Apache Spark 3.2.0 with GPU/CUDA, downloaded the Jupyter notebook to that machine, and hit exactly the same problem. Switching to Java (OpenJDK) 11 made no difference either. Note that I added "! java -version" to the Colab notebook to check the environment and found that Colab is using Java 11 instead of the recommended Java 8.
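The Java version check mentioned above can be made programmatic in a setup cell. Here is a minimal sketch (my own helper, not part of Spark NLP; the sample strings are illustrative) that parses the first line of "java -version" output, accounting for the legacy "1.8" numbering used for Java 8:

```python
# Sketch: extract the Java major version from the first line of
# `java -version` output, so a setup cell can warn when the runtime
# is not on the recommended Java 8.
import re

def java_major_version(version_line: str) -> int:
    # e.g. 'openjdk version "11.0.11" ...' -> 11, '... "1.8.0_292"' -> 8
    m = re.search(r'"(\d+)\.(\d+)', version_line)
    major, minor = int(m.group(1)), int(m.group(2))
    return minor if major == 1 else major  # "1.x" style means Java x

print(java_major_version('openjdk version "11.0.11" 2021-04-20'))  # 11
print(java_major_version('openjdk version "1.8.0_292"'))           # 8
```

In a notebook, the input line could come from running "java -version" with subprocess (it writes to stderr, not stdout).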

https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/1.SparkNLP_Basics.ipynb

Your Environment

Colab

  • Spark-NLP version: 3.3.1
  • Apache Spark version: 3.2.0
  • Operating System and version: Colab
  • Deployment (Docker, Jupyter, Scala, pip, conda, etc.): Jupyter

Linux

  • Spark-NLP version: 3.3.1
  • Apache Spark version: 3.2.0
  • Operating System and version: Ubuntu 18.04 64 bits, Anaconda, Python 3.8
  • Deployment (Docker, Jupyter, Scala, pip, conda, etc.): Jupyter

Note

I notice that cell 4, "! cd ~/.ivy2/cache/com.johnsnowlabs.nlp/spark-nlp_2.12/jars && ls -lt", now outputs spark-nlp_2.12-3.3.1.jar, while it used to output spark-nlp_2.12-3.3.0.jar.

I have a pending exam for the Spark NLP for Data Scientist and Spark NLP for Healthcare Data Scientist trainings; your prompt response and resolution are greatly appreciated.

I'm also getting this error when trying to load any pretrained model in this notebook on Colab. Based on this issue, I installed OpenJDK 8 and switched the Java version to 8 on Colab, but the error persists.

Actually, I see pyspark was updated on PyPI today, which may be causing a Spark incompatibility. If you install Spark 3.1.2 this error goes away; just change the install line to !pip install pyspark==3.1.2.

pyspark 3.1.2 fixed this problem, thanks!

I am running into the same issue. What other options are there if I cannot downgrade to pyspark 3.1.2?

This comment on a related Databricks issue solved the problem for me:
JohnSnowLabs/spark-nlp#6772 (comment)

I had to use spark-nlp-spark32.
https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-spark32_2.12/3.4.4

So with my SparkSession configured as below, this error went away.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("spark-nlp-spark32")
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.4")
    .getOrCreate()
)
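The workaround above boils down to matching the Spark NLP artifact to the Spark minor version. A small hypothetical helper to capture that choice (the artifact names are taken from the comments above; the function itself is illustrative, not part of Spark NLP):

```python
# Illustrative helper: pick the Spark NLP Maven coordinate for a Spark version.
# Naming scheme assumed from the thread: the default spark-nlp_2.12 build for
# Spark 3.0.x/3.1.x, and spark-nlp-spark32_2.12 for Spark 3.2.x.

def spark_nlp_coordinate(spark_version: str, nlp_version: str = "3.4.4") -> str:
    major, minor = (int(p) for p in spark_version.split(".")[:2])
    if (major, minor) >= (3, 2):
        artifact = "spark-nlp-spark32_2.12"  # dedicated Spark 3.2 build
    else:
        artifact = "spark-nlp_2.12"          # default build
    return f"com.johnsnowlabs.nlp:{artifact}:{nlp_version}"

print(spark_nlp_coordinate("3.2.0"))  # com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.4
```

The returned coordinate would then go into spark.jars.packages, as in the SparkSession snippet above.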

However, now I am getting the following error, which I'll post to a separate issue.

An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel. :
org.apache.spark.SparkException: 
Job aborted due to stage failure: 
Task 0 in stage 0.0 failed 4 times, most recent failure: 
Lost task 0.3 in stage 0.0 (TID 3) (1.1.1.1 executor 4): 
java.io.FileNotFoundException: File 
file:/home/user1/cache_pretrained/spellcheck_norvig_en_3.1.3_3.0_1631046343759/metadata/part-00000 does not exist

The 2nd error I ran into already has a post:
JohnSnowLabs/spark-nlp#6863