JohnSnowLabs/spark-nlp-workshop

1.SparkNLP_Basics.ipynb breaks, PretrainedPipeline() java error

moseswmwong opened this issue · 6 comments

I opened "1.SparkNLP_Basics.ipynb" via its "Open in Colab" button and ran it on Colab. It fails at Code Cell 9, "pipeline = PretrainedPipeline('explain_document_ml', lang='en')", when a Java call raises an error with the message - Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize.
: java.lang.NoClassDefFoundError: org/json4s/package$MappingException

Description

The error should be very easy to reproduce; see the steps below.

Steps to Reproduce

  1. Go to
    https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings/Public
  2. Click on "1.SparkNLP_Basics.ipynb"
  3. Right-click on "Open in Colab" to open the notebook in a new Colab browser tab
  4. Click "Copy to Drive" to create a copy in my own Google Colab account
  5. Set the runtime to use a GPU
  6. Run all

Here is the error at Cell 9, "pipeline = PretrainedPipeline('explain_document_ml', lang='en')":

Error Messages

explain_document_ml download started this may take some time.

Py4JJavaError Traceback (most recent call last)
in ()
----> 1 pipeline = PretrainedPipeline('explain_document_ml', lang='en')

8 frames
/usr/local/lib/python3.7/dist-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(

Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize.
: java.lang.NoClassDefFoundError: org/json4s/package$MappingException
at org.json4s.ext.EnumNameSerializer.deserialize(EnumSerializer.scala:53)
at org.json4s.Formats$$anonfun$customDeserializer$1.applyOrElse(Formats.scala:66)
at org.json4s.Formats$$anonfun$customDeserializer$1.applyOrElse(Formats.scala:66)
at scala.collection.TraversableOnce.collectFirst(TraversableOnce.scala:180)
at scala.collection.TraversableOnce.collectFirst$(TraversableOnce.scala:167)
at scala.collection.AbstractTraversable.collectFirst(Traversable.scala:108)
at org.json4s.Formats$.customDeserializer(Formats.scala:66)
at org.json4s.Extraction$.customOrElse(Extraction.scala:775)
at org.json4s.Extraction$.extract(Extraction.scala:454)
at org.json4s.Extraction$.extract(Extraction.scala:56)
at org.json4s.ExtractableJsonAstNode.extract(ExtractableJsonAstNode.scala:22)
at com.johnsnowlabs.util.JsonParser$.parseObject(JsonParser.scala:28)
at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.parseJson(ResourceMetadata.scala:79)
at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$$anonfun$readResources$1.applyOrElse(ResourceMetadata.scala:107)
at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$$anonfun$readResources$1.applyOrElse(ResourceMetadata.scala:106)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
at scala.collection.Iterator$$anon$13.next(Iterator.scala:593)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:184)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:47)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
at scala.collection.AbstractIterator.to(Iterator.scala:1431)
at scala.collection.TraversableOnce.toList(TraversableOnce.scala:350)
at scala.collection.TraversableOnce.toList$(TraversableOnce.scala:350)
at scala.collection.AbstractIterator.toList(Iterator.scala:1431)
at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.readResources(ResourceMetadata.scala:106)
at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.readResources(ResourceMetadata.scala:101)
at com.johnsnowlabs.client.aws.AWSGateway.getMetadata(AWSGateway.scala:78)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.downloadMetadataIfNeed(S3ResourceDownloader.scala:62)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.resolveLink(S3ResourceDownloader.scala:68)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.getDownloadSize(S3ResourceDownloader.scala:145)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.getDownloadSize(ResourceDownloader.scala:444)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.getDownloadSize(ResourceDownloader.scala:572)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize(ResourceDownloader.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.ClassNotFoundException: org.json4s.package$MappingException
at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
... 51 more

Furthermore

I then tried Ubuntu 18.04 with Java (OpenJDK) 8, Spark NLP 3.3.1, and Apache Spark 3.2.0 with GPU/CUDA, downloaded the Jupyter notebook to that machine, and hit exactly the same problem. Switching to Java (OpenJDK) 11 made no difference either. Note that I added "! java -version" to the Colab notebook to check the environment and found that Colab is using Java 11 instead of the recommended Java 8.
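The Java version check mentioned above can be made programmatic in a setup cell. Here is a minimal sketch (my own helper, not part of Spark NLP; the sample strings are illustrative) that parses the first line of "java -version" output, accounting for the legacy "1.8" numbering used for Java 8:

```python
# Sketch: extract the Java major version from the first line of
# `java -version` output, so a setup cell can warn when the runtime
# is not on the recommended Java 8.
import re

def java_major_version(version_line: str) -> int:
    # e.g. 'openjdk version "11.0.11" ...' -> 11, '... "1.8.0_292"' -> 8
    m = re.search(r'"(\d+)\.(\d+)', version_line)
    major, minor = int(m.group(1)), int(m.group(2))
    return minor if major == 1 else major  # "1.x" style means Java x

print(java_major_version('openjdk version "11.0.11" 2021-04-20'))  # 11
print(java_major_version('openjdk version "1.8.0_292"'))           # 8
```

In a notebook, the input line could come from running "java -version" with subprocess (it writes to stderr, not stdout).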

https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/1.SparkNLP_Basics.ipynb

Your Environment

Colab

  • Spark-NLP version: 3.3.1
  • Apache Spark version: 3.2.0
  • Operating System and version: Colab
  • Deployment (Docker, Jupyter, Scala, pip, conda, etc.): Jupyter

Linux

  • Spark-NLP version: 3.3.1
  • Apache Spark version: 3.2.0
  • Operating System and version: Ubuntu 18.04 64 bits, Anaconda, Python 3.8
  • Deployment (Docker, Jupyter, Scala, pip, conda, etc.): Jupyter

Note

I notice that cell 4, "! cd ~/.ivy2/cache/com.johnsnowlabs.nlp/spark-nlp_2.12/jars && ls -lt", now outputs spark-nlp_2.12-3.3.1.jar, while it used to output spark-nlp_2.12-3.3.0.jar.

I have a pending exam for the Spark NLP for Data Scientist and Spark NLP for Healthcare Data Scientist trainings; your prompt response and resolution are greatly appreciated.

I'm also getting this error when trying to load any pretrained model in this notebook on Colab. Based on this issue, I installed OpenJDK 8 and switched the Java version to 8 on Colab, but the error persists.

Actually, I see pyspark was updated on PyPI today, which may be causing a Spark incompatibility. If you install Spark 3.1.2 this error goes away; just change the install line to !pip install pyspark==3.1.2.

pyspark 3.1.2 fixed this problem, thanks!

I am running into the same issue. What other options are there if I cannot downgrade to pyspark 3.1.2?

This comment on a related Databricks issue solved the problem for me:
JohnSnowLabs/spark-nlp#6772 (comment)

I had to use spark-nlp-spark32.
https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-spark32_2.12/3.4.4

So with my SparkSession configured as below, this error went away.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("spark-nlp-spark32")
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.4")
    .getOrCreate()
)
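The workaround above boils down to matching the Spark NLP artifact to the Spark minor version. A small hypothetical helper to capture that choice (the artifact names are taken from the comments above; the function itself is illustrative, not part of Spark NLP):

```python
# Illustrative helper: pick the Spark NLP Maven coordinate for a Spark version.
# Naming scheme assumed from the thread: the default spark-nlp_2.12 build for
# Spark 3.0.x/3.1.x, and spark-nlp-spark32_2.12 for Spark 3.2.x.

def spark_nlp_coordinate(spark_version: str, nlp_version: str = "3.4.4") -> str:
    major, minor = (int(p) for p in spark_version.split(".")[:2])
    if (major, minor) >= (3, 2):
        artifact = "spark-nlp-spark32_2.12"  # dedicated Spark 3.2 build
    else:
        artifact = "spark-nlp_2.12"          # default build
    return f"com.johnsnowlabs.nlp:{artifact}:{nlp_version}"

print(spark_nlp_coordinate("3.2.0"))  # com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.4
```

The returned coordinate would then go into spark.jars.packages, as in the SparkSession snippet above.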

However, now I am getting the following error, which I'll post to a separate issue.

An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel. :
org.apache.spark.SparkException: 
Job aborted due to stage failure: 
Task 0 in stage 0.0 failed 4 times, most recent failure: 
Lost task 0.3 in stage 0.0 (TID 3) (1.1.1.1 executor 4): 
java.io.FileNotFoundException: File 
file:/home/user1/cache_pretrained/spellcheck_norvig_en_3.1.3_3.0_1631046343759/metadata/part-00000 does not exist

The 2nd error I ran into already has a post:
JohnSnowLabs/spark-nlp#6863