JohnSnowLabs/spark-nlp-workshop

The OCR example does not match the spark-nlp version in the Docker image (spark-nlp==2.0.3)

Chertushkin opened this issue · 9 comments

The function sparknlp.start_with_ocr() does not exist in spark-nlp==2.0.3

Hello, thanks for the awesome repository. I am trying to work through the "explain-document-dl" example and I run into two blocking issues.

Steps to Reproduce

  1. Pull and run the Docker image.

  2. Run notebook https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL%20with%20OCR.ipynb

  3. Run cell #1. You will get an exception saying that the sparknlp.start_with_ocr() method does not exist.
    [screenshot of the exception]

  4. If you change it to a plain sparknlp.start(), you can proceed, but then in cell #4 you will get another exception from OcrHelper(). It looks like OcrHelper is static in the new version of spark-nlp, whereas it was previously instantiated.
    [screenshot of the exception]

  5. I also found that the new version can launch OCR with sparknlp.start(include_ocr=True). However, after some time it still crashes and does not work. (A sketch of all three calls follows this list.)
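For reference, here is a minimal sketch of the three calls described in the steps above, all on spark-nlp==2.0.3; nothing here beyond what the steps mention:

```python
import sparknlp

# Step 3: what the notebook calls -- fails on spark-nlp==2.0.3,
# because the method no longer exists:
# spark = sparknlp.start_with_ocr()

# Step 4: a plain start() brings the session up, but OcrHelper() later
# raises an exception (it appears to be static in the new version):
spark = sparknlp.start()

# Step 5: this variant is supposed to pull in OCR support, but it still
# crashes after some time:
# spark = sparknlp.start(include_ocr=True)
```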

Your Environment

  • Spark-NLP version: 2.0.3 and 2.0.1
  • Apache Spark version: 2.4.1
  • Operating System and version: three configurations: Ubuntu 16.04, Ubuntu 18.04, and Docker
  • Deployment (Docker, Jupyter, Scala, pip, conda, etc.): I tried your Docker container, which installed spark-nlp==2.0.3. I also manually installed spark-nlp==2.0.1 and 2.0.3 on my own Ubuntu 16.04 host, and both versions on Ubuntu 18.04. It does not work on any of these three configurations.

Update: I included the JAR com.johnsnowlabs.nlp:spark-nlp-ocr_2.11:2.0.1 in the Spark session, and that resolved the second problem: OcrHelper() now works. The first problem (with include_ocr=True) is still there, though. Let me know if you have any questions.
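For anyone hitting the same thing, this is roughly how I attached the JAR; a sketch using the standard PySpark session builder, with the package coordinates from above:

```python
from pyspark.sql import SparkSession

# Build the session manually instead of via sparknlp.start(), pulling both
# the core and the OCR packages from Maven (coordinates as mentioned above).
spark = SparkSession.builder \
    .appName("spark-nlp-with-ocr") \
    .config("spark.jars.packages",
            "com.johnsnowlabs.nlp:spark-nlp_2.11:2.0.1,"
            "com.johnsnowlabs.nlp:spark-nlp-ocr_2.11:2.0.1") \
    .getOrCreate()
```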

Hi @Chertushkin,
The problem is that we made a mistake by not pinning the spark-nlp and PySpark versions in the Dockerfile! With the new release, the build now downloads the newest version, which has a different API.

I am going to pin spark-nlp to 2.0.3 and PySpark to 2.4.0, and update the notebooks to the new API.
Also, sparknlp.start(include_ocr=True) is supposed to fetch the OCR JAR automatically.

I'll keep you updated, and thanks for reporting this issue.

OK, I have pinned the Spark NLP and PySpark versions, and updated the two notebooks that use OCR.

As you can see, it is very simple: if you are using Spark NLP 2.0.3 and PySpark 2.4.0, all you need to do is sparknlp.start(include_ocr=True). However, this only works in 2.0.3.
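For completeness, here is a minimal sketch of the new setup (assuming spark-nlp==2.0.3 and pyspark==2.4.0 are installed, as in the updated Docker image):

```python
import sparknlp

# On Spark NLP 2.0.3 / PySpark 2.4.0 this single call starts the session
# and is supposed to fetch the OCR JAR automatically -- no manual
# spark.jars.packages configuration needed.
spark = sparknlp.start(include_ocr=True)
print("Spark version:", spark.version)
```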

If you wait 30 minutes to an hour, Docker Hub will build the latest image and you should be able to pull and run it with the new changes. (It's best to run docker system prune first to remove unused resources.)

Amazing! Okay, I am going to remove the Docker image completely and pull it again in the next hour. Thanks for the fast turnaround :)

Hello again. I have pulled the latest Docker image and launched the very same "Explain Document DL with OCR" example.

I now get the same exception that I previously had with start(include_ocr=True):

[screenshot of the exception]

I am using the latest image, which you built 7 hours ago. Here is my docker images output:

[screenshot of the docker images output]

Could you please check on your side?

Due to an error in the Docker auto-build, the image was only built roughly 7 hours ago, so I am guessing yours is the latest but not the one I pushed. To avoid this, I have just started a build for a tag called 2.0.3; this way we can be sure you have the right image.
I'll let you know when it's done, in about 20 minutes.

Could you please try with the latest one more time? Removing all the images would also help.

Hello, I have tried it and it still does not work. I completely removed all Docker images and pulled the latest one.

I am attaching the stack trace from the crash; I hope it helps. Thanks in advance.
[screenshot: stack trace]

Fixed in the 2.1.0 release.