databricks/spark-corenlp

Implementation takes really long to produce output

raviranjan-innoplexus opened this issue · 1 comment

Hi, I am using this library but am getting extremely slow results. For 10k records of text, it has taken longer than 16 hours to process 160 tasks out of 1920 after repartitioning. I am wondering whether the name extraction runs in parallel, or whether the executors queue one after another for named entity recognition. Non-parallel Python scripts seem to run faster than this. Any suggestions or workarounds would be highly appreciated.
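To put the reported numbers in perspective, here is a rough back-of-the-envelope sketch. It only extrapolates from the figures quoted above, and it assumes that all tasks cost about the same, that the rate stays constant, and that the 10k records are spread evenly over the 1920 partitions. None of that is confirmed in this thread, and the function name `project_runtime` is made up for illustration.

```python
# Figures quoted in the report above. "Longer than 16 hours" means these
# are lower bounds on the real elapsed time.
ELAPSED_HOURS = 16
TASKS_DONE = 160
TASKS_TOTAL = 1920
RECORDS_TOTAL = 10_000


def project_runtime(elapsed_hours, tasks_done, tasks_total, records_total):
    """Extrapolate total runtime, assuming uniform tasks and a constant rate."""
    tasks_per_hour = tasks_done / elapsed_hours
    projected_total_hours = tasks_total / tasks_per_hour
    # Assumption: records are spread evenly across partitions (one task each).
    records_done = records_total * tasks_done / tasks_total
    seconds_per_record = elapsed_hours * 3600 / records_done
    return tasks_per_hour, projected_total_hours, seconds_per_record


if __name__ == "__main__":
    rate, total_hours, sec_per_record = project_runtime(
        ELAPSED_HOURS, TASKS_DONE, TASKS_TOTAL, RECORDS_TOTAL
    )
    print(f"{rate:.1f} tasks/hour")
    print(f"{total_hours:.0f} hours projected for all {TASKS_TOTAL} tasks")
    print(f"{sec_per_record:.2f} s of wall-clock time per record")
```

Under those assumptions the job advances at about 10 tasks per hour, would need roughly 192 hours (8 days) in total, and spends about 69 seconds of wall-clock time per record across the whole cluster.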

I'm experiencing the same extreme slowness when performing a benchmark against NLTK (Vader) and Spark-core (JohnSnow).

For 1 million rows of sentiment analysis:

  • Spark-Core NLP (JohnSnow 1.6.3) finishes the job in 4 min 30 secs.
  • NLTK (Vader) NLP finishes the job in 6 min 30 secs.
  • Stanford-Core NLP never finishes the job; it takes more than 1 hour.
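As a rough illustration, the timings above can be converted into rows per second. This sketch uses only the numbers from this comment. The Stanford figure is an upper bound on throughput, because the job was still running after 1 hour. The helper name `rows_per_second` is hypothetical.

```python
ROWS = 1_000_000

# Wall-clock times reported in the comment above, in seconds.
reported_seconds = {
    "Spark-Core NLP (JohnSnow 1.6.3)": 4 * 60 + 30,
    "NLTK (Vader)": 6 * 60 + 30,
}

# Stanford-Core NLP did not finish; 1 hour is only a lower bound on its runtime.
STANFORD_MIN_SECONDS = 60 * 60


def rows_per_second(rows, seconds):
    """Average throughput over the whole run."""
    return rows / seconds


if __name__ == "__main__":
    for name, seconds in reported_seconds.items():
        print(f"{name}: {rows_per_second(ROWS, seconds):.0f} rows/s")
    # Upper bound only: the real throughput is lower, since the job never finished.
    upper = rows_per_second(ROWS, STANFORD_MIN_SECONDS)
    print(f"Stanford-Core NLP: fewer than {upper:.0f} rows/s")
    fastest = min(reported_seconds.values())
    print(f"At least {STANFORD_MIN_SECONDS / fastest:.1f}x slower than the fastest run")
```

By this measure the two finished runs handle roughly 3,700 and 2,600 rows per second, while the Stanford run stays below about 280 rows per second, which makes it at least 13 times slower than the fastest one.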