databricks/spark-corenlp

Example Program Issue

anuj-malhotra opened this issue · 3 comments

Hi,
I am trying to run the example program below with Spark 1.6 and Java 1.8.0_60:
import org.apache.spark.sql.functions._
import com.databricks.spark.corenlp.functions._
import sqlContext.implicits._

val input = Seq(
(1, "Stanford University is located in California. It is a great university.")
).toDF("id", "text")

val output = input
.select(cleanxml('text).as('doc))
.select(explode(ssplit('doc)).as('sen))
.select('sen, tokenize('sen).as('words), ner('sen).as('nerTags), sentiment('sen).as('sentiment))

It throws an exception when assigning the output variable; the error is:

error: bad symbolic reference. A signature in functions.class refers to type UserDefinedFunction
in package org.apache.spark.sql.expressions which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling functions.class.
:36: error: org.apache.spark.sql.expressions.UserDefinedFunction does not take parameters
val output = input.select(cleanxml('text).as('doc)).select(explode(ssplit('doc)).as('sen)).select('sen, tokenize('sen).as('words), ner('sen).as('nerTags), sentiment('sen).as('sentiment))

Can you please advise where I am making a mistake?

@mengxr - Could you please help with what I might be doing wrong in the above code?

Initially I am trying this in spark-shell. I started spark-shell using the command below:
JAVA_HOME=/usr/java/jdk1.8.0_60/ spark-shell --packages databricks:spark-corenlp:0.2.0-s_2.10,edu.stanford.nlp:stanford-corenlp:3.6.0

I also tried this piece of code:

CoreNLP coreNLP = new CoreNLP()
    .setInputCol("text")
    .setAnnotators(new String[]{"tokenize", "ssplit", "lemma"})
    .setFlattenNestedFields(new String[]{"sentence_token_word"})
    .setOutputCol("parsed")
val outputDF = coreNLP.transform(input)

This doesn't work either, as Spark isn't able to locate CoreNLP (the error is [error: not found: type CoreNLP]). Could you advise which extra library I need to add, or what correction the code needs?

@anuj-malhotra You need to pass the CoreNLP models jar file to Spark:

spark-shell --jars lib/stanford-corenlp/stanford-corenlp-3.6.0-models.jar \
    --packages databricks:spark-corenlp:0.2.0-s_2.11,edu.stanford.nlp:stanford-corenlp:3.6.0

Worked with Spark 2.0.0 and Scala 2.11

You would probably need an earlier version than databricks:spark-corenlp:0.2.0-s_2.11 to support Spark 1.6. (PS: You can't run Java code in spark-shell, but you can run it with spark-submit once compiled)
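If you're not sure which Spark and Scala versions your shell is actually running (and therefore whether you need the _2.10 or _2.11 package suffix), something like this in spark-shell should confirm it:

sc.version                           // Spark version, e.g. 2.0.0
scala.util.Properties.versionString  // Scala version, e.g. version 2.11.8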

lucy3 commented

I had this error, too. I ended up just copying each udf I wanted to use into my code (with the appropriate import statements).

import java.util.Properties

import scala.collection.JavaConverters._

import edu.stanford.nlp.ling.{CoreAnnotations, CoreLabel}
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations
import edu.stanford.nlp.pipeline.{CleanXmlAnnotator, StanfordCoreNLP}
import edu.stanford.nlp.pipeline.CoreNLPProtos.Sentiment
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations
import edu.stanford.nlp.simple.{Document, Sentence}
import edu.stanford.nlp.util.Quadruple
import edu.stanford.nlp.trees.Tree

import org.apache.spark.sql.functions._
import org.apache.spark.SparkContext
import sqlContext.implicits._

// Sentence-splitter udf copied from spark-corenlp: uses the CoreNLP Simple API
// to split a document string into one string per sentence
def ssplit = udf { document: String =>
    new Document(document).sentences().asScala.map(_.text())
}

val input = Seq(
    (1, "Pies are delicious. Pi day is March 14.")
).toDF("id", "text")

val output = input.select(col("text"), explode(ssplit(col("text"))).as("sent"))

output.show()

I ran it using the spark-shell command:

spark-shell --master yarn --packages databricks:spark-corenlp:0.2.0-s_2.11 --jars lib/stanford-corenlp-3.9.1-models.jar 

where "lib" can be replaced with where ever your model jar resides.