Stanford CoreNLP wrapper for Apache Spark
This package wraps Stanford CoreNLP annotators as Spark DataFrame functions following the simple APIs introduced in Stanford CoreNLP 3.7.0.
This package requires Java 8 and CoreNLP 3.7.0 to run. Users must include CoreNLP model jars as dependencies to use language models.
All functions are defined under com.databricks.spark.corenlp.functions.
cleanxml: Cleans XML tags in a document and returns the cleaned document.
tokenize: Tokenizes a sentence into words.
ssplit: Splits a document into sentences.
pos: Generates the part-of-speech tags of the sentence.
lemma: Generates the word lemmas of the sentence.
ner: Generates the named entity tags of the sentence.
parse: Generates the constituency parse of the sentence as a String in Penn Treebank style.
depparse: Generates the dependency graph of the sentence and returns a flattened list of (source, sourceIndex, relation, target, targetIndex, weight) relation tuples.
coref: Generates the coref chains in the document and returns a list of (rep, mentions) chain tuples, where mentions are in the format of (sentNum, startIndex, mention).
natlog: Generates the Natural Logic notion of polarity for each token in a sentence, returned as up, down, or flat.
openie: Generates a list of Open IE triples as flat (subject, relation, target, confidence) tuples.
sentiment: Measures the sentiment of an input sentence on a scale of 0 (strong negative) to 4 (strong positive).
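The sentiment function returns a bare integer, so downstream code often wants a readable label. The sketch below is plain Scala with no Spark dependency. The object name SentimentLabels and the three middle labels are assumptions made for illustration; only the two endpoints (0 and 4) are described above.

```scala
object SentimentLabels {
  // Index i holds the label for score i. Only the endpoints come from the
  // description above; "negative", "neutral" and "positive" are assumed names.
  private val labels = Vector(
    "strong negative", "negative", "neutral", "positive", "strong positive")

  /** Returns the label for a score in 0..4, or None for any other value. */
  def label(score: Int): Option[String] = labels.lift(score)
}
```

Returning an Option keeps out-of-range scores explicit instead of throwing, which is usually easier to handle if the function is later wrapped for use on a DataFrame column.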
Users can chain the functions to create a pipeline, for example:
import org.apache.spark.sql.functions._
import com.databricks.spark.corenlp.functions._
import sqlContext.implicits._

val input = Seq(
  (1, "<xml>Stanford University is located in California. It is a great university.</xml>")
).toDF("id", "text")

val output = input
  .select(cleanxml('text).as('doc))
  .select(explode(ssplit('doc)).as('sen))
  .select('sen, tokenize('sen).as('words), ner('sen).as('nerTags), sentiment('sen).as('sentiment))

output.show(truncate = false)
+----------------------------------------------+------------------------------------------------------+--------------------------------------------------+---------+
|sen |words |nerTags |sentiment|
+----------------------------------------------+------------------------------------------------------+--------------------------------------------------+---------+
|Stanford University is located in California .|[Stanford, University, is, located, in, California, .]|[ORGANIZATION, ORGANIZATION, O, O, O, LOCATION, O]|1 |
|It is a great university . |[It, is, a, great, university, .] |[O, O, O, O, O, O] |4 |
+----------------------------------------------+------------------------------------------------------+--------------------------------------------------+---------+
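The words and nerTags columns above are aligned token by token, so entity mentions can be recovered by grouping consecutive tokens that share a tag. The following is a minimal sketch in plain Scala, not part of this package; the object name EntitySpans and its entities method are hypothetical. It could be wrapped in a Spark UDF to run on the output DataFrame, but that step is left out here.

```scala
object EntitySpans {
  /** Groups consecutive tokens that share the same non-"O" tag into (text, tag) pairs. */
  def entities(words: Seq[String], tags: Seq[String]): Seq[(String, String)] = {
    require(words.length == tags.length, "words and tags must be aligned")
    // Build runs of identically tagged tokens; runs and their tokens are
    // accumulated in reverse order and flipped at the end.
    val runs = words.zip(tags).foldLeft(List.empty[(List[String], String)]) {
      case ((tokens, tag) :: rest, (word, t)) if t == tag => (word :: tokens, tag) :: rest
      case (acc, (word, t)) => (List(word), t) :: acc
    }
    // Drop the "O" (no entity) runs and join each remaining run into one string.
    runs.reverse.collect {
      case (tokens, tag) if tag != "O" => (tokens.reverse.mkString(" "), tag)
    }
  }
}
```

Because the tags shown above carry no begin/inside markers, two adjacent entities of the same type would be merged into a single span by this approach.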
Acknowledgements
Many thanks to Jason Bolton from the Stanford NLP Group for API discussions.
To build
sbt +publishLocal
Using in your own project
In SBT:
resolvers += "GitHub nekonyuu artifacts - releases" at "https://artifacts.nyuu.eu/releases/maven"
libraryDependencies ++= Seq(
  "nekonyuu" %% "spark-corenlp" % "0.3.0"
)
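As noted at the top, the CoreNLP model jars must be added separately. A possible build.sbt fragment is sketched below; it assumes the models are pulled from Maven Central using the standard edu.stanford.nlp coordinates with the models classifier, and it is not taken from this project's own build file.

```scala
// build.sbt fragment (a sketch under the assumptions stated above).
libraryDependencies ++= Seq(
  // CoreNLP itself, at the version this package requires
  "edu.stanford.nlp" % "stanford-corenlp" % "3.7.0",
  // The English language models ship as a separate artifact with the "models" classifier
  "edu.stanford.nlp" % "stanford-corenlp" % "3.7.0" classifier "models"
)
```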
In Maven:
<repositories>
  <repository>
    <id>GitHub nekonyuu artifacts - releases</id>
    <url>https://artifacts.nyuu.eu/releases/maven</url>
  </repository>
</repositories>
<dependency>
  <groupId>nekonyuu</groupId>
  <artifactId>spark-corenlp</artifactId>
  <version>${sparkNlpVersion}</version>
</dependency>