This is a Python wrapper for the Stanford University NLP group's Java-based CoreNLP tools. It can either be imported as a module or run as a JSON-RPC server. Because it uses many large trained models (requiring 3GB RAM on 64-bit machines and usually a few minutes of loading time), most applications will probably want to run it as a server.
It requires pexpect. The repository includes and uses code from jsonrpc and python-progressbar.
There's not much to this script. I decided to create it after having problems using other Python wrappers for Stanford's dependency parser. First, the JPype approach used in stanford-parser-python had trouble initializing a JVM on two separate computers. Next, I discovered I could not use a Jython solution because the Python modules I needed did not work in Jython.
It runs the Stanford CoreNLP jar in a separate process, communicates with the Java process through its command-line interface, and makes assumptions about the parser's output in order to parse it into a Python dict object and transfer it using JSON. The parser will break if the output changes significantly. I have only tested this on CoreNLP tools version 1.2.0, released 2011-09-16.
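For illustration, the communication boils down to the following pexpect pattern. This is a minimal sketch, assuming CoreNLP's interactive shell prints an "NLP> " prompt; the real corenlp.py also tracks model-loading progress and parses the raw output into the dict described below.

import pexpect

# Minimal sketch: spawn the CoreNLP command-line shell and talk to it.
# Assumes the interactive prompt is "NLP> "; corenlp.py hides these details.
child = pexpect.spawn("java -cp stanford-corenlp-2011-09-16.jar:"
                      "stanford-corenlp-2011-09-14-models.jar:xom.jar:joda-time.jar "
                      "-Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP "
                      "-props default.properties")
child.expect("NLP> ", timeout=600)  # loading the models can take minutes
child.sendline("hello world")       # feed one sentence to the pipeline
child.expect("NLP> ", timeout=60)   # wait for the next prompt
raw_output = child.before           # the parser's raw output, to be parsed into a dict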
You should have downloaded and unpacked the tgz file containing Stanford's CoreNLP package. Then copy all of the Python files from this repository into the stanford-corenlp-2011-09-16 folder.
In other words:
sudo pip install pexpect
wget http://nlp.stanford.edu/software/stanford-corenlp-v1.2.0.tgz
tar xvfz stanford-corenlp-v1.2.0.tgz
cd stanford-corenlp-2011-09-16
git clone git://github.com/dasmith/stanford-corenlp-python.git
mv stanford-corenlp-python/* .
Then, to launch a server:
python corenlp.py
Optionally, you can specify a host or port:
python corenlp.py -H 0.0.0.0 -p 3456
That will run a public JSON-RPC server on port 3456.
Assuming you are running on port 8080, the code in client.py
shows an example parse:
import jsonrpc
from simplejson import loads
server = jsonrpc.ServerProxy(jsonrpc.JsonRpc20(),
jsonrpc.TransportTcpIp(addr=("127.0.0.1", 8080)))
result = loads(server.parse("hello world"))
print "Result", result
That returns a list containing a dictionary for each sentence, with the keys text, tuples (the dependencies), and words:
Result [{'text': 'hello world',
'tuples': [['amod', 'world', 'hello']],
'words': [['hello', {'NamedEntityTag': 'O', 'CharacterOffsetEnd': 5, 'CharacterOffsetBegin': 0, 'PartOfSpeech': 'JJ', 'Lemma': 'hello'}],
['world', {'NamedEntityTag': 'O', 'CharacterOffsetEnd': 11, 'CharacterOffsetBegin': 6, 'PartOfSpeech': 'NN', 'Lemma': 'world'}]]}]
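Since the result is just nested lists and dictionaries, you can walk it directly. For example, using the key names shown above:

for sentence in result:
    print "Sentence:", sentence['text']
    # Each dependency tuple is [relation, head, dependent].
    for relation, head, dependent in sentence['tuples']:
        print "  %s(%s, %s)" % (relation, head, dependent)
    # Each word is a [token, annotations] pair.
    for token, attrs in sentence['words']:
        print "  %s/%s lemma=%s" % (token, attrs['PartOfSpeech'], attrs['Lemma'])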
To use it in a regular script or to edit/debug it (because errors via RPC are opaque), load the module instead:
from corenlp import *
corenlp = StanfordCoreNLP() # wait a few minutes...
corenlp.parse("Parse an imperative sentence, damnit!")
I added a function called parse_imperative that introduces a dummy pronoun to overcome the problems that dependency parsers have with imperative sentences. It handles only one sentence at a time.
corenlp.parse("stop smoking")
>> [{"text": "stop smoking", "tuples": [["nn", "smoking", "stop"]], "words": [["stop", {"NamedEntityTag": "O", "CharacterOffsetEnd": 4, "Lemma": "stop", "PartOfSpeech": "NN", "CharacterOffsetBegin": 0}], ["smoking", {"NamedEntityTag": "O", "CharacterOffsetEnd": 12, "Lemma": "smoking", "PartOfSpeech": "NN", "CharacterOffsetBegin": 5}]]}]
corenlp.parse_imperative("stop smoking")
>> [{"text": "stop smoking", "tuples": [["xcomp", "stop", "smoking"]], "words": [["stop", {"NamedEntityTag": "O", "CharacterOffsetEnd": 8, "Lemma": "stop", "PartOfSpeech": "VBP", "CharacterOffsetBegin": 4}], ["smoking", {"NamedEntityTag": "O", "CharacterOffsetEnd": 16, "Lemma": "smoke", "PartOfSpeech": "VBG", "CharacterOffsetBegin": 9}]]}]
Only with the dummy pronoun does the parser correctly identify the first word, stop, as a verb.
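The actual implementation lives in corenlp.py, but conceptually the trick amounts to prepending a subject, parsing, and stripping the dummy token back out. A rough sketch (note that, as in the output above, the character offsets still reflect the prepended word):

from simplejson import loads, dumps

# Rough sketch of the dummy-pronoun idea, not the exact code in corenlp.py.
def parse_imperative_sketch(corenlp, sentence):
    result = loads(corenlp.parse("You " + sentence))  # parse as a declarative sentence
    for s in result:
        s['text'] = sentence
        # Drop the dummy token and any dependency tuples that mention it.
        s['words'] = [w for w in s['words'] if w[0] != 'You']
        s['tuples'] = [t for t in s['tuples'] if 'You' not in t]
    return dumps(result)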
Coreferences, when found, are returned in the coref key as a list of reference pairs, e.g. {'coref': [['he', 'John']]}.
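For example, with a text that contains a pronoun and its antecedent (assuming, as in the example above, that the coref key appears on the dictionary only when coreferences are found):

result = loads(corenlp.parse("John is a doctor. He works at the hospital."))
for sentence in result:
    for mention, antecedent in sentence.get('coref', []):  # absent when no corefs found
        print "%s refers to %s" % (mention, antecedent)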
Stanford CoreNLP tools require a large amount of free memory. Java 5+ uses about 50% more RAM on 64-bit machines than on 32-bit machines. Users on 32-bit machines can lower the memory requirements by changing -Xmx3g to -Xmx2g or even less.
If pexpect times out while loading models, make sure you have enough free memory and that you can run the server alone, without your kernel killing the Java process:
java -cp stanford-corenlp-2011-09-16.jar:stanford-corenlp-2011-09-14-models.jar:xom.jar:joda-time.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -props default.properties
You can reach me, Dustin Smith, by sending a message on GitHub or through email (contact information is available on my webpage).
Todo:
- Add a mutex on the parser
- Write test functions for parsing accuracy
- Calibrate parse-time prediction as a function of sentence inputs