Redshift

Redshift is a natural-language syntactic dependency parser. The current release features fast and accurate parsing, but requires the text to be pre-processed. Future releases will integrate tokenisation and part-of-speech tagging, and have special features for parsing informal text.

If you don't know what a syntactic dependency is, read this: http://googleresearch.blogspot.com.au/2013/05/syntactic-ngrams-over-time.html

Main features:

Fast linear time parsing: the slowest model is still over 100 sentences/second
State-of-the-art accuracy: 93.5% UAS on English (Stanford scheme, WSJ 23)
Super fast "greedy" mode: over 1,000 sentences per second at 91.5% accuracy
Native Python interface (the parser is written in Cython)

Key techniques:

Arc-eager transition-based dependency parser
Averaged perceptron for learning
redshift.parser.BeamParser is basically the model of Zhang and Nivre (2011)
redshift.parser.GreedyParser adds the non-monotonic model of Honnibal et al (2013) to the dynamic oracle model of Goldberg and Nivre (2012)
redshift.features includes the standard Zhang and Nivre (2011) feature set, and also some work pending publication.

Example usage

>>> import redshift.parser
>>> parser = redshift.parser.load_parser('/tmp/stanford_beam8')
>>> import redshift.io_parse
>>> sentences = redshift.io_parse.read_pos('Barry/NNP Wright/NNP ,/, acquired/VBN by/IN Applied/NNP for/IN $/$ 147/CD million/CD ,/, makes/VBZ computer-room/JJ equipment/NN and/CC vibration-control/JJ systems/NNS ./.')
>>> parser.add_parses(sentences)
>>> import sys; sentences.write_parses(sys.stdout)
0       Barry   NNP     1       nn
1       Wright  NNP     11      nsubj
2       ,       ,       1       P
3       acquired        VBN     1       partmod
4       by      IN      3       prep
5       Applied NNP     4       pobj
6       for     IN      3       prep
7       $       $       6       pobj
8       147     CD      7       number
9       million CD      7       number
10      ,       ,       1       P
11      makes   VBZ     -1      ROOT
12      computer-room   JJ      13      amod
13      equipment       NN      11      dobj
14      and     CC      13      cc
15      vibration-control       JJ      16      amod
16      systems NNS     13      conj
17      .       .       11      P

The command-line interfaces have a lot of probably-confusing options for my current research. The main scripts I use are scripts/train.py, scripts/parse.py, and scripts/evaluate.py . All print usage information, and require the plac library.

The following commands will set up a virtualenv with Python 2.7.5, the parser, and its core dependencies from scratch:

From a Unix/OSX terminal, after compilation, and within the "redshift" directory:

$ export PYTHONPATH=`pwd`
$ ./scripts/train.py # Use -h or --help for more detailed info. Most of these are research flags.
usage: train.py [-h] [-a static] [-i 15] [-k 1] [-f 10] [-r] [-d] [-u] [-n 0] [-s 0] train_loc model_loc
train.py: error: too few arguments
$ ./scripts/train.py -k 16 -p <CoNLL formatted training data> <output model directory>
$ ./scripts/parse.py <model directory produced by train.py> <input> <output_dir>
$ ./scripts/evaluate.py output_dir/parses <gold file>

In more detail:

Ensure your PYTHONPATH variable includes the redshift directory
Most of the training-script flags refer to research settings.
the k parameter controls the speed-accuracy trade-off, via the beam-width. Run-time is roughly O(nk), where n is the number of words, and k is the beam-width. In practice it's slightly sub-linear in k due to some simple memoisation. Accuracy plateaus at about k=64. For k=1, use "-a dyn -r -d", to enable some recent special-case wizardry that gives the k=1 case over 1% extra accuracy, at no run-time cost.
The -p flag tells train.py to train a POS tagger.
parse.py reads in the training configuration from "parser.cfg", which sits in the output model directory.
The parser currently expects one sentence per line, space-separated tokens, tokens of the form word/POS.
evaluate.py runs as a separate script from parse.py so that the parser never sees the answers, and cannot "accidentally cheat".

Installation

The following commands will set up a virtualenv with Python 2.7.5, the parser, and its core dependencies from scratch:

$ git clone https://github.com/syllog1sm/redshift.git
$ cd redshift
$ ./make_virtualenv.sh # Downloads Python 2.7.5 and virtualenv
$ source $HOME/rsve/bin/activate
$ ./install_sparsehash.sh # Downloads the Google sparsehash 2.2 library and installs it under the virtualenv
$ pip install cython
$ python setup.py build_ext --inplace # site-install currently broken, use --inplace
$ export PYTHONPATH=`pwd`:$PYTHONPATH # ...and set PYTHONPATH.
$ pip install plac # For command-line interfaces

virtualenv is not a requirement, although it's useful. If a virtualenv is not active (i.e. if the $VIRTUALENV environment variable is not set), install_sparsehash.sh will install the Google sparsehash library under redshift/ext/, to avoid assuming root privileges for the installation. To install sparsehash elsewhere, add the path to the "includes" list in setup.py

You might wish to handle the tasks covered by ./make_virtualenv.sh and ./install_sparsehash.sh yourself, depending on how you want your environment set up.

Cython

redshift is written almost entirely in Cython, a superset of the Python language that additionally supports calling C/C++ functions and declaring C/C++ types on variables and class attributes. This allows the compiler to generate very efficient C/C++ code from Cython code. Many popular Python packages, such as numpy, scipy and lxml, rely heavily on Cython code.

A Cython source file such as learn/perceptron.pyx is compiled into learn/perceptron.cpp and learn/perceptron.so by the project's setup.py file. The module can then by imported by standard Python code, although only the pure-Python functions (declared by "def", instead of "cdef") will be accessible.

The parser currently has Cython as a requirement, instead of distributing the "compiled" .cpp files as part of the release (against Cython's recommendation). This could change in future, but currently it feels strange to have a "source" release that users wouldn't be able to modify.

LICENSE (GPL 3)

I'm still working out how to specify the license, but my intention at the moment is:

FOSS for non-commercial use
Modifications should be distributed
Commercial use licenses available on request. These will be granted pretty much automatically to any company that isn't yet profitable, or really anyone who isn't big.
RESTful parser APIs to make it easier to start using the parser.

Copyright (C) 2013 Matthew Honnibal

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.

bplank/redshift

Redshift

Example usage

Installation

Cython

LICENSE (GPL 3)