Official repository for the BLLIP reranking parser (also known as Charniak-Johnson parser, Charniak parser, Brown reranking parser)

~BLLIP/reranking-parser/README

(c) Mark Johnson, Eugene Charniak, 24th November 2005 --- August 2006

We request acknowledgement in any publications that make use of this
software and any code derived from this software.  Please report the
release date of the software that you are using, as this will enable
others to compare their results to yours.

COMPILING AND RUNNING THE PARSER
================================

To compile the two-stage parser, first define GCCFLAGS appropriately
for your machine, e.g., with csh or tcsh:

> setenv GCCFLAGS "-march=pentium4 -mfpmath=sse -msse2 -mmmx"

or

> setenv GCCFLAGS "-march=opteron -m64"

(if unsure, it is safe to leave GCCFLAGS unset -- the defaults are
generally good these days)
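
The setenv lines above are csh/tcsh syntax.  Under sh or bash the
equivalent is an export.  A minimal sketch, reusing the Opteron flags
from the example above (substitute whatever suits your machine):

```shell
# sh/bash equivalent of: setenv GCCFLAGS "-march=opteron -m64"
export GCCFLAGS="-march=opteron -m64"

# To fall back to the compiler defaults instead, leave it unset:
# unset GCCFLAGS
```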

Then execute 

> make

    Sidenote on compiling on OS X
    -----------------------------
    OS X uses the clang compiler by default, which may cause compilation
    problems.  Try setting this environment variable to change the
    default C++ compiler:

    > setenv CXX g++

    See BLLIP#13 for more information.

After it has built, the parser can be run with

> parse.sh <sourcefile.txt>

E.g.,

> parse.sh sample-text/sample-data.txt

The input text must already be segmented into sentences, with each
sentence wrapped in <s> ... </s> tags:

    <s> Sentence 1 </s>
    <s> Sentence 2 </s>
    ...

Note that there must be a space between each tag and the sentence text.
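
If your text is already one sentence per line, the tags can be added
mechanically.  The following sh/bash sketch is not part of the
distribution; the function name wrap_sentences is hypothetical, and it
assumes sentence segmentation has been done beforehand:

```shell
# Wrap each non-empty input line as "<s> sentence </s>".
# Leading and trailing whitespace is trimmed first so that exactly one
# space separates each tag from the sentence.  Blank lines are dropped.
wrap_sentences() {
    sed -e '/^[[:space:]]*$/d' \
        -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' \
        -e 's/.*/<s> & <\/s>/'
}
```

For example, running wrap_sentences with standard input redirected from
a file of sentences and standard output redirected to a new file
produces input suitable for parse.sh.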

The script parse-eval.sh takes a list of treebank files as arguments
and extracts the terminal strings from them, runs the two-stage parser
on those terminal strings and then evaluates the parsing accuracy with
the Sparseval program[*].  For example, on my machine the Penn
Treebank 3 CD-ROM is installed at /usr/local/data/Penn3/, so the
following command evaluates the two-stage parser on section 24:

> parse-eval.sh /usr/local/data/Penn3/parsed/mrg/wsj/24/wsj*.mrg

The Makefile will attempt to automatically download and build Sparseval
for you if you run "make sparseval".

[*] Sparseval is available from 
	http://old-site.clsp.jhu.edu/ws2005/groups/eventdetect/files/SParseval.tgz
    See this paper for more information:
    @inproceedings{roark2006sparseval,
        title={SParseval: Evaluation metrics for parsing speech},
        author={Roark, Brian and Harper, Mary and Charniak, Eugene and Dorr, Bonnie and Johnson, Mark and Kahn, Jeremy G and Liu, Yang and Ostendorf, Mari and Hale, John and Krasnyanskaya, Anna and others},
        booktitle={Proc. LREC},
        year={2006}
    }

    We no longer distribute evalb with the parser, but it is still available:
    http://nlp.cs.nyu.edu/evalb/

USING THE PARSER
================

See first-stage/README for more information.

TRAINING THE RERANKER
=====================

Retraining the reranker takes a considerable amount of time, disk
space and RAM.  At Brown we use a dual Opteron machine with 16GB RAM,
and it takes around two days.  You should be able to do it with only
8GB RAM, and maybe even with 4GB RAM with an appropriately tweaked
kernel (e.g., sysctl overcommit_memory, and a so-called 4GB/4GB split
if you're using a 32-bit OS).

The time and memory you need depend on the features that the reranker
extracts and the size of the n-best tree training and development
data.  You can change the features that are extracted by changing
second-stage/programs/features/features.h, and you can reduce the size
of the n-best tree data by reducing NPARSES in the Makefile from 50
to, say, 25.

You will need to edit the Makefile in order to retrain the reranker.

First, you need to set the variable PENNWSJTREEBANK in Makefile to the
directory that holds your version of the Penn WSJ Treebank.  On my
machine this is:

PENNWSJTREEBANK=/usr/local/data/Penn3/parsed/mrg/wsj/
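
If you would rather not modify the Makefile, make also accepts variable
assignments on its command line, and these take precedence over
ordinary assignments inside a Makefile.  A minimal sh/bash sketch,
assuming the Makefile does not pin the variable with an "override"
directive; the helper name bllip_make is hypothetical:

```shell
# Location of the Penn WSJ Treebank (the path from the example above).
PENNWSJTREEBANK=/usr/local/data/Penn3/parsed/mrg/wsj/

# Hypothetical helper: run make with the treebank location passed on the
# command line.  Any further arguments (goals, other variables) are
# forwarded unchanged.
bllip_make() {
    make PENNWSJTREEBANK="$PENNWSJTREEBANK" "$@"
}
```

With this approach the same assignment has to be supplied on every make
invocation used for retraining, so that all steps see the same value.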

On estimators: There are multiple estimators one can use when retraining
the reranker.  cvlm and cvlm-owlqn are the main ones of interest.
cvlm-owlqn is significantly faster than cvlm but unfortunately has been
removed from this distribution due to licensing conflicts (but see the
license section at the bottom of this file).

If you're using cvlm as your estimator (the default), you'll also need
the Boost C++ libraries and the libLBFGS library in order to retrain
the reranker.  If you're using cvlm-owlqn as your estimator, you can
ignore this.  libLBFGS is available at
http://www.chokkan.org/software/liblbfgs/ under the MIT license.  On
Ubuntu, you'll need the liblbfgs-dev package:

> sudo apt-get install liblbfgs-dev

For older versions of Ubuntu, you may first need to add a PPA that
provides liblbfgs-dev, and then install the package as above:

> sudo add-apt-repository --yes ppa:ktm5j/uva-cs-ppa
> sudo apt-get update

Boost can be obtained from http://www.boost.org/ or with the libboost-dev
package in Ubuntu:

> sudo apt-get install libboost-dev

While many modern Linux distributions come with the Boost C++
libraries pre-installed, if the Boost C++ libraries are not included
in your standard libraries and headers, you will need to install them
and add an include file specification for them in your GCCFLAGS.  For
example, if you have installed the Boost C++ libraries in
/home/mj/C++/boost, then your GCCFLAGS environment variable should be
something like:

> setenv GCCFLAGS "-march=pentium4 -mfpmath=sse -msse2 -mmmx -I /home/mj/C++/boost"

or

> setenv GCCFLAGS "-march=opteron -m64 -I /home/mj/C++/boost"

Once this is set up, you retrain the reranker as follows:

> make reranker 
> make nbesttrain
> make eval-reranker

The script train-eval-reranker.sh does all of this.

The reranker goal builds all of the programs, nbesttrain constructs
the 20 folds of n-best parses required for training, and eval-reranker
extracts features, estimates their weights and evaluates the
reranker's performance on the development data (dev) and the two test
data sets (test1 and test2).

If you have a multiprocessor or multi-core machine, you can run 2 (or
more) jobs in parallel by running

> make -j 2 nbesttrain

Currently this only helps for nbesttrain (but this is the slowest
step, so maybe this is not so bad).
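
To pick the job count from the machine rather than hard-coding it, a
small sh/bash helper can query the number of online processors.  This
is only a sketch: the function name cpu_count is hypothetical, and it
relies on getconf accepting _NPROCESSORS_ONLN, which holds on Linux and
OS X but is not guaranteed everywhere.

```shell
# Print the number of online processors, or 1 if it cannot be determined.
cpu_count() {
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=1
    # Guard against empty or non-numeric output.
    case "$n" in
        ''|*[!0-9]*) n=1 ;;
    esac
    echo "$n"
}
```

It could then be used as: make -j "$(cpu_count)" nbesttrain.  Running
more jobs at once may also increase memory use, so a smaller number can
be the safer choice on machines with limited RAM.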

The Makefile contains a number of variables that control how the
training process works.  The most important of these is the VERSION
variable.  You should do all of your experiments with VERSION=nonfinal,
and only run with VERSION=final once to produce results for publication.

If VERSION is nonfinal then the reranker trains on WSJ PTB sections
2-19, sections 20-21 are used for development, section 22 is used as
test1 and section 24 is used as test2 (this approximately replicates
the Collins 2000 setup).

If VERSION is final then the reranker trains on WSJ PTB sections 2-21,
section 24 is used for development, section 22 is used as test1 and
section 23 is used as test2.
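
The two configurations can be summarized side by side.  The sh/bash
function below is purely illustrative (the name wsj_split is
hypothetical and nothing in the Makefile calls it); it restates the
section assignments described above:

```shell
# Print the WSJ PTB section assignment for a given VERSION value.
wsj_split() {
    case "$1" in
        nonfinal) echo "train=2-19 dev=20-21 test1=22 test2=24" ;;
        final)    echo "train=2-21 dev=24 test1=22 test2=23" ;;
        *)        echo "unknown VERSION: $1" >&2; return 1 ;;
    esac
}
```

Note that section 23 is touched only when VERSION is final, which is
why that setting is reserved for the single run that produces results
for publication.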

The Makefile also contains variables you may want to change, such as
NBEST, which specifies how many parses are extracted for each
sentence, and NFOLDS, which specifies how many folds are created.

If you decide to experiment with new features or new feature weight
estimators, take a close look at the Makefile.  If you change the
features please also change FEATURESNICKNAME; this way your new
features won't overwrite our existing ones.  Similarly, if you change
the feature weight estimator please pick a new ESTIMATORNICKNAME and
if you change the n-best parser please pick a new NBESTPARSERNICKNAME;
this way your new n-best parses or feature weights won't overwrite the
existing ones.

To get rid of (many of) the object files produced in compilation, run:

> make clean

Training, especially constructing the 20 folds of n-best parses,
produces a lot of temporary files, which you can remove if you want to.
To remove the temporary files used to construct the 20 folds of n-best
parses, run:

> make nbesttrain-clean

All of the information needed by the reranker is in
second-stage/models.  To remove everything except the information
needed for running the reranking parser, run:

> make train-clean

To clean up everything, including the data needed for running the
reranking parser, run:

> make real-clean

MULTI-THREADED PARSING
======================

The first-stage parser, which accounts for about 95% of the running
time, is multithreaded.  However, multithreading support does NOT
appear to be stable at this time and its use is discouraged.  The
default is to use a single thread, which does not cause problems.  If
you're willing to try your luck with multithreading, the current
maximum is 64 threads.  To change the number of threads (or the
maximum), see the README file for the first-stage parser.  For the time
being a non-threaded version (oparseIt) is available in case there are
problems with threads.

To use the non-threaded parser instead, change the following line in
the Makefile:

NBESTPARSER=first-stage/PARSE/parseIt

to

NBESTPARSER=first-stage/PARSE/oparseIt

That is, the line is identical except for the "o" in oparseIt.

Then run oparse.sh, rather than parse.sh.