/anno-pipeline

tool chain used for Annotated Gigaword

# This is the pipeline used to annotate Gigaword English v.5
# (Annotated Gigaword, Napoles et al. 2012). 
#
# Courtney Napoles, cdnapoles@gmail.com
# 2012-07-03
# edited Frank Ferraro, ferraro@cs.jhu.edu
#        2013-06-06 to 2013-06-12

NOTES

See pipeline.sh for the full pipeline and the usage of individual steps.
If you are running your own copy, you need to modify
scripts/splitta.1.03/sbd.py so that the paths for SVM_LEARN and
SVM_CLASSIFY point to your installation.

Note that the pipeline uses a parallel environment (8 threads for
parsing), so please configure your jobs accordingly
(e.g., qsub -l num_proc=8,mem_free=16G,h_vmem=22G).

Be sure to set the environment encoding to UTF-8. Also, make sure your
dotfiles do not overwrite PYTHONPATH or PERL5LIB.
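
One way to satisfy both points in a dotfile is sketched below. The
directory names are placeholders, and the locale name en_US.UTF-8 is an
assumption (use whichever UTF-8 locale is installed on your machines).
The ${VAR:+:$VAR} form keeps whatever value was already set and avoids a
stray colon when the variable was empty.

```shell
# In a dotfile such as ~/.bashrc: extend the search paths, never replace them.
# /path/to/extra/python and /path/to/extra/perl are placeholder directories.
export PYTHONPATH="/path/to/extra/python${PYTHONPATH:+:$PYTHONPATH}"
export PERL5LIB="/path/to/extra/perl${PERL5LIB:+:$PERL5LIB}"

# Assumption: the en_US.UTF-8 locale exists on the machine.
export LANG=en_US.UTF-8
```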


USAGE

The default is to run in parallel across different nodes:

./pipeline.sh file_to_annotate.xml working_directory [recaser_host] [OPTIONS]

To run *sequentially* on the current machine:

./pipeline.sh file_to_annotate.xml working_directory [recaser_host] --qsub f [OPTIONS]

To run *sequentially* on a remote machine, qsub the script run_sequential_grid.sh, e.g.:

qsub "./run_sequential_grid.sh file_to_annotate.xml working_directory [recaser_host] [OPTIONS]"

To just run the annotators:

java -Xmx16g -cp bin:lib/stanford-corenlp-2012-05-22.jar:lib/my-xom.jar:lib/stanford-corenlp-2012-05-22-models.jar:lib/joda-time.jar \
     edu.jhu.annotation.GigawordAnnotator --in <TESTFILE>

If you'd like to annotate a file that contains a single document
without any SGML markup, add "--sgml f". However, this is inadvisable
when annotating a large number of files, because loading the
Stanford models takes a couple of minutes. It is more efficient to
include several documents in one file (documents should be
formatted like <DOC><TEXT>parses</TEXT></DOC>).
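
As a rough illustration of putting several documents into one file, the
sketch below wraps each plain-text file in <DOC>/<TEXT> markup, with every
tag on its own line. The helper name bundle_docs and the choice of the
file's base name as the DOC id are assumptions made for this example; the
helper is not part of the pipeline, which has its own converter for
directory input (see DIR_TO_SINGLE_FILE under BASH VARIABLES).

```shell
# Hypothetical helper: wrap each file given as an argument in <DOC>/<TEXT>
# markup and print the combined result to stdout.
# Assumption: the base name of each file is a suitable DOC id
# (no quotes or other characters that would need escaping).
bundle_docs() {
    for f in "$@"; do
        id=$(basename "$f")
        printf '<DOC id="%s">\n<TEXT>\n' "$id"
        # awk prints every line with a trailing newline, so </TEXT> always
        # starts on its own line even if the file lacks a final newline.
        awk '{ print }' "$f"
        printf '</TEXT>\n</DOC>\n'
    done
}

# Usage (file names are placeholders):
#   bundle_docs doc1.txt doc2.txt > bundled.xml
```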

FILE FORMAT

sample.txt shows a sample of the expected file format. If using SGML markup
(which is recommended because then multiple documents can be stored
in the same file), the following format is assumed:

<DOC id="xx">
<TEXT>
...
</TEXT>
</DOC>
<DOC ...

Any tags in between <DOC> and <TEXT> are ignored but passed through
intact. The only tag allowed in <TEXT> is <P>. All text in the
<TEXT> element will be processed and annotated. The pipeline assumes
that each line is EITHER SGML markup or text (so do not put a tag
on the same line as text). The pipeline does not detect or correct
invalid SGML, but it will convert SGML to XML (by adding a root
element and escaping <, >, and &).
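
The line convention and the SGML-to-XML conversion described above can be
illustrated with a short shell sketch. It is not the pipeline's own code:
the function name sgml_to_xml, the root element name FILE, and the rule
used to recognize markup lines (a line that starts with < and ends with >)
are all assumptions made for the example.

```shell
# Sketch only: wrap the input in a root element and escape &, <, and >
# on text lines. Reads the files given as arguments, or stdin if none.
# Assumptions: "FILE" as the root element name; a line counts as markup
# when it starts with "<" and ends with ">" (ignoring surrounding blanks).
sgml_to_xml() {
    printf '<FILE>\n'
    awk '
        # Markup line: pass through unchanged.
        /^[ \t]*<.*>[ \t]*$/ { print; next }
        # Text line: escape & first so the entities added next stay intact.
        {
            gsub(/&/, "\\&amp;")
            gsub(/</, "\\&lt;")
            gsub(/>/, "\\&gt;")
            print
        }
    ' "$@"
    printf '</FILE>\n'
}

# Usage (file names are placeholders):
#   sgml_to_xml input.sgml > output.xml
```

Under this rule a text line that happens to begin with < and end with >
would be treated as markup, which is one more reason to keep tags and text
on separate lines.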

small-sample.txt is a much smaller version of sample.txt. See
example_output/small_sample/20130618-142319/ for example output
(the final file is small-sample.annotated.xml).

DEPENDENCIES

Software versions used:
Splitta 1.03
Stanford CoreNLP 1.3.2
HTML::Entities

Requirements:
jgrapht.jar
joda-time.jar
my-xom.jar
splitta.1.03
stanford-corenlp-2012-05-22.jar
stanford-corenlp-2012-05-22-models.jar
svm_light.6.02
umd-parser.jar
wsj-6.pml	# grammar file

By default, the scripts assume these are in ./lib.

If you want to do true-casing, you need to start the recaser server. A default
script is provided in scripts/start-recaser.sh.

BASH VARIABLES

A number of variables control the program. The most important ones are:

* DIR_TO_SINGLE_FILE 
  * script to convert a directory of files into a single file
  * defaults to scripts/raw_text_to_agiga_input.pl
  * not applicable if the input argument is a file
* RECASER_SCRIPT
  * script to handle true casing
  * defaults to scripts/recase.sh
* MAIL_OPTIONS (default blank)
  * A local script variable for pipeline.sh that can be set to have qsub automatically
    email you when a job is done. The string should include both -m and -M options, e.g.,
    MAIL_OPTIONS="-m bea -M <email@address.com>"

When recasing, you may want to set the port (RECASER_PORT).

All files are marked by a timestamp; the default is YYYYMMDD-HHMMSS (24hr clock)
but you can control it with TIMESTAMP. This allows you to start/stop the pipeline
at various stages.
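
A timestamp in the default layout can be produced with date, as in the
sketch below. The function name make_timestamp is hypothetical, and the
sketch assumes that TIMESTAMP is picked up from the environment; check
pipeline.sh for how the variable is actually read.

```shell
# Build a timestamp in the default YYYYMMDD-HHMMSS layout (24-hour clock),
# e.g. 20130618-142319.
make_timestamp() {
    date +%Y%m%d-%H%M%S
}

# Assumption: pipeline.sh reads TIMESTAMP from the environment.
# Reusing a recorded value later refers to the same set of files.
TIMESTAMP=$(make_timestamp)
export TIMESTAMP
echo "timestamp for this run: $TIMESTAMP"
```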

The pipeline depends on a single initial file. If the input is a directory,
$DIR_TO_SINGLE_FILE is applied to it. If the input is a file and
sym_link_okay=1, a symlink to the given file is created; otherwise, the file
is copied.
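
The file case of this logic can be sketched as follows. This is an
illustration, not the code in pipeline.sh: the function name stage_input
and its two arguments are hypothetical, and the directory case is left out
because the calling convention of $DIR_TO_SINGLE_FILE is not described here.

```shell
# Sketch: stage_input SRC DEST
# Links SRC to DEST when sym_link_okay=1, otherwise copies it.
stage_input() {
    src=$1
    dest=$2
    if [ "${sym_link_okay:-0}" = 1 ]; then
        # Resolve SRC to an absolute path so the link stays valid
        # regardless of the directory DEST lives in.
        src_abs=$(cd "$(dirname "$src")" && pwd)/$(basename "$src")
        ln -s "$src_abs" "$dest"
    else
        cp "$src" "$dest"
    fi
}

# Usage (paths are placeholders):
#   sym_link_okay=1
#   stage_input file_to_annotate.xml working_directory/input.xml
```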

Setting real_run=false performs a dry run only.

There are a number of other variables, which will be documented over time.