DPTSG
January 2010, Matt Post
--

The code contained in this project directory was used to obtain the
results in our ACL 2009 short paper:

  "Bayesian learning of a tree substitution grammar"
  Matt Post & Daniel Gildea
  ACL 2009

If you want to learn how the code works, along with all the caveats,
please see the file README.code.  This file describes the project
files, along with the steps needed to prepare all of the data to do
Gibbs sampling.

--

INSTALLATION AND WALK-THROUGH
-------------------------------------

All files and code herein assume a base directory of $ENV{DPTSG}.  You
should set this in your profile, e.g., in bash,

  $ export DPTSG=$HOME/code/dptsg

You will also need to put this directory in your Perl search path:

  $ export PERL5LIB+=:$DPTSG

* List of programs in scripts/

** oneline_treebank.pl

   Takes (on STDIN) a multi-line treebank tree and condenses it to a
   single line (on STDOUT).

** clean.pl

   Removes empty nodes and Treebank v2 information from node labels.

** build_lex.pl

   Builds the lexicon.  This is used by most programs to determine
   which words should be converted to one of the ~ 60 unknown word
   classes.

** print_leaves.pl

   Takes a one-line tree and prints out its frontier.

** extract_lines.pl

   Takes a list of line numbers and a file, and prints out only the
   lines of the file whose numbers appear in the list.

** annotate_spinal_grammar.pl

   Takes a treebank (one tree per line) and annotates it with the
   derivations from the spinal grammar, using the Magerman head
   selection rules.

** rule_probs.pl

   Produces a grammar from a combination of rule counts in the form

     <count> <rule>

   and one-line trees (these can be interspersed).

** extract_bod01_grammar.pl

   Samples random trees at various heights from a treebank,
   reproducing the "minimal subset" grammar of Bod.

* Preparing the data

** Process the trees into a nice format.

   You need a copy of the WSJ portion of the Penn treebank.  The
   commands below get rid of the treebank2 info and remove NONE nodes
   and traces.

   $ cd $basedir
   $ mkdir data
   $ (cd /p/mt/corpora/wsj; cat {02,03,04,05,06,07,08,09,10,11,12,13,14,15,16,17,18,19,20,21}/*mrg) | \
       ./oneline_treebank.pl > data/wsj.trees.02-21
   $ cat data/wsj.trees.02-21 | ./clean.pl > data/wsj.trees.02-21.clean

** Generate the lexicon

   This determines which words get transformed into one of the ~ 60
   unknown word bins.  It writes (on STDOUT) a vocabulary file with
   lines of the form "ID WORD COUNT", for all words found in the
   training trees.

   $ ./build_lex.pl < data/wsj.trees.02-21.clean > data/lex.02-21

** Annotate the training data with spinal annotations

   A TSG derivation tree is represented by prepending a '*' to the
   labels of nodes that are internal to a TSG rule, e.g., the tree

     (S (*NP (*DT the) (NN boy)) (*VP (VBD was)) (*. .))

   would yield the rules

     (S (NP (DT the) NN) (VP VBD) (. .))
     (NN boy)
     (VBD was)

   Produce the spinal annotation of the training data with the
   following command:

   $ ./annotate_spinal_grammar.pl data/lex.02-21 < data/wsj.trees.02-21.clean \
       > data/wsj.trees.02-21.clean.annotated

   You need this representation if you want to initialize the Gibbs
   sampler from the spinal derivations instead of the PCFG
   derivations, and also to extract the spinal baseline grammar.
   This initialization is recommended since it converges more quickly
   to better grammars.

** Generate the development and test sets.

   $ cat /p/mt/corpora/wsj/22/*.mrg | ./oneline_treebank.pl > data/wsj.trees.22
   $ cat data/wsj.trees.22 | ./clean.pl > data/wsj.trees.22.clean
   $ cat data/wsj.trees.22.clean | ./print_leaves.pl > data/wsj.22.words

   $ cat /p/mt/corpora/wsj/23/*.mrg | ./oneline_treebank.pl > data/wsj.trees.23
   $ cat data/wsj.trees.23 | ./clean.pl > data/wsj.trees.23.clean
   $ cat data/wsj.trees.23.clean | ./print_leaves.pl > data/wsj.23.words

   Select the sentences with at most forty words:

   $ awk 'BEGIN { sentno=0 } { sentno++; if (NF <= 40) printf("%d\n",sentno); }' \
       data/wsj.23.words > data/wsj.23.words.lines-max40
   $ ./extract_lines.pl -l data/wsj.23.words.lines-max40 -f data/wsj.23.words \
       > data/wsj.23.words.max40

   $ awk 'BEGIN { sentno=0 } { sentno++; if (NF <= 40) printf("%d\n",sentno); }' \
       data/wsj.22.words > data/wsj.22.words.lines-max40
   $ ./extract_lines.pl -l data/wsj.22.words.lines-max40 -f data/wsj.22.words \
       > data/wsj.22.words.max40

** Generate the PCFG rule probs used for the base measure.

   $ cat data/wsj.trees.02-21.clean | ./rule_probs.pl -counts > data/rule_counts
   $ cat data/wsj.trees.02-21.clean | ./rule_probs.pl > data/rule_probs

   (Alternatively, the grammar could be produced directly from the
   counts:

   $ cat data/rule_counts | ./rule_probs.pl > data/rule_probs

   )

** Generate the spinal grammar

   $ cat data/wsj.trees.02-21.clean.annotated | ./rule_probs.pl -counts > data/spinal_counts
   $ cat data/rule_counts data/spinal_counts | ./rule_probs.pl > data/spinal_probs

** Generate Bod's grammar

   $ cat data/wsj.trees.02-21.clean | ./extract_bod01_grammar.pl > data/bod01.rules

   Each line of the file "bod01.rules" begins with the height of that
   rule.  To build the grammar, remove that field, add in the PCFG
   counts, and normalize:

   $ (perl -pe 's/^\d+ //' data/bod01.rules; cat data/rule_counts) | \
       ./rule_probs.pl > data/bod01.prb

* Gibbs sampling

  The Gibbs sampler comprises tsg.pl, the generic sampler code in
  Sampler.pm, and the TSG-specific sampling code in Sampler/TSG.pm.
  As input, it takes a treebank with arbitrary TSG derivations and a
  PCFG used to score the base measure.  Each iteration's output is
  written to a subdirectory whose name is the iteration number, and
  whose contents are the corpus state at the end of that iteration
  and the counts of TSG rules extracted from that corpus derivation
  state.  The sampler can be interrupted at any time, and will
  restart from the last completed iteration (if the last iteration's
  directory is incomplete, it is safer to remove it before
  restarting).

  Usage: tsg.pl
    -iters    number of iterations
    -corpus   corpus to initialize counts from (unless picking up
              from an interrupted run)
    -alpha    the value of alpha for the DP prior
    -stop     the stop probability for the DP base measure
    -pcfg     the pcfg grammar used for the DP base measure
    -rundir   the run directory (default = pwd)
    -two      sample two nodes at a time (default = sample one)
    [many more options are available]

  e.g.,

  $ tsg.pl -corpus $basedir/data/wsj.trees.02-21.clean -iters 500

  To extract a Gibbs-sampled grammar, use one of:

  - extract a summed grammar (from counts of first $i iters)

    (for num in $(seq 1 $i); do
       bzcat $num/counts.bz2
     done) | $basedir/scripts/rule_probs.pl > 1-$i.prb

  - extract a point grammar (from a single iteration $i)

    bzcat $i/counts.bz2 | $basedir/scripts/rule_probs.pl > $i.prb

* Parsing

  You can now parse with these grammars using your favorite CKY
  parser.  I have modified Mark Johnson's parser to work with these
  grammars; please feel free to email me for a copy if I haven't
  already posted the source to my website.

* Evaluation

  Use evalb with the COLLINS.prm file.

  http://nlp.cs.nyu.edu/evalb/
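
  evalb compares a gold file and a parser-output file tree by tree,
  so the gold trees need the same forty-word filtering that was
  applied to the word files above.  The following is only a rough
  sketch of that step, not the procedure used for the paper.  The
  function names lines_max_len and select_lines are hypothetical and
  are not part of this distribution; select_lines is an awk stand-in
  for extract_lines.pl.

```shell
# Hypothetical helpers (assumptions for illustration, not shipped with DPTSG).

# Print the 1-based line numbers of the sentences in file $2 that have
# at most $1 whitespace-separated tokens.
lines_max_len() {
    awk -v max="$1" 'NF <= max { print NR }' "$2"
}

# Print the lines of file $2 whose line numbers are listed (one per
# line) in file $1.  The first file is read into the keep[] table,
# then the second file is filtered against it.
select_lines() {
    awk 'NR == FNR { keep[$1] = 1; next } FNR in keep' "$1" "$2"
}
```

  With those defined, the gold trees for section 23 could be selected
  and scored roughly as follows, where parsed.23.max40 is a
  placeholder for whatever file your parser wrote:

  $ lines_max_len 40 data/wsj.23.words > data/wsj.23.lines-max40
  $ select_lines data/wsj.23.lines-max40 data/wsj.trees.23.clean > data/wsj.trees.23.clean.max40
  $ evalb -p COLLINS.prm data/wsj.trees.23.clean.max40 parsed.23.max40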