Implicit Argument Prediction with Event Knowledge
Code for the NAACL 2018 paper: Implicit Argument Prediction with Event Knowledge.
Dependencies
- python 2.7
- theano >= 0.9
- numpy, nltk, gensim, lxml
Usage
- Clone the repository.
git clone git@github.com:pxch/event_imp_arg.git ~/event_imp_arg
cd ~/event_imp_arg
- Set environment variables.
export PYTHONPATH=~/event_imp_arg/src:$PYTHONPATH
export CORPUS_ROOT=~/corpora
Dataset
By default, assume the working directory is ~/event_imp_arg.
Prepare Training Data
Assume the directory to store all training data is ~/corpora/enwiki/.
- Create directories.
mkdir -p ~/corpora/enwiki/{extracted,parsed,scripts,vocab/raw_counts}
mkdir -p ~/corpora/enwiki/{word2vec/{training,space},indexed/{pretraining,pair_tuning}}
- Download the English Wikipedia dump. Note that the enwiki-20160901-pages-articles.xml.bz2 dump used in this paper is currently unavailable from the website, but the results shouldn't differ too much if you use a more recent dump.
pushd ~/corpora/enwiki
wget https://dumps.wikimedia.org/enwiki/20160901/enwiki-20160901-pages-articles.xml.bz2
popd
- Extract plain text from the Wikipedia dump using WikiExtractor. This should give you a list of bzipped files named wiki_xx.bz2 (starting from wiki_00.bz2) in ~/corpora/enwiki/extracted/AA/.
git clone git@github.com:attardi/wikiextractor.git ~/corpora/enwiki/wikiextractor
python ~/corpora/enwiki/wikiextractor/WikiExtractor.py \
-o ~/corpora/enwiki/extracted -b 500M -c \
--no-templates --filter_disambig_pages --min_text_length 100 \
~/corpora/enwiki/enwiki-20160901-pages-articles.xml.bz2
- Split each document into a separate file, with each paragraph on a single line. As the total number of documents is very large, we store them in multiple subdirectories (5000 documents each by default). The following command stores all documents in ~/corpora/enwiki/extracted/documents/xxxx/, where xxxx are subdirectories starting from 0000.
python scripts/split_wikipedia_document.py \
~/corpora/enwiki/extracted/AA ~/corpora/enwiki/extracted/documents \
--file_per_dir 5000
- Download the Stanford CoreNLP tool (version 3.7.0) and set the CORENLP_CP classpath variable for Java.
pushd ~
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
unzip stanford-corenlp-full-2016-10-31.zip
rm stanford-corenlp-full-2016-10-31.zip
popd
for j in ~/stanford-corenlp-full-2016-10-31/*.jar; do export CORENLP_CP="$CORENLP_CP:$j"; done
- Parse each document from step 4 one line at a time (each line represents one paragraph), with the following parameters:
java -cp $CORENLP_CP -Xmx16g edu.stanford.nlp.pipeline.StanfordCoreNLP \
-annotators tokenize,ssplit,pos,lemma,ner,depparse,mention,coref \
-coref.algorithm statistical -outputExtension .xml
- Note that the parsing is best done on a cluster with multiple nodes working in parallel; otherwise it will take a very long time. Detailed cluster commands are therefore not provided here, but a rough single-machine sketch is given after this step.
- It may also be a good idea to bzip the output .xml files to save space (they take over 100G even after compression).
- The output files should be stored in ~/corpora/enwiki/parsed/xxxx/yyy/, where xxxx and yyy are subdirectories. (We use multi-level subdirectories because the number of paragraphs is even larger than the number of documents in step 4.)
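- For reference, a rough single-machine sketch of parsing one split document paragraph by paragraph is shown below. The document file name, temporary paths, and the use of CoreNLP's -filelist and -outputDirectory options are illustrative assumptions, not the exact pipeline used for the paper; a real run should be distributed over many nodes.
# Hypothetical example: parse a single document from step 4, one paragraph per output file.
doc=~/corpora/enwiki/extracted/documents/0000/doc_0001   # assumed document file name
out=~/corpora/enwiki/parsed/0000/$(basename "$doc")      # target parsed/xxxx/yyy/ layout
tmp=$(mktemp -d)
mkdir -p "$out"
# Write each line (i.e. each paragraph) of the document to its own temporary file.
split -l 1 -d -a 4 "$doc" "$tmp/para_"
find "$tmp" -type f -name 'para_*' > "$tmp/filelist.txt"
java -cp $CORENLP_CP -Xmx16g edu.stanford.nlp.pipeline.StanfordCoreNLP \
    -annotators tokenize,ssplit,pos,lemma,ner,depparse,mention,coref \
    -coref.algorithm statistical -outputExtension .xml \
    -filelist "$tmp/filelist.txt" -outputDirectory "$out"
bzip2 "$out"/*.xml   # optional: compress parsed output to save space
rm -rf "$tmp"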
- Generate scripts from the CoreNLP-parsed documents (for all subdirectories xxxx and yyy).
python scripts/generate_event_script.py ~/corpora/enwiki/parsed/xxxx/yyy ~/corpora/enwiki/scripts/xxxx/yyy.bz2
- Count all tokens in the scripts (for all subdirectories xxxx), and build vocabularies for predicates, arguments (including named entities), and prepositions. The vocabularies are stored in ~/event_imp_arg/data/vocab/.
python scripts/count_all_vocabs.py ~/corpora/enwiki/scripts/xxxx/ ~/corpora/enwiki/vocab/raw_counts/xxxx
python scripts/sum_all_vocabs.py ~/corpora/enwiki/vocab/raw_counts data/vocab
- Generate word2vec training examples (for all subdirectories xxxx).
python scripts/prepare_word2vec_training.py ~/corpora/enwiki/scripts/xxxx/ ~/corpora/enwiki/word2vec/training/xxxx.bz2
- Train event-based word2vec model.
python scripts/train_word2vec.py \
--train ~/corpora/enwiki/word2vec/training \
--output ~/corpora/enwiki/word2vec/space/enwiki.bin \
--save_vocab ~/corpora/enwiki/word2vec/space/enwiki.vocab \
--sg 1 --size 300 --window 10 --sample 1e-4 --hs 0 --negative 10 \
--min_count 500 --iter 5 --binary 1 --workers 20
- [Optional] Generate autoencoder pretraining examples (for all subdirectories xxxx).
python scripts/prepare_pretraining_input.py \
~/corpora/enwiki/scripts/xxxx/ \
~/corpora/enwiki/indexed/pretraining/xxxx.bz2 \
~/corpora/enwiki/word2vec/space/enwiki.bin \
~/corpora/enwiki/word2vec/space/enwiki.vocab \
--use_lemma --subsampling
- Generate event composition training examples (for all subdirectories xxxx).
python scripts/prepare_pair_tuning_input.py \
~/corpora/enwiki/scripts/xxxx/ \
~/corpora/enwiki/indexed/pair_tuning/xxxx.bz2 \
~/corpora/enwiki/word2vec/space/enwiki.bin \
~/corpora/enwiki/word2vec/space/enwiki.vocab \
--use_lemma --subsampling --pair_type_list tf_arg \
--left_sample_type one --neg_sample_type one
Note: Steps 7, 8, 9, 11, and 12 are all described for a cluster environment where jobs can be parallelized across multiple nodes to speed things up; however, they can also be run on a single machine by looping through all subdirectories, as in the sketch below.
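For example, assuming the directory layout above, step 7 could be run on a single machine with a simple loop like the following sketch; the same pattern applies to steps 8, 9, 11, and 12 (the nested loop over yyy is only needed for step 7, since the later steps take a single scripts/xxxx directory as input).
# Hypothetical single-machine loop for step 7, iterating over all parsed/xxxx/yyy subdirectories.
for xdir in ~/corpora/enwiki/parsed/*/; do
    xxxx=$(basename "$xdir")
    mkdir -p ~/corpora/enwiki/scripts/"$xxxx"
    for ydir in "$xdir"*/; do
        yyy=$(basename "$ydir")
        python scripts/generate_event_script.py \
            "$ydir" ~/corpora/enwiki/scripts/"$xxxx"/"$yyy".bz2
    done
done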
Prepare OntoNotes Evaluation Data
- Download OntoNotes Release 5.0 from https://catalog.ldc.upenn.edu/ldc2013t19, and extract the content to ~/corpora/ontonotes-release-5.0.
- Install the OntoNotes DB Tool included in the release. Note that there might be a function naming issue you need to fix in ontonotes-db-tool-v0.999b/src/on/common/log.py: change def bad_data(...) to def bad_data_long(...) on line 180.
pushd ~/corpora/ontonotes-release-5.0/ontonotes-db-tool-v0.999b
python setup.py install
popd
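The rename from the note above can also be applied from the command line with a sed one-liner along these lines (a sketch; verify that line 180 is the definition to change, and apply it before running python setup.py install):
sed -i '180s/def bad_data(/def bad_data_long(/' \
    ~/corpora/ontonotes-release-5.0/ontonotes-db-tool-v0.999b/src/on/common/log.py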
- The nw/wsj corpus in OntoNotes only has a fraction of documents with gold coreference annotations, so we need to filter out those without annotations.
./filter_ontonotes_wsj.sh
- OntoNotes only contains constituency parses; convert them to dependency parses using UniversalDependenciesConverter from Stanford CoreNLP. (Check step 5 in Prepare Training Data for configuring CoreNLP.)
./convert_ontonotes_parse.sh bn/cnn
./convert_ontonotes_parse.sh bn/voa
./convert_ontonotes_parse.sh nw/xinhua
./convert_ontonotes_parse.sh nw/wsj
- Extract event scripts from OntoNotes documents, and build the OnShort and OnLong evaluation datasets in data/ontonotes/.
python scripts/build_ontonotes_dataset.py data/ontonotes --suppress_warning
Evaluation
Evaluation on OntoNotes
Follow the steps in these Jupyter Notebooks to reproduce the evaluation results on OntoNotes.
- notebooks/Ontonotes Evaluation (Baselines).ipynb
- notebooks/Ontonotes Evaluation (Event Composition Model).ipynb
Evaluation on G&C
Download the dataset by running
python scripts/download_gc_dataset.py
You will need to have the following corpora ready in ~/corpora/:
- penn-treebank-rel3: https://catalog.ldc.upenn.edu/ldc99t42
- propbank-LDC2004T14: https://catalog.ldc.upenn.edu/ldc2004t14
- nombank.1.0: https://catalog.ldc.upenn.edu/LDC2008T23
You will also need a CoreNLP-parsed WSJ corpus. (Check steps 5 and 6 in Prepare Training Data.)
Then follow the steps in the following Jupyter Notebook to reproduce the evaluation results on G&C.