A set of Python scripts to segment text into Elementary Discourse Units.
The scripts are designed to work with the (conll-format) data from the 2019 DISRPT shared task (https://github.com/disrpt/sharedtask2019). The random forest and LSTM approaches perform the same task, but score differently on the corpora from the shared task.
`baseline.py` runs the baseline system, assuming a segment boundary at the start of every sentence and, if the `-c` option is present, also after every comma. Specify the path to the (conll-format) test file; the labels are predicted and written to a file suffixed `.baseline.predicted` next to the original test file.
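For illustration, here is a minimal sketch of the baseline idea (a hypothetical helper, not the actual implementation of `baseline.py`; the non-boundary label `_` is an assumption about the conll format):

```python
# Hypothetical sketch: predict a boundary at every sentence start and,
# optionally, after every comma (the -c behaviour described above).
def baseline_labels(sentences, split_on_commas=False):
    """sentences: list of token lists. Returns one label per token."""
    labels = []
    for sentence in sentences:
        for i, token in enumerate(sentence):
            if i == 0:
                labels.append("BeginSeg=Yes")  # boundary at sentence start
            elif split_on_commas and sentence[i - 1] == ",":
                labels.append("BeginSeg=Yes")  # boundary after a comma
            else:
                labels.append("_")  # assumed non-boundary label
    return labels

# baseline_labels([["Hello", ",", "world", "!"]], split_on_commas=True)
# -> ["BeginSeg=Yes", "_", "BeginSeg=Yes", "_"]
```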
`randomforest.py` trains and runs a random forest classifier and predicts on the test set. Specify the paths to the train, dev and test files (train and dev are combined for training) with the `--train`, `--dev` and `--test` options; the labels are predicted and written to a file suffixed `.randomforest.predicted` next to the original test file.
`lstm.py` trains and runs an LSTM and predicts on the test set. Specify the paths to the train, dev and test files (train and dev are combined for training) with the `--train`, `--dev` and `--test` options. Additionally, specify a value for the `--mode` option, either `int` or `ext`. If `int` is used, corpus-internal embeddings are used and the predicted labels are written to a file suffixed `.lstm_intembs.predicted` next to the original test file. If `ext` is used, pre-trained embeddings are used and the predicted labels are written to a file suffixed `.lstm_extembs.predicted`.
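A minimal sketch of what such a sequence-labelling LSTM could look like (the actual architecture and hyperparameters of `lstm.py` may differ; this uses Keras purely for illustration):

```python
# Assumed architecture sketch: padded sentences of word ids in, one
# per-token boundary probability out.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense

VOCAB_SIZE, EMB_DIM, MAX_LEN = 5000, 100, 50  # illustrative values

model = Sequential([
    # --mode int: the embedding layer is learned from the corpus itself.
    # --mode ext: its weights would instead be initialised from the
    # pre-trained fastText vectors (e.g. via an embeddings_initializer).
    Embedding(VOCAB_SIZE, EMB_DIM),
    Bidirectional(LSTM(128, return_sequences=True)),
    TimeDistributed(Dense(1, activation="sigmoid")),  # per-token boundary score
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Toy data: 32 padded sentences, one 0/1 boundary label per token.
X = np.random.randint(0, VOCAB_SIZE, size=(32, MAX_LEN))
y = np.random.randint(0, 2, size=(32, MAX_LEN, 1))
model.fit(X, y, epochs=1, batch_size=8)
```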
Because the labels for the PDTB-style corpora in the shared task differ (`Seg=(B|I)-Conn` instead of `BeginSeg=Yes` for all others), `post_edit_pdtb_output.py` modifies the labels accordingly. This script is thus only needed for PDTB-style corpora.
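As a loose illustration of the kind of relabelling this performs (the actual logic of `post_edit_pdtb_output.py` may differ; in particular, `I-Conn` continuations cannot be recovered from `BeginSeg=Yes` alone, so this sketch only handles the `B` case):

```python
# Hypothetical sketch of mapping generic boundary labels to the
# PDTB-style scheme; not the actual post-editing logic.
def to_pdtb_labels(labels):
    out = []
    for label in labels:
        if label == "BeginSeg=Yes":
            out.append("Seg=B-Conn")   # segment start -> connective start
        else:
            out.append(label)          # "_" and others pass through
    return out
```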
Install all Python requirements (`pip3 install -r requirements.txt`).
For `randomforest.py`, for the languages supported by the Stanford LexicalizedParser (German, English, French, Spanish and Chinese), this parser is used to extract features based on the (constituency) tree. Download the parser here: https://nlp.stanford.edu/software/lex-parser.shtml#Download

Then specify the paths accordingly in the `config.ini` file (the `stanfordParser`, `stanfordModels` and `path` properties) and make sure Java (8) is installed. Note that obtaining the parse trees can take a while (up to several hours). For the non-proprietary corpora in the shared task, the parse trees for all sentences are stored in the `picklejar` folder and read from there, which speeds up the process considerably.
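A hypothetical `config.ini` layout (the section name `stanford` and the example paths are assumptions; only the three property names come from this README), together with how such a file can be read using Python's standard library:

```python
# Sketch only: illustrates the property names mentioned above, not the
# repository's actual config.ini contents.
import configparser

example = """
[stanford]
stanfordParser = /opt/stanford-parser/stanford-parser.jar
stanfordModels = /opt/stanford-parser/stanford-parser-models.jar
path = /opt/stanford-parser/
"""

config = configparser.ConfigParser()
config.read_string(example)
print(config["stanford"]["stanfordParser"])
```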
For `lstm.py`, you have a choice of either using external/pre-trained embeddings or obtaining the embeddings from the corpus itself. The behaviour is controlled by the `--mode` (or `-m`) option, which accepts either `int` or `ext` as a value. External embeddings can be downloaded from https://fasttext.cc/docs/en/crawl-vectors.html (the `.vec` files, e.g. `cc.de.300.vec` for German).

Then specify the paths accordingly in the `config.ini` file (the `embeddings` section).
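For reference, the `.vec` format is plain text: a header line with the vocabulary size and dimensionality, followed by one word and its space-separated vector per line. A sketch of loading such a file (how `lstm.py` actually reads the embeddings may differ):

```python
# Illustrative loader for fastText .vec files; not the repository's code.
import numpy as np

def load_vec(path, limit=None):
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        count, dim = map(int, f.readline().split())  # header, e.g. "2000000 300"
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break  # optionally cap the vocabulary to save memory
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return embeddings

# vectors = load_vec("cc.de.300.vec", limit=100000)  # e.g. the German vectors
```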