Authors: Erwan Moreau, Ashjan Alsulaimani and Alfredo Maldonado
- Shared task website
- Shared Task data (gitlab): https://gitlab.com/parseme/sharedtask-data
- Our paper will be published in August, link coming soon (if I don't forget!).
This repository contains two distinct systems for detecting verbal multiword expressions (MWEs) in text. This short description assumes that the reader is familiar with the task; if not, please see the links above.
The first system (`dep-tree`) attempts to exploit the dependency tree structure of the sentences in order to identify MWEs. This is achieved by training a tree-structured CRF model which takes into account conditional dependencies between the nodes of the tree (node-parent, and possibly node-next-sibling). The system is also trained to predict MWE categories. The tree-structured CRF software used is XCRF.

The second system (`seq`) is a robust sequential method which can work with only lemmas and morphosyntactic tags. It uses the Wapiti CRF sequence labeling software.
- libxml2 must be installed to compile the dep-tree system, including the source libraries (header files);
  - on Ubuntu the most convenient way is to install the package `libxml2-dev`: `sudo apt install libxml2-dev`.
- CRF++ must be installed and accessible via `PATH`
- Wapiti must be installed and accessible via `PATH`
- The shared task data can be downloaded or cloned from https://gitlab.com/parseme/sharedtask-data (see the example after this list)
- XCRF is also required but provided in this repository
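For example, the shared task data can be cloned with git into the current directory:

```
git clone https://gitlab.com/parseme/sharedtask-data.git
```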
From the main directory run:

```
source setup-path.sh
```

This will compile the code if needed and add the relevant directories to `PATH`. You can add this to your `.bashrc` file in order to have the `PATH` set up whenever you open a new session.
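For instance, assuming the repository was cloned to `~/vmwe-systems` (a hypothetical path; adjust to your setup), one could append the following to `.bashrc`:

```
# run the PATH setup from the repository root in every new session
echo 'cd ~/vmwe-systems && source setup-path.sh && cd - >/dev/null' >> ~/.bashrc
```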
From the directory `dep-tree`:

```
train-test-class-method.sh -l sharedtask-data/1.1/FR/train.cupt -a sharedtask-data/1.1/FR/dev.cupt conf/minimal.conf model output
```
- `sharedtask-data` is the directory containing the official shared task data, as its name suggests (see link above); replace with the appropriate path.
- `-l` ("learn" option): indicates to perform training from the specified file.
- `-a` ("apply" option): indicates to perform testing on the specified file.
- `conf/minimal.conf`: the configuration file to use (see "Configuration files" in section "Details" below).
- `model` will contain the model at the end of the process.
- `output` is the "work directory"; at the end of the testing process it contains:
  - the predictions, stored in `<work dir>/predictions.cupt`;
  - the evaluation results, stored in `<work dir>/eval.out` if `-e` is used (see below).
- If option `-a` is supplied, `-e <training file>` can be used to perform evaluation (see the example below). The training file is required in order for the evaluation script (provided by the organizers) to count the cases seen in the training data.
- To run the script from a different directory, one has to provide the path to the XCRF directory in the following way:
```
train-test-class-method.sh -o '-x dep-tree/xcrf-1.0/' -l sharedtask-data/1.1/FR/train.cupt dep-tree/conf/minimal.conf model-dir output-dir
```
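For example, training, testing and evaluation can be combined in a single run (a sketch following the same pattern as the `seq` command further below; paths are illustrative):

```
train-test-class-method.sh -l sharedtask-data/1.1/FR/train.cupt -a sharedtask-data/1.1/FR/dev.cupt -e sharedtask-data/1.1/FR/train.cupt conf/minimal.conf model output
```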
CAUTION: RAM ISSUES. XCRF requires a lot of memory. Depending on the amount of training data, the number of features and the "clique level" option, it might crash even with as much as 64GB. Memory options can be passed to the Java VM (XCRF is implemented in Java) through option `-o`:

```
train-test-class-method.sh -o "-j '-Xms32g -Xmx32g' -x /path/to/xcrf-1.0/" ...
```
Scripts are provided to allow batch processing. In order to train and test the system each time with a distinct config file and dataset, the script `process-multiple-datasets.sh` can be used to generate the commands to run. This way the tasks can be started in parallel or in any other convenient way, ideally on a cluster.
```
# generate a few config files
mkdir configs; echo dep-tree/conf/basic.multi-conf | expand-multi-config.pl configs/
# generate the command to train and test for each dataset and each config file
process-multiple-datasets.sh sharedtask-data/1.1/ configs results >tasks
# split to run 10 processes in parallel
split -d -l 6 tasks batch.
# run
for f in batch.*; do (bash $f &); done
```
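The argument to `-l` depends on how many commands were generated. Alternatively (assuming GNU coreutils), one can check the count and split directly into a fixed number of batches:

```
wc -l < tasks                  # number of generated commands
split -d -n l/10 tasks batch.  # split into 10 batches of whole lines
```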
From the directory `seq`:

```
seq-train-test.sh -l sharedtask-data/1.1/FR/train.cupt -a sharedtask-data/1.1/FR/test.cupt -e sharedtask-data/1.1/FR/train.cupt conf/example.conf model output
```
- `sharedtask-data` is the directory containing the official shared task data, as its name suggests (see link above); replace with the appropriate path.
- `-l` ("learn" option): indicates to perform training from the specified file.
- `-a` ("apply" option): indicates to perform testing on the specified file.
- `conf/example.conf`: the configuration file to use (see "Configuration files" in section "Details" below).
- `model` will contain the model at the end of the process.
- `output` is the "work directory"; at the end of the testing process it contains:
  - the predictions, stored in `<work dir>/predictions.cupt`;
  - the evaluation results, stored in `<work dir>/eval.out` if `-e` is used (see below).
- If option `-a` is supplied, `-e <training file>` can be used to perform evaluation. The training file is required in order for the evaluation script (provided by the organizers) to count the cases seen in the training data.
Generating multiple configuration files:

```
crf-generate-multi-config.pl 3-4:8:5:1:1:C >seq.multi-conf
echo "labels=IO BIO BILOU" >>seq.multi-conf
mkdir configs; echo seq.multi-conf | expand-multi-config.pl configs/
```
- Columns 3 and 4 represent the lemma and POS tag, respectively.
- Alternatively, each column (feature) can be given separately, e.g. `crf-generate-multi-config.pl 3:6:2:1:1:C 4:8:5:1:1:C >seq.multi-conf`.
- This allows the combination of different patterns for each column, which might improve the model, but at the cost of multiplying the number of combinations (hence increasing the computation time).
Generating the commands and executing the tasks in parallel:

```
seq-multi.sh -e -t test.cupt sharedtask-data/1.1/ configs/ expe >tasks
# split to run processes in parallel
split -d -l 700 tasks batch.
# run
for f in batch.*; do (bash $f &); done
```
The scripts are meant to be used with configuration files which contain values for the parameters. Examples can be found in the directory `conf`. Additionally, a batch of configuration files can be generated using e.g.:

```
# generates a set of config files (written to directory 'configs')
mkdir configs; echo dep-tree/conf/large.multi-conf | expand-multi-config.pl configs/
```
In order to generate a different set of configurations, either customize the values that a parameter can take in `conf/options.multi-conf` or use the `-r` option to generate a random subset of config files, e.g.:

```
# generate a random 50 config files
mkdir configs; echo dep-tree/conf/large.multi-conf | expand-multi-config.pl -r 50 configs
```
The two approaches work with a sequential labelling scheme, as opposed to the numbering of the expressions per sentence used in the `cupt` format provided in the shared task. Scripts are provided to convert between the two formats.
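As an illustration, here is a hypothetical fragment showing the `cupt`-style MWE column next to the corresponding IO labels with category suffixes (simplified: the real `cupt` format has more columns, and the exact label strings may differ):

```
token      cupt MWE column   IO label
He         *                 O
made       1:LVC.full        I-LVC.full
a          *                 O
decision   1                 I-LVC.full
```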
```
cupt-to-bio-labels IO sharedtask-data/1.1/FR/train.cupt fr-train.io
```
- Note that the conversion entails a simplification, i.e. a loss of information: in the case of overlapping or nested expressions, the program discards one of the expressions (the shortest).
  - By default it adds the tokens corresponding to the discarded expression as if they belonged to the preserved expression. Alternatively, if `-k` is supplied, the tokens of the shortest expression are not added to the other.
- The labelling scheme must be one of the following (see the comparison after this item):
  - `IO`: only marks tokens as belonging to an expression or not;
  - `BIO`: special mark `B` for the first token of an expression; this allows the detection of multiple expressions per sentence;
  - `BILOU`: more sophisticated labelling scheme, with `L` for the last token and `U` for unit (single-token) expressions.
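  For illustration, here is a hypothetical sentence containing one two-token expression ("took off") under each scheme, with category suffixes omitted (`U` would mark a single-token expression):

  ```
  token     IO   BIO   BILOU
  The       O    O     O
  plane     O    O     O
  took      I    B     B
  off       I    I     L
  quickly   O    O     O
  ```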
- Categories (see the sketch after this list):
  - "joint": by default the program keeps the categories of the expressions as a suffix, thus generating a number of distinct labels up to three times the number of categories (if using BIO).
  - "indep": option `-c <category>` makes the program focus on a single category of expressions and ignore the others. This allows the training of independent models, one per category.
  - "none": option `-i` makes the program ignore categories and process all the expressions as if they all belonged to the same category.
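As a rough sketch of the resulting label sets, assuming the BIO scheme and just two categories `LVC.full` and `VID` (illustrative; the exact label strings may differ):

```
joint:            O, B-LVC.full, I-LVC.full, B-VID, I-VID
indep (-c VID):   O, B, I    (only VID expressions are labelled; others ignored)
none (-i):        O, B, I    (all expressions treated as a single category)
```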
Converting back to the `cupt` format:

```
bio-to-cupt-labels fr-train.io fr-train.cupt
```
- See also `merge-independent-labels` in order to merge categories back together after predicting them independently.