This system is a syntax-semantics interface between Universal Dependencies (UD) and Discourse Representation Structures (DRS), as used in the Parallel Meaning Bank (PMB) project. The primary DRS target format is the Simplified Box Notation (SBN). The following images show the basic idea for the sentence "Tracy lost her glasses.", with the UD parse on the left and the target SBN graph on the right:
UD-Boxer is the system created for my thesis for the MSc Information Science at the University of Groningen.
Note that `thesis-hand-in-branch` is the branch that corresponds to the exact state of the system described in my thesis.
Additional development will be done on other branches.
These changes will not be merged into the hand-in branch!
- Thesis: https://arts.studenttheses.ub.rug.nl/31168/1/master_thesis_wessel_poelman.pdf
- Peer-reviewed COLING-2022 paper about the system: https://aclanthology.org/2022.coling-1.367/
A script with the required steps is included: `install.sh`.
Note, however, that it only illustrates the steps needed; you will probably need to edit it to make it work on your system.
- Create and activate a virtual environment (Python 3.9.7 was used in development) (Optional)
- Install `graphviz` to generate visualizations. (Optional)
  - On Debian-based systems: `apt install graphviz libgraphviz-dev pkg-config`
- Install GREW
  - Update December 2022: the newer GREW versions break the old graph format GREW had. Make sure to use these versions to make UD-Boxer work as is: `opam install grew.1.8.1 grewpy.1.8.0 libcaml-grew.1.9.2`. I will work on updating to the new format if I have time.
  - Follow the instructions on https://grew.fr/usage/install/ (NOTE: if you have trouble, make sure `ocamlbuild` is installed: `opam install ocamlbuild`)
  - Make sure `opam` is active by running `opam --version`
  - If `opam` is not available, try activating it with `eval $(opam env)`
  - Grew versions used in development
- Install dependencies with `pip install -r requirements/requirements.txt`
  - `requirements/requirements-ud.txt` contains alternative UD parsing systems (Optional)
  - `requirements/requirements-dev.txt` contains development libraries for testing, formatting etc. (Optional)
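Before running the system, it can help to confirm that the external tools from the steps above are reachable. The following is a minimal sketch, not part of the repository; it assumes the executables are named `opam`, `grew` and `dot` (the Graphviz renderer), which may differ on your system.

```python
import shutil
import sys


def check_prerequisites(tools=("opam", "grew", "dot")):
    """Return a dict mapping each tool name to its resolved path, or None if missing.

    The default tool names are assumptions about how the executables are called.
    """
    return {tool: shutil.which(tool) for tool in tools}


def python_version_matches(major=3, minor=9):
    """Return True when the running interpreter has the given major.minor version."""
    return sys.version_info[:2] == (major, minor)


if __name__ == "__main__":
    for tool, path in check_prerequisites().items():
        print(f"{tool}: {path or 'not found on PATH'}")
    if not python_version_matches():
        print("Note: Python 3.9.7 was used in development; this interpreter differs.")
```

If `opam` is reported as missing even though it is installed, activating it with `eval $(opam env)` in the same shell is the first thing to try.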
Run `fix_all.sh` to format and test the project (dev requirements needed).
The data comes from the Parallel Meaning Bank project (https://pmb.let.rug.nl/).
- This project uses version 4.0.0 of the PMB dataset, which can be downloaded from here: https://pmb.let.rug.nl/data.php
- There are some minor issues with this version of the dataset that will be fixed in future versions:
  - There are several synset ids that contain whitespace
  - There are some empty `*.sbn` documents
  - There are several cyclic SBN graphs (they should all be Directed Acyclic Graphs (DAGs))
  - Some SBN files contain constants that cannot be distinguished from indices. For example: `en/silver/p15/d3131`
  - Most of the docs with these issues are listed in the `misc` folder.
  - The system warns or errors when it encounters these issues.
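Two of the issues listed above (empty documents and cyclic graphs) can be detected generically. The sketch below is an illustration only and is not taken from the UD-Boxer code base; it assumes a graph has already been extracted as a list of `(source, target)` edge pairs, and the function names are hypothetical.

```python
from collections import defaultdict, deque


def is_empty_document(text):
    """Return True when a document contains nothing but whitespace."""
    return not text.strip()


def is_dag(edges):
    """Return True if the directed graph given as (source, target) pairs has no cycle.

    Uses Kahn's algorithm: repeatedly remove nodes that have no incoming edges.
    If some nodes can never be removed, they are part of (or behind) a cycle.
    """
    successors = defaultdict(list)
    in_degree = defaultdict(int)
    nodes = set()
    for source, target in edges:
        successors[source].append(target)
        in_degree[target] += 1
        nodes.update((source, target))

    queue = deque(node for node in nodes if in_degree[node] == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for successor in successors[node]:
            in_degree[successor] -= 1
            if in_degree[successor] == 0:
                queue.append(successor)

    # All nodes are removed exactly once only when the graph is acyclic.
    return removed == len(nodes)
```

A check like this could be run over a dataset up front to skip unusable documents, instead of handling the error at conversion time.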
The data splits only contain ids for gold data per language.
The splits are based on the experiment files in the PMB: `exp_data_4.0.0.zip` from https://pmb.let.rug.nl/releases/.
Note that not all docs are usable due to the above-mentioned issues, so the actual counts are slightly lower.
Split | English | Italian | German | Dutch |
---|---|---|---|---|
train | 7668 | 685 | 1738 | 539 |
test | 1048 | 461 | 547 | 491 |
dev | 1169 | 540 | 559 | 437 |
eval | 830 | - | - | - |
The PMB has a specific file structure that is handy to understand when using the system.
The `data/test_cases` directory has a similar layout and can be used to see what is happening.
What the specific files mean is explained in the usage section.
    <language-1> /
        <p-collection-1> /
            <document-1> /
                --- The existing PMB files per document ---
                <lang>.drs.clf
                <lang>.drs.sbn
                <lang>.drs.xml
                <lang>.met
                <lang>.parse.tags
                <lang>.raw
                <lang>.status
                <lang>.tok.iob
                <lang>.tok.off
                --- These items get added when using all options provided by main.py ---
                <lang>.ud.<ud-system>.conll
                <lang>.drs.penman
                <lang>.drs.lenient.penman
                viz /
                    <lang>.drs.png
                    <lang>.ud.<ud-system>.png
                --- Predictions from pmb_inference.py when using all options get stored here ---
                predicted /
                    output.penman
                    output.lenient.penman
                    output.sbn
                    output.png
            <document-2> /
                ...
        <p-collection-2> /
            ...
    <language-2> /
        ...
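To get a feel for this layout, the document folders can be listed with a few lines of Python. This is a rough sketch using only the file naming shown above; the helper names are hypothetical and are not part of the repository.

```python
from pathlib import Path


def list_documents(pmb_root, lang="en"):
    """Return the sorted document folders that contain a <lang>.drs.sbn file."""
    root = Path(pmb_root)
    return sorted(path.parent for path in root.rglob(f"{lang}.drs.sbn"))


def has_ud_parse(doc_dir, lang="en"):
    """Return True if the folder holds at least one <lang>.ud.<ud-system>.conll file."""
    return any(Path(doc_dir).glob(f"{lang}.ud.*.conll"))


if __name__ == "__main__":
    # data/test_cases mirrors the PMB layout, so it is a convenient place to try this.
    for doc in list_documents("data/test_cases"):
        status = "has UD parse" if has_ud_parse(doc) else "no UD parse yet"
        print(f"{doc}: {status}")
```

Searching recursively with `rglob` avoids hard-coding the folder depth, so the same helper should work regardless of how many levels sit between the root and the documents.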
To transform a single sentence to DRS in SBN format, run:
python inference.py --sentence "Tracy lost her glasses." --output_dir ./result
This stores a `conll` file of the UD parse and the generated `sbn` file in `./result`.
If you already have a UD parse in `conll` format at hand, run:
python inference.py --ud <path-to-conll-file> --output_dir ./result
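For readers unfamiliar with the input format, here is a minimal reader, assuming the `conll` file follows the standard 10-column, tab-separated CoNLL-U layout. The example parse is written by hand for illustration and is not actual parser output; this is not how UD-Boxer itself loads the file.

```python
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def read_conllu(text):
    """Parse CoNLL-U text into a list of sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # Skip comment lines such as "# text = ...".
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"Expected 10 tab-separated columns, got: {line!r}")
        current.append(dict(zip(CONLLU_FIELDS, columns)))
    if current:
        sentences.append(current)
    return sentences


# Hand-written illustrative parse of the example sentence.
ROWS = [
    ("1", "Tracy", "Tracy", "PROPN", "_", "_", "2", "nsubj", "_", "_"),
    ("2", "lost", "lose", "VERB", "_", "_", "0", "root", "_", "_"),
    ("3", "her", "her", "PRON", "_", "_", "4", "nmod:poss", "_", "_"),
    ("4", "glasses", "glasses", "NOUN", "_", "_", "2", "obj", "_", "_"),
    ("5", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"),
]
EXAMPLE = "# text = Tracy lost her glasses.\n" + "\n".join("\t".join(r) for r in ROWS) + "\n"

if __name__ == "__main__":
    for token in read_conllu(EXAMPLE)[0]:
        print(token["id"], token["form"], token["deprel"], "->", token["head"])
```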
There are a number of additional tools and options included apart from the main graph transformations:
python inference.py \
--sentence "Tracy lost her glasses." \
--output_dir ./result \
--store_visualizations \
--store_penman
This stores an AMR-like output of the SBN in Penman notation as well as visualizations of the UD parse and the SBN graph.
The `*.drs.penman` file includes everything, the `*.drs.lenient.penman` file does not include the sense number.
The regular Penman output indirectly also targets word sense disambiguation when scoring the output (with SMATCH for instance).
The lenient option does not do this, but does reward correct lemmas and parts of speech for a given synset.
For more details and additional options, run `inference.py --help`.
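The difference between the strict and lenient variants can be illustrated on a single synset id. The sketch below assumes ids of the form `lemma.pos.sense` (for example `lose.v.01`); the helper names are hypothetical, and the actual layout of the lenient Penman file may differ from this simplified view.

```python
def split_synset(synset_id):
    """Split 'lemma.pos.sense' into its three parts.

    Splitting from the right keeps lemmas that contain a dot intact.
    """
    lemma, pos, sense = synset_id.rsplit(".", 2)
    return lemma, pos, sense


def strict_match(predicted, gold):
    """Strict comparison: lemma, part of speech and sense number must all agree."""
    return split_synset(predicted) == split_synset(gold)


def lenient_match(predicted, gold):
    """Lenient comparison: ignore the sense number, compare lemma and part of speech."""
    return split_synset(predicted)[:2] == split_synset(gold)[:2]


if __name__ == "__main__":
    print(strict_match("lose.v.01", "lose.v.02"))   # senses differ
    print(lenient_match("lose.v.01", "lose.v.02"))  # lemma and pos agree
```

Under strict comparison a wrong sense number counts as an error, which is why the regular output indirectly evaluates word sense disambiguation as well.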
The PMB does not come with UD parses or SBN graphs in Penman notation.
The script `main.py` can be used to interact with the PMB to gather information, store required files, train certain components used for inference and generate visualizations.
python main.py --starting_path <path-to-pmb-dataset> \
--store_ud_parses \
--search_dataset \
--extract_mappings \
--error_mine \
--store_visualizations \
--store_penman
This will recursively go through all PMB docs, perform all available operations on the data and generate all files required to run inference.
For more details and additional options, run `main.py --help`.
To store the results within the PMB dataset file structure and evaluate the generated output, you can use `pmb_inference.py`.
python pmb_inference.py -p data/test_cases -r results.csv
This will recursively go through the provided path, generate SBN graphs from the `conll` files it finds and compare these with the `.penman` file in the same folder.
It will store the predicted `.penman` files in the dataset and write the evaluation scores to `results.csv` for later analysis.
By default, the system will use 16 threads to go through the dataset and generate results.
Again, there are several additional options:
python pmb_inference.py -p <path-to-pmb-dataset> \
--results_file results.csv \
--max_workers 32 \
--clear_previous \
--store_visualizations \
--store_sbn
This will go through the dataset with 32 workers, writing the results to `results.csv`, clearing previously predicted files if they exist, and storing visualizations of the generated output as well as the generated SBN itself.
For more details and additional options, run `pmb_inference.py --help`.
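The worker-based traversal described above follows a common pattern that can be sketched with the standard library. This is an assumption about the general approach, not a copy of what `pmb_inference.py` does; `process_document` is a hypothetical placeholder for the per-document work.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def process_document(conll_path):
    """Hypothetical placeholder for the per-document work (conversion and scoring)."""
    return {"document": str(conll_path.parent), "conll_file": conll_path.name}


def run_over_dataset(root, max_workers=16):
    """Apply process_document to every *.conll file below root using a thread pool."""
    conll_files = sorted(Path(root).rglob("*.conll"))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the results in the same order as the input files.
        return list(pool.map(process_document, conll_files))


if __name__ == "__main__":
    for row in run_over_dataset("data/test_cases", max_workers=4):
        print(row)
```

More workers mainly help when the per-document work waits on I/O or external processes, so the best value for `--max_workers` depends on the machine.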
If you want to evaluate existing AMR-like (Penman) parses without running the whole inference pipeline, you can use SMATCH via `mtool` (which is included in the requirements):
mtool --read amr --score smatch --gold <path-to-gold> <path-to-test>
An example script on how to use this with SBN files is also included: `example_scripts/evaluate.py`.
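SMATCH reports precision, recall and F-score over matching triples between a predicted and a gold graph. The following sketch only shows how those three numbers relate once the triple counts are known; finding the best variable mapping, which is the hard part, is left to `mtool`. The function name and arguments are hypothetical.

```python
def f_score(matched, predicted_total, gold_total):
    """Return (precision, recall, f1) from triple counts.

    matched:         number of predicted triples that also occur in the gold graph
    predicted_total: number of triples in the predicted graph
    gold_total:      number of triples in the gold graph
    """
    precision = matched / predicted_total if predicted_total else 0.0
    recall = matched / gold_total if gold_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    precision, recall, f1 = f_score(matched=3, predicted_total=4, gold_total=6)
    print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```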
A number of notebooks have been used in the development and analysis of `ud-boxer`.
These also include some experiments and miscellaneous parts.
In order to not clog up this repository, these have been moved to their own repository.
This repo can be accessed as a git submodule in the current repo, as well as directly: https://github.com/WPoelman/ud-boxer-notebooks.
Note that you probably need to move the notebooks to the `ud-boxer` project root if they need access to `ud-boxer` components.
Another option is to change your Python path.
We will add the option to install `ud-boxer` as a package at a later moment.
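Changing the Python path from inside a notebook can look roughly like this. The relative location `..` in the usage example is an assumption about where the notebooks live relative to the project root; adjust it to your checkout.

```python
import sys
from pathlib import Path


def add_to_python_path(project_root):
    """Put project_root at the front of sys.path (once) and return the resolved path."""
    root = str(Path(project_root).resolve())
    if root not in sys.path:
        sys.path.insert(0, root)
    return root


if __name__ == "__main__":
    # Assumed layout: the notebook folder sits directly inside the ud-boxer project root.
    print(add_to_python_path(".."))
```

Run this in the first cell of a notebook, before importing any project components.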
- Support enhanced UD annotations (need CoreNLP binding: https://stanfordnlp.github.io/CoreNLP/depparse.html or keep an eye on this: stanfordnlp/stanza#359); these are essential for certain case markings.
- Generated SBN does not adhere to the spec entirely, in particular when it comes to constants. The evaluation in Penman notation is not affected by this since everything gets consistently quoted there:
  - Names are not quoted properly as `"John Doe"`; currently they are formatted as `John_Doe`. Fixing this requires keeping track of an additional flag/type indicating if something is a name. I am a bit hesitant to do this since we might introduce a NER component that will help with this a lot.
  - Year constants are not quoted properly like `'2022'`; currently they are formatted as `2022`. This is again something for a possible NER component, since date parsing in general is not ideal currently.
- p00/d0004: `entity` that combines multiple subtypes
- p00/d0801: multiple boxes
- p00/d1593: negation
- p00/d2719: case marking / pivot -> enhanced UD needed!
- p03/d2003: named entity 'components' combining in single item
- p04/d0778: double negation
- p04/d1646: connect owner
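The constant quoting issue mentioned in the improvements list (names as `"John Doe"`, years as `'2022'`) could be approached along the following lines. This is a hypothetical sketch, not code from the repository: the `is_name` flag stands in for the additional type information mentioned there, and treating every four-digit string as a year is a deliberate simplification that a NER or date component would need to replace.

```python
import re

# Simplifying assumption: any string of exactly four digits is a year.
YEAR_PATTERN = re.compile(r"^\d{4}$")


def format_constant(value, is_name=False):
    """Quote a constant: names in double quotes, years in single quotes.

    Anything else is returned unchanged.
    """
    if is_name:
        # Names are currently joined with underscores, e.g. John_Doe.
        return '"' + value.replace("_", " ") + '"'
    if YEAR_PATTERN.match(value):
        return f"'{value}'"
    return value


if __name__ == "__main__":
    print(format_constant("John_Doe", is_name=True))
    print(format_constant("2022"))
    print(format_constant("speaker"))
```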