You'll need some software and data to run the convert-redwoods.sh
and/or mrs_to_penman.py commands.
- pyDelphin
pip3 install pydelphin
-
svn co http://svn.delph-in.net/erg/trunk/tsdb/gold profiles
./convert-redwoods.sh
If you want to try different settings, edit parameters.json and add or
remove constraints. See below for a description of the constraints.
- ACE
For Linux:
wget http://sweaglesw.org/linguistics/ace/download/ace-0.9.25-x86-64.tar.gz -q -O - | tar xz
For Mac:
wget http://sweaglesw.org/linguistics/ace/download/ace-0.9.25-osx.tar.gz -q -O - | tar xz
You can install this to a suitable location by, e.g., moving the
ace-0.9.25/ directory into /opt/ and adding
PATH=/opt/ace-0.9.25/:"$PATH" to .bashrc. Alternatively, use the
--ace-binary option to the mrs_to_penman.py command.
- art (recommended)
Linux only:
wget http://sweaglesw.org/linguistics/libtsdb/download/art-0.1.9-x86-64.tar.gz -q -O - | tar xz
- ERG (grammar image compiled for ACE)
For Linux:
wget http://sweaglesw.org/linguistics/ace/download/erg-1214-x86-64-0.9.25.dat.bz2 -q -O - | bunzip2 > erg-1214-0.9.25.dat
For Mac:
wget http://sweaglesw.org/linguistics/ace/download/erg-1214-osx-0.9.25.dat.bz2 -q -O - | bunzip2 > erg-1214-0.9.25.dat
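Before running anything, it can help to confirm that the pieces above are in place. The following is only a sketch and not part of this repository: the helper name find_tools is hypothetical, and it assumes the binaries are named ace and art and that the grammar was saved as erg-1214-0.9.25.dat in the current directory, as in the commands above.

```python
import os
import shutil


def find_tools(names=("ace", "art"), grammar="erg-1214-0.9.25.dat"):
    """Report where each tool is found on PATH and whether the grammar exists.

    Hypothetical helper for illustration; not part of mrs_to_penman.py.
    """
    # shutil.which returns the full path, or None if the name is not on PATH
    report = {name: shutil.which(name) for name in names}
    report["grammar"] = grammar if os.path.isfile(grammar) else None
    return report


for item, location in find_tools().items():
    print(f"{item}: {location or 'NOT FOUND'}")
```

If ace is reported as not found, either fix PATH or pass its location with --ace-binary.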
For sentence data (one sentence per line), you can either pipe sentences
in via stdin or direct --input to a file containing sentences. In
either case, a grammar file is required (e.g. the ERG).
cat sentences.txt | mrs_to_penman.py --grammar erg-1214-0.9.25.dat
or
mrs_to_penman.py --grammar erg-1214-0.9.25.dat --input sentences.txt
If you have a parsed full profile (e.g. using art), you can point
--input to the profile's directory. The --grammar option is not
required in this case.
art -a 'ace -g erg-1214-0.9.25.dat' path/to/profile/
[..]
mrs_to_penman.py --input path/to/profile
Parsing with art then converting a whole profile will be faster than
parsing and converting sentence data, and probably more robust, too.
If the data includes very long or complicated sentences, processing can
take some time. Use -n1 (the default) to only unpack the top result per
input and --timeout=1 to limit processing to 1 second (if it takes
longer, no results will be returned for that sentence). These options
can be specified on mrs_to_penman.py or in the -a value of art (e.g.:
art -a 'ace -g erg-1214-0.9.25.dat -n1 --timeout 1' path/to/profile).
--grammar - path to a grammar file compiled with ACE
--input - path to sentence data; if a file, file is 1 sentence per line, and if a directory, directory is a profile; if not given, read stdin as though a file
-n - maximum number of results per input (default: 1)
--parameters - path to a JSON file for conversion parameters
--ace-binary - path to the ACE binary (default: ace)
--timeout - time to allow for parsing each item, in seconds (default: no limit)
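If you drive the tool from a script, the options above can be assembled programmatically. This is a minimal sketch under a few assumptions: build_command is a hypothetical helper (not part of this repository), mrs_to_penman.py is assumed to be invokable by that name as in the examples above, and only the documented options are used, spelled the way they appear in this README.

```python
def build_command(grammar=None, input_path=None, n=1, timeout=None,
                  parameters=None, ace_binary=None,
                  script="mrs_to_penman.py"):
    """Build an argument list for mrs_to_penman.py (hypothetical helper)."""
    cmd = [script]
    if grammar is not None:          # not needed when input_path is a profile
        cmd += ["--grammar", grammar]
    if input_path is not None:       # omit to have the tool read stdin
        cmd += ["--input", input_path]
    cmd.append(f"-n{n}")             # maximum results per input
    if timeout is not None:          # seconds allowed per item
        cmd.append(f"--timeout={timeout}")
    if parameters is not None:
        cmd += ["--parameters", parameters]
    if ace_binary is not None:
        cmd += ["--ace-binary", ace_binary]
    return cmd


print(build_command(grammar="erg-1214-0.9.25.dat",
                    input_path="sentences.txt", timeout=1))
```

The resulting list can be handed to subprocess.run(cmd, check=True) from the standard library.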
The --parameters option takes a path to a JSON file with information
used to customize the PENMAN graphs written by the tool. There are three
main ways of doing this: (1) allowing (whitelisting) relations, (2)
dropping (blacklisting) entire nodes, and (3) modifying attribute values
with regular expressions. If no parameters file is given, then all
possible information is encoded in the graphs.
{
"allow_relations": { ... },
"drop_nodes": [ ... ],
"substitute_attribute_value": { ... },
"default_attribute_value": "..."
}
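Since a parameters file is plain JSON, a quick sanity check before a long run can catch syntax errors (such as trailing commas) and misspelled keys. The sketch below is not part of the tool: check_parameters is a hypothetical helper, and whether the tool itself rejects or ignores unknown keys is not something this README states.

```python
import json

# The four top-level keys described in this README
KNOWN_KEYS = {
    "allow_relations",
    "drop_nodes",
    "substitute_attribute_value",
    "default_attribute_value",
}


def check_parameters(text):
    """Parse JSON text and return the set of unrecognized top-level keys.

    Raises json.JSONDecodeError if the text is not valid JSON.
    """
    params = json.loads(text)
    if not isinstance(params, dict):
        raise ValueError("parameters must be a JSON object")
    return set(params) - KNOWN_KEYS


# "drop_node" is a deliberate misspelling to show what gets reported
print(check_parameters('{"drop_nodes": ["udef_q"], "drop_node": []}'))
```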
Allowing relations has four subcategories: (a) global allow, (b) allow for individuals ("x"; nouny things), (c) allow for eventualities ("e"; verby things), and (d) allow for specific node types. For (a)--(c), the value is a simple list of relations that are allowed for that category. For (d), the value is a mapping of node types to lists of relations.
{
"allow_relations": {
"global": [
"ARG1-NEQ", "ARG1-EQ", ...
],
"x": [
"NUM", ...
],
"e": [
"TENSE", ...
],
"predicate": {
"pron": [
"PERS", "NUM", "GEND", ...
],
...
}
}
}
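To make the four categories concrete, here is one way such a filter could work. This is a sketch of one plausible reading (a relation is kept if any applicable category allows it), not the tool's actual implementation. The data layout (triples as tuples, a nodes mapping with "type" and "predicate" entries) and the sample values are assumptions made for illustration.

```python
def is_allowed(relation, node, allow):
    """Return True if `relation` is allowed on `node` by any category."""
    if relation in allow.get("global", []):
        return True
    # "x" (individuals) or "e" (eventualities)
    if relation in allow.get(node["type"], []):
        return True
    # relations allowed only for specific node types
    return relation in allow.get("predicate", {}).get(node["predicate"], [])


def filter_triples(triples, nodes, allow):
    """Keep only triples whose relation is allowed for the source node."""
    return [(s, r, t) for (s, r, t) in triples if is_allowed(r, nodes[s], allow)]


allow = {
    "global": ["ARG1-NEQ"],
    "x": ["NUM"],
    "e": ["TENSE"],
    "predicate": {"pron": ["PERS", "NUM", "GEND"]},
}
nodes = {
    "e2": {"type": "e", "predicate": "_sleep_v_1"},
    "x4": {"type": "x", "predicate": "pron"},
}
triples = [
    ("e2", "TENSE", "pres"),
    ("e2", "ARG1-NEQ", "x4"),
    ("e2", "MOOD", "indicative"),  # not allowed anywhere, so dropped
    ("x4", "PERS", "3"),
    ("x4", "NUM", "sg"),
    ("x4", "IND", "+"),            # not allowed anywhere, so dropped
]
print(filter_triples(triples, nodes, allow))
```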
Dropping nodes takes a simple list of node types. Any triple with a source or target anchored in a node of that type is dropped.
{
"drop_nodes": [
"udef_q",
"pronoun_q"
]
}
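The dropping rule can be sketched in a few lines. This is an illustration rather than the tool's code: the function name, the node_types mapping from node identifiers to node types, and the sample triples are all assumptions.

```python
def drop_nodes(triples, node_types, drop):
    """Remove every triple whose source or target is a node of a dropped type."""
    dropped = {node for node, ntype in node_types.items() if ntype in drop}
    return [
        (s, r, t) for (s, r, t) in triples
        if s not in dropped and t not in dropped
    ]


node_types = {"e2": "_bark_v_1", "x4": "_dog_n_1", "q5": "udef_q"}
triples = [
    ("e2", "ARG1-NEQ", "x4"),
    ("q5", "RSTR-H", "x4"),   # source is a udef_q node, so this is dropped
    ("x4", "NUM", "pl"),
]
print(drop_nodes(triples, node_types, ["udef_q", "pronoun_q"]))
```

Note that dropping a node removes its edges in both directions, which is one way a graph can end up disconnected.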
Attribute value substitutions have a key for a relation, then a list of
(match, substitution) pairs to apply in order. The pairs are processed
as regular expressions, so regular expression operators, including
backreferences, are allowed. Character escapes need to be doubly escaped
(once for JSON and once for the regular expression). Because an empty
string for an attribute value can cause a malformed graph, the
default_attribute_value key should specify what value to use in the case
that substitutions delete the entire value.
{
"substitute_attribute_value": {
"predicate": [
["\\(", "["],
["\\)", "]"]
]
},
"default_attribute_value": "..."
}
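The ordered substitution and the fallback to a default can be sketched with Python's re module. This is an illustration, not the tool's code: the function name is hypothetical, and the default value "_" is a placeholder chosen for the example, not a value this README prescribes. The JSON text is written as a raw string so that the double escaping is visible exactly as it would appear in a parameters file.

```python
import json
import re


def substitute(value, pairs, default):
    """Apply (match, substitution) regex pairs in order; fall back to default."""
    for pattern, replacement in pairs:
        value = re.sub(pattern, replacement, value)
    # An empty attribute value could produce a malformed graph
    return value if value else default


# "\\(" in JSON decodes to the two-character regex \( (a literal parenthesis)
params = json.loads(r'''
{
  "substitute_attribute_value": {
    "predicate": [["\\(", "["], ["\\)", "]"]]
  },
  "default_attribute_value": "_"
}
''')
pairs = params["substitute_attribute_value"]["predicate"]
default = params["default_attribute_value"]

print(substitute("f(x)", pairs, default))
# A backreference: strip a leading underscore, keep the rest
print(substitute("_dog_n_1", [["^_(.*)$", "\\1"]], default))
```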
It's possible that some parameter values could result in a graph that is
disconnected. In these cases, the graph will not be serialized (and
you should see an error message). Note that it's also possible to have a
disconnected graph originally.
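If you want to check connectivity yourself, for example while experimenting with parameters, a breadth-first search over the triples is enough. The sketch below is an assumption-laden illustration (edges are treated as undirected, and only triples between known nodes count as edges); how the tool performs its own check is not described here.

```python
from collections import deque


def is_connected(triples, nodes):
    """Return True if every node is reachable from every other node."""
    nodes = set(nodes)
    if not nodes:
        return True
    adjacency = {node: set() for node in nodes}
    for source, _, target in triples:
        # Attribute triples (target is a plain value) are not edges
        if source in nodes and target in nodes:
            adjacency[source].add(target)
            adjacency[target].add(source)
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current] - seen:
            seen.add(neighbor)
            queue.append(neighbor)
    return seen == nodes


triples = [("e2", "ARG1-NEQ", "x4"), ("x4", "NUM", "pl")]
print(is_connected(triples, ["e2", "x4"]))         # both nodes linked
print(is_connected(triples, ["e2", "x4", "x9"]))   # x9 has no edges
```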