cfmrp/mtool

mtool crashes on UCCA

alexanderkoller opened this issue · 9 comments

Mtool crashes when evaluating UCCA analyses in which the token ranges go beyond the length of the input string. This happens when the input string contains non-ASCII characters which are expanded to multiple characters by mtool.

It would be really good if this bug could be fixed very quickly, because we're currently flying blind with respect to how well we're doing on UCCA.

To reproduce, use the MRP file below (and name it bad_ucca.txt). The example is taken from the UCCA training data. Then run mtool as follows:

hilbert:mtool koller$ python main.py --read mrp --score mrp --gold bad_ucca.txt bad_ucca.txt
Traceback (most recent call last):
  File "main.py", line 342, in <module>
    main();
  File "main.py", line 206, in main
    id = arguments.id, n = arguments.n, i = arguments.i);
  File "main.py", line 88, in read_graphs
    for graph in graphs: graph.normalize(normalize);
  File "/Users/koller/Documents/workspace/mtool/graph.py", line 433, in normalize
    node.normalize(actions, self.input);
  File "/Users/koller/Documents/workspace/mtool/graph.py", line 74, in normalize
    for anchor in self.anchors: trim(anchor, input);
  File "/Users/koller/Documents/workspace/mtool/graph.py", line 63, in trim
    while j > i and input[j - 1] in score.core.PUNCTUATION: j -= 1;
IndexError: string index out of range

This is bad_ucca.txt:

{"id": "291046-0001", "framework": "ucca", "flavor": 1, "time": "2019-07-17 (10:43)", "version": "0.9", "input": "Hams on Friendly … RIP", "nodes": [{"anchors": [{"from": 0, "to": 4}], "id": 0, "label": "hams", "properties": [], "values": []}, {"anchors": [{"from": 5, "to": 7}], "id": 1, "label": "on", "properties": [], "values": []}, {"anchors": [{"from": 8, "to": 16}], "id": 2, "label": "friendly", "properties": [], "values": []}, {"anchors": [{"from": 17, "to": 20}], "id": 3, "label": "...", "properties": [], "values": []}, {"anchors": [{"from": 21, "to": 24}], "id": 4, "label": "rip", "properties": [], "values": []}, {"id": 4}, {"id": 5}, {"id": 6}], "edges": [{"source": 5, "target": 1, "label": "A"}, {"source": 5, "target": 2, "label": "S"}, {"source": 6, "target": 5, "label": "A"}, {"source": 5, "target": 3, "label": "A"}, {"source": 6, "target": 0, "label": "S"}, {"source": 6, "target": 4, "label": "U"}]}
oepen commented

thanks for reporting this issue, alexander! i have just committed a revision that makes mtool handle invalid anchoring more robustly:

./main.py --read mrp --trace --score mrp --gold data/score/ucca/koller.mrp data/score/ucca/koller.mrp 
{'from': 21, 'to': 24} (P)--> <21:22> (P)
{'from': 21, 'to': 24} (P)--> <21:22> (P)

in a nutshell, "from" and "to" values that fall outside the underlying "input" string will now be adjusted to range between 0 and the length of the string.

also, the validator has been extended to flag such cases:

./main.py --read mrp --validate all data/score/ucca/koller.mrp 
validate(): [E] graph #291046-0001; node #4: invalid anchor: {'from': 21, 'to': 24}

however, your sample graph makes me wonder whether you may be mis-interpreting the training graphs and companion trees? your "input" value is not the correct string for item 291046-0001, because it is only 22 characters long, whereas the original string has a length of 24. your string appears to have normalized the original three-character sequence ... to the one-character Unicode ellipsis symbol (U+2026). input strings on parser outputs submitted for evaluation must always correspond to the original "input" values distributed for the task (i.e. the training graphs or evaluation strings); otherwise, the anchoring becomes incomparable.

did you maybe arrive at the "input" value in your example graph by detokenizing the morpho-syntactic companion trees? i would be skeptical of such an approach, and if one were to go down this road one would have to be very deliberate about keeping track of the original sub-string ranges for each token.

in the companion data, the tokenizer has normalized different typographic conventions (e.g. LaTeX-style two-character `` to Unicode , -- to , or three-character ... to ) and disambiguated some punctuation marks. the "input" strings on different MRP files are ‘raw’, in the sense of coming directly from the underlying corpora; thus, they exhibit different conventions for quote marks, dashes, ellipses, and such. disambiguation of ‘straight’ quote marks requires access to the immediate string context, prior to tokenization. therefore, our tokenizer applies this disambiguation and further maps to a uniform, Unicode representation of various directional punctuation marks, etc. sometimes it will even be appropriate to consider the sequence . . . (with spaces, as it occurs in the raw WSJ texts) as a single token. in all of these cases, the anchoring on the nodes in the companion trees will reflect the original string, e.g. a sub-string of length six for the last ellipsis example.

in my experience, the normalization and disambiguation of tokens in the companion trees can be helpful to the parser. but it inevitably leads to mismatches between the length of the (normalized) token and its anchoring into the (original) string. therefore, if using our companion trees as parser inputs, i expect one should make sure to carry forward the corresponding anchoring throughout the parser. otherwise, one would have to recompute sub-string ranges against the original (unnormalized) "input" string, which would require special-coding for the various normalizations applied in the tokenizer.

Thank you for taking care of this so quickly!

Yes, the U+2026 comes from the companion data. We use the tokens from the companion data as input to our parser because there are no tokenizers on the whitelist, so tokenizing (and POS tagging and lemmatizing) the inputs from input.mrp is not really feasible unless we implement an entire low-level preprocessing toolchain from scratch. (Or do you see another option that I'm missing?)

It is quite inconvenient from our perspective that the preprocessing that went into the computation of the companion data causes such mismatches. We have been spending quite a bit of time on working around low-level preprocessing issues that could otherwise have been spent on improving the semantic parser.

Is there documentation anywhere that would spell out what normalizations the tokenizer has carried out in creating the tokens for the companion data?

oepen commented

using our companion tokenization (and lemmas and tags) as parser input is fine (the intended use of the companion package, for teams not starting from raw strings). but then you must also use the anchoring from the companion package and preserve our original "input" strings. it is the detokenizing of the companion trees that i meant to advise against.

the basic philosophy of the normalization applied in the tokenizer is discussed in the paper referenced in the README.txt of the companion package; more low-level documentation would have to be the actual tokenization rules (regular expressions) and generous comments:

http://svn.delph-in.net/erg/tags/1214/rpp/tokenizer.rpp
http://svn.delph-in.net/erg/tags/1214/rpp/quotes.rpp

the cases mentioned already cover what is normalized across different typographic conventions: quote marks, dashes, and ellipses. for all i recall, the CoreNLP tokenizer applies the same normalization and disambiguation, only its default target used to be the LaTeX-style conventions used in the PTB, rather than unicode.

—much like connecting laptops and projectors, tokenization remains one of the grand challenges in computing; when dealing with texts from multiple sources and conventions, i believe our string-level normalization should benefit your parser.

Just to clarify: When you say "detokenizing of the companion trees", you mean concatenate the tokens in the companion tree and then use that as the "input" field of our parser output?

oepen commented

yes, whatever process that had led to the invalid "input" value in your ‘bad’ MRP file, which appeared to reflect the normalzed tokens rather than the original string.

Okay, fair enough. We will check this on our side.

However, there remains the problem that at least for UCCA, the token ranges in the graph don't necessarily seem to match those of any tokens in the companion data. For instance in #509008 in wiki.mrp, the graph looks like this:

wiki.mrp:{"id": "509008", "flavor": 1, "framework": "ucca", "version": 0.9, "time": "2019-04-11 (22:04)", "input": "Released the same year and containing re-recorded tracks from Omikron, his album 'Hours ...' featured a song with lyrics by the winner of his \"Cyber Song Contest\" Internet competition, Alex Grant.",
...
{"id": 15, "anchors": [{"from": 81, "to": 82}]},
{"id": 16, "anchors": [{"from": 82, "to": 87}]},
{"id": 17, "anchors": [{"from": 88, "to": 91}]},
...

And the companion data looks like this:

{"id": "509008", "flavor": 0, "framework": "conllu", "version": 0.9, "time": "2019-05-21 (08:36)", "input": "Released the same year and containing re-recorded tracks from Omikron, his album 'Hours...' featured a song with lyrics by the winner of his \"Cyber Song Contest\" Internet competition, Alex Grant.",
...
{"id": 13, "label": "’Hours", "properties": ["lemma", "upos", "xpos"], "values": ["’hour", "NOUN", "NNS"], "anchors": [{"from": 81, "to": 87}]},
{"id": 14, "label": "…", "properties": ["lemma", "upos", "xpos"], "values": ["…", "PUNCT", ","], "anchors": [{"from": 87, "to": 90}]},
{"id": 15, "label": "’", "properties": ["lemma", "upos", "xpos"], "values": ["’", "PUNCT", "''"], "anchors": [{"from": 90, "to": 91}]},
...

Is there a defined way in which we should map the token ranges of the tokens in the companion data (= our input) to the expected token ranges in the UCCA annotation? Maybe I missed something. If not, I'd also be happy to open a new issue and we can discuss this further there.

oepen commented

it appears that for this example you are looking at the superseded version of the UCCA training data? in the update to the UCCA graphs (see the link on the task web site and repeated announcements by email and in the ‘updates’ web page), the "input" value is the same as in the companion package.

Oops, you're right. As far as I can tell, the updated UCCA corpus fixed that. Thanks!