UCCA: Broken token ranges in companion data
Closed this issue · 15 comments
In the companion data, we have the following phenomenon:
#291046-0001
1 Hams ham NOUN NNS _ 0 root _ TokenRange=0:4
2 on on ADP IN _ 3 case _ TokenRange=5:7
3 Friendly friendly PROPN JJ _ 1 nmod _ TokenRange=8:16
4 … … PUNCT , _ 1 punct _ TokenRange=17:20
5 RIP RIP PROPN NNP _ 1 parataxis _ TokenRange=21:24
The input string for this sentence is "Hams on Friendly … RIP". In the TokenRanges, the ellipsis "…" counts as three characters, but it is only one character in the input. This makes mtool crash during evaluation because it tries to access a string position that doesn't exist. Right now, we simply copy the token ranges and use them as our anchors.
Is that something we want to fix or do we complain and ask for better companion data?
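To see which tokens are affected, something like the following could be run over the companion data. This is only a minimal sketch: `invalid_token_ranges` is a hypothetical helper (not part of mtool or UDPipe), and it assumes the input string really contains the one-character ellipsis, as we read it.

```python
import re


def invalid_token_ranges(text, conllu_lines):
    """Return (form, start, end) for every TokenRange that does not fit `text`.

    Hypothetical helper for illustration; not part of mtool or UDPipe.
    """
    bad = []
    for line in conllu_lines:
        match = re.search(r"TokenRange=(\d+):(\d+)", line)
        if match is None:
            continue  # comment lines or tokens without a range
        start, end = int(match.group(1)), int(match.group(2))
        form = line.split()[1]  # second column is the surface form
        # Flag ranges that run past the string or do not cover the form.
        if end > len(text) or text[start:end] != form:
            bad.append((form, start, end))
    return bad


# Assumption: the input uses the single-character ellipsis U+2026.
text = "Hams on Friendly \u2026 RIP"
lines = [
    "1 Hams ham NOUN NNS _ 0 root _ TokenRange=0:4",
    "2 on on ADP IN _ 3 case _ TokenRange=5:7",
    "3 Friendly friendly PROPN JJ _ 1 nmod _ TokenRange=8:16",
    "4 \u2026 \u2026 PUNCT , _ 1 punct _ TokenRange=17:20",
    "5 RIP RIP PROPN NNP _ 1 parataxis _ TokenRange=21:24",
]
print(invalid_token_ranges(text, lines))
```

Under that assumption, the check flags the ellipsis token (its range covers three characters instead of one) and "RIP" (its range ends at 24, but the string is only 22 characters long).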
I had forwarded this from Omri before, although I don’t think it helps here:
Regarding the tokenization protocol: For the EWT and the WSJ parts, we simply kept the tokenization found in the corpora (i.e., UD and PTB tokenization respectively). For the Wiki corpus, tokenization was done with a simple tokenizer that was later hand-corrected. There are still probably a few mistakes in the tokenization that haven't been spotted by the annotators -- thanks for referring us to them!
The companion data was automatically tokenized and processed with UDPipe (see here: http://mrp.nlpl.eu/index.php?page=4#companion). Since we do not provide gold tokenization for the test data, but only automatically processed data (also with UDPipe), we decided to distribute automatic tokenization in the companion package for the training data too. This means that there are inevitably discrepancies between the tokenization in the companion data, and the manual tokenization (albeit with a few errors, evidently) included in the UCCA gold training data.
In order to address these non-uniformities (and the non-uniform tokenization between the meaning representations), the evaluation tool is designed to be somewhat forgiving to tokenization errors, and only penalizes such incongruencies at the anchoring level.
Happy to write the organizers again to complain, since Omri notes there are errors!
Well, there are two points here:
- this is not a question of gold tokenization vs. automatic tokenization, but an error in the automatic tokenization
- I think it's mtool's attempt to be "forgiving" that crashes the evaluation: the error occurs in a function called "normalize" that steps through the string and then tries to access a position that doesn't exist.
I guess the clean way would be for the organizers to provide automatic tokenization whose token ranges actually match the corresponding substrings of the input string, but maybe it can also be solved in mtool, or with some hacky fix on our side.
Whatever happens, mtool should not crash. So this is definitely something the organizers need to fix somehow.
Meanwhile, I can ask Weiwei how they dealt with that in Beijing.
But to clarify: what exactly is the discrepancy? Does the string do a special Unicode thing where "..." is a single character? And then UDPipe somehow unfolds that into three dots to calculate the token range?
Yes, I think that's what's happening.
Facepalm.
What specifically would you like me to request from the organizers?
Can somebody create an issue in mtool with an example that makes mtool crash because an anchoring goes beyond the length of the input string?
> What specifically would you like me to request from the organizers?
Ideally, I would like to get updated companion data where all token ranges are valid positions in the input string. If they cannot or don't want to make an update so close to the deadline, we'll have to do it ourselves, in some way.
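If we end up doing it ourselves, one option is to recompute the ranges by searching for each token in the input string from left to right. The sketch below is just that, a sketch: `realign` and the `VARIANTS` table are hypothetical, and which normalizations the companion tokenization actually applied (ellipses, quotes, dashes) is an assumption that would need checking against the data.

```python
# Hypothetical table: normalized token form -> spellings that might appear
# in the raw input. Only the ellipsis case is known from this issue.
VARIANTS = {
    "\u2026": ["..."],
}


def realign(text, forms):
    """Recompute (start, end) ranges for `forms` by scanning `text` left to right."""
    ranges = []
    cursor = 0
    for form in forms:
        best = None
        for candidate in [form] + VARIANTS.get(form, []):
            pos = text.find(candidate, cursor)
            # Prefer the earliest match among all spellings of this token.
            if pos != -1 and (best is None or pos < best[0]):
                best = (pos, pos + len(candidate))
        if best is None:
            raise ValueError(f"cannot align token {form!r} after position {cursor}")
        ranges.append(best)
        cursor = best[1]
    return ranges


forms = ["Hams", "on", "Friendly", "\u2026", "RIP"]
print(realign("Hams on Friendly \u2026 RIP", forms))
print(realign("Hams on Friendly ... RIP", forms))
```

The same token list aligns against either spelling of the input, so the resulting ranges are valid positions in whatever string the MRP file actually contains.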
Email sent.
> Can somebody create an issue in mtool with an example that makes mtool crash because an anchoring goes beyond the length of the input string?
I can create such an issue, but can you describe precisely how to reproduce the crash? Could you copy & paste the MRP input for the sentence you mentioned above? What call to mtool causes the crash?
python3 main.py --read mrp --score mrp --gold bad_ucca.txt bad_ucca.txt
bad_ucca.txt
The nodes shouldn't contain labels in the case of UCCA, but I don't think it has an effect on whether it crashes or not.
hi lucia,
the 'input' field in the MRP file is the original string, and i would be very surprised if the corresponding sub-string were not the three-character sequence '...'. the tokenization has normalized and disambiguated quote marks (including multi-character LaTeX-style), dashes, ellipses, and such. it sounds as if you end up with invalid 'input' values in your parsing results? those strings must be unchanged from what is in the MRP files; you cannot manufacture an 'input' by detokenizing the companion trees.
more later tonight! oe
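A quick sanity check along these lines could catch both problems before scoring. It is a rough sketch only: `check_prediction` is a hypothetical helper, not an mtool function, and the example graphs are made up. The one thing it relies on is that each MRP line is a JSON object with an "input" string and nodes whose "anchors" are lists of "from"/"to" offsets. The three-dot gold input below follows what is described above and has not been verified against the actual file.

```python
import json


def check_prediction(gold_line, predicted_line):
    """List problems with a predicted MRP graph relative to the gold MRP line.

    Hypothetical helper for illustration; not part of mtool.
    """
    gold = json.loads(gold_line)
    pred = json.loads(predicted_line)
    problems = []
    # The input string must be copied unchanged from the MRP file.
    if pred.get("input") != gold.get("input"):
        problems.append("input string differs from the one in the MRP file")
    length = len(gold.get("input", ""))
    for node in pred.get("nodes", []):
        for anchor in node.get("anchors") or []:
            # Every anchor must be a valid span of the original input.
            if not 0 <= anchor["from"] <= anchor["to"] <= length:
                problems.append(
                    f"node {node['id']}: anchor {anchor['from']}:{anchor['to']} "
                    f"outside 0:{length}"
                )
    return problems


# Made-up example graphs; the gold input with three dots is an assumption.
gold = json.dumps({"id": "291046-0001", "input": "Hams on Friendly ... RIP", "nodes": []})
pred = json.dumps({
    "id": "291046-0001",
    "input": "Hams on Friendly \u2026 RIP",  # manufactured by detokenizing
    "nodes": [
        {"id": 0, "anchors": [{"from": 21, "to": 24}]},
        {"id": 1, "anchors": [{"from": 21, "to": 30}]},
    ],
})
for problem in check_prediction(gold, pred):
    print(problem)
```

Here the check reports the altered input string and the one anchor that runs past the end of the original input; the 21:24 anchor is fine against the three-dot string.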
Stephan just pushed a new version of mtool which fixes this crash, so I'm closing this issue:
hilbert:mtool koller$ python main.py --read mrp --score mrp --gold bad_ucca.txt bad_ucca.txt
{"n": 1,
"exact": 1,
"tops": {"g": 0, "s": 0, "c": 0, "p": 0.0, "r": 0.0, "f": 0.0},
"labels": {"g": 5, "s": 5, "c": 5, "p": 1.0, "r": 1.0, "f": 1.0},
"properties": {"g": 0, "s": 0, "c": 0, "p": 0.0, "r": 0.0, "f": 0.0},
"anchors": {"g": 5, "s": 5, "c": 5, "p": 1.0, "r": 1.0, "f": 1.0},
"edges": {"g": 6, "s": 6, "c": 6, "p": 1.0, "r": 1.0, "f": 1.0},
"attributes": {"g": 0, "s": 0, "c": 0, "p": 0.0, "r": 0.0, "f": 0.0},
"all": {"g": 16, "s": 16, "c": 16, "p": 1.0, "r": 1.0, "f": 1.0},
"time": 0.0018801689147949219,
"cpu": 0.0015930000000000666}
I find it a bit disappointing that the organizers take the position that the tokens they gave us in the companion data should not be used as input to the parser, and I told them so in a comment on cfmrp/mtool#64. Given that there are no whitelisted tokenizers, I don't understand how they expect us to get tokens from input.mrp. Weiwei Sun told me last night that they implemented their own tokenizer from scratch in Peking, but that can't be the purpose of this exercise.
I'm opening another issue for exploring the extent to which mismatches between the assumed tokenizations of input.mrp and udpipe.mrp are going to hurt our scores in other formalisms as well.
@namednil Can you please rerun the training-set evaluation for UCCA with the new version of mtool so we can make some progress on this?
@mariomgmn Please rerun your recent evaluations with the new mtool, because scores may have changed.