Orange-OpenSource/conllueditor

Alignment between text and syntactic tree

Hi!

There seems to be an issue when editing the SpaceAfter value in the MISC field: the text is not readjusted accordingly, and so an alignment error ensues. But this should be no different from the manipulations of MWT and similar things, right?

This is a change I made on purpose (a while ago) to help the annotator see errors when the # text is not coherent with the Space(s)After keys in MISC. Imagine a sentence automatically "annotated" by a parser, which is then going to be corrected manually. Suppose the parser missed more than one Space(s)After and the user corrects one of them. If ConlluEditor updated the # text field, the second "bad" Space(s)After would go unnoticed: instead of the bad MISC field being corrected, CE would have changed # text. For sentsplit and split I agreed that the annotators (should) know what they are doing, so adapting # text automatically seems the best option. But here I'm in doubt. What do you think?

I admit I do not understand the issue well... do you have an example? Anyway, when correcting SpaceAfter, even if the text is not automatically recompiled, shouldn't there be a way to do it inside conllueditor?

You can edit the text by clicking on edit metadata.

As an example of what I meant above, imagine a sentence to be validated:

# text = veni, vidi, vici
1       veni    venire  VERB    _       _       0       root    _       _
2       ,       ,       PUNCT   _       _       1       punct   _       _
3       vidi    videre  VERB    _       _       1       conj    _       _
4       ,       ,       PUNCT   _       _       3       punct   _       _
5       vici    vincere VERB    _       _       1       conj    _       _

The commas (tokens 2 and 4) are not preceded by a space in # text, but tokens 1 and 3 lack SpaceAfter=No. With the current version, CE tells you that there is an incoherence at the first comma. If you add the missing SpaceAfter=No, CE will tell you that there is still an incoherence at the second comma. However, if CE adapted # text according to the Space(s)After of the tokens, then once you had added the first SpaceAfter=No to veni, CE would silently change # text to # text = veni, vidi , vici and show that everything is fine.
But now CE would have modified the original text to match token 3, instead of insisting that you check token 3 and either add a SpaceAfter=No or manually modify # text. Since in many treebanks the # text line seems to be taken from a corpus and therefore should not be modified, I do not want to break this by an automatic adaptation.
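
The difference between the two behaviours can be sketched in a few lines of Python. This is only an illustration, not ConlluEditor's actual code: the function names parse_tokens, rebuild_text and first_incoherence are hypothetical, only SpaceAfter=No is handled (not SpacesAfter), and the check reports the token whose SpaceAfter value disagrees with # text, i.e. the token just before each comma.

```python
TEXT = "veni, vidi, vici"
EXAMPLE = """\
1 veni venire VERB _ _ 0 root _ _
2 , , PUNCT _ _ 1 punct _ _
3 vidi videre VERB _ _ 1 conj _ _
4 , , PUNCT _ _ 3 punct _ _
5 vici vincere VERB _ _ 1 conj _ _
""".splitlines()


def parse_tokens(conllu_lines):
    """Return a list of (form, space_after) pairs from CoNLL-U token lines."""
    tokens = []
    for line in conllu_lines:
        cols = line.split()
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip comments, multiword-token ranges and empty nodes
        form, misc = cols[1], cols[9]
        tokens.append((form, "SpaceAfter=No" not in misc.split("|")))
    return tokens


def rebuild_text(tokens):
    """What an automatic adaptation of # text would produce."""
    parts = []
    for form, space_after in tokens:
        parts.append(form)
        if space_after:
            parts.append(" ")
    return "".join(parts).rstrip()


def first_incoherence(text, tokens):
    """Return the 1-based index of the first token that does not line up
    with text, or None if text and tokens agree."""
    pos = 0
    for i, (form, space_after) in enumerate(tokens, start=1):
        if not text.startswith(form, pos):
            return i  # the form itself is not at the expected position
        pos += len(form)
        has_space = text[pos:pos + 1] == " "
        if i < len(tokens) and has_space != space_after:
            return i  # SpaceAfter disagrees with the text
        if has_space:
            pos += 1
    return None if pos == len(text) else len(tokens)


tokens = parse_tokens(EXAMPLE)
print(first_incoherence(TEXT, tokens))  # 1: veni lacks SpaceAfter=No

tokens[0] = ("veni", False)             # the annotator fixes token 1 only
print(first_incoherence(TEXT, tokens))  # 3: vidi is still flagged
print(rebuild_text(tokens))             # veni, vidi , vici
```

With the check-only approach the second error stays visible after the first fix, whereas overwriting # text with the result of rebuild_text would hide it and alter the original text.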

Ah, OK, it is clear. It is indeed an issue. I did not get the possible metadata change, but probably it is good as it is now, since I am thinking of occasional corrections, not systematic mass editing.

Thanks for the explanations!