UniversalDependencies/tools

warnings about punctuation and non-projectivity are wrong in validate.py

arademaker opened this issue · 12 comments

Take the sentence CF0883-1. After a long discussion in LR-POR/cl-conllu#85, I believe that:

The script validate.py reports 3 warnings

[Line 28 Sent CF883-1 Node 22]: [L3 Syntax punct-causes-nonproj] Punctuation must not cause non-projectivity of nodes [4, 23, 24]

Token 4 is projective, so it makes no sense to report it as a token that became non-projective because of 22. Token 24 is non-projective, but it is punctuation, so it is already reported as the punct-is-nonproj case below.

[Line 28 Sent CF883-1 Node 22]: [L3 Syntax punct-is-nonproj] Punctuation must not be attached non-projectively over nodes [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

token 21 should be included in the list.

[Line 30 Sent CF883-1 Node 24]: [L3 Syntax punct-is-nonproj] Punctuation must not be attached non-projectively over nodes [22]

token 23 should be included in the list.

It is not uncommon that one error causes violation of multiple constraints, so I would not bother about node 24 being reported both as offender in punct-is-nonproj and offended in punct-causes-nonproj. When the tree is fixed, both messages will disappear.

Token 4 (or edge 26-4) is not considered projective because node 22 lies between 4 and 26 and is neither dominated by 4 nor directly dependent on 26. (It is however dominated by 26 and I am not sure if some definitions of projectivity wouldn't be satisfied with that.)

Not reporting the last node in the gap is a bug, probably something with Python ranges. I will look into it.
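For illustration only (this is not the actual validate.py code, and the variable names are made up): Python's `range` already excludes its end point, so an extra `- 1` on the upper bound would drop exactly the last node between a dependent and its head. The numbers below come from the third warning, where node 24 is attached to node 12 and token 23 is missing from the list.

```python
# Positions strictly between a head and its dependent, using the numbers
# from the third warning (node 24 attached to node 12).
head, child = 12, 24
lo, hi = min(head, child), max(head, child)

# range() excludes its end, so stopping at `hi` already skips the endpoint.
between = list(range(lo + 1, hi))

# Hypothetical off-by-one: subtracting 1 again silently drops the last node.
too_short = list(range(lo + 1, hi - 1))

print(between[-1], too_short[-1])  # 23 22
```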

So all treebanks have been re-tested with the fixed validator, and it turns out there are quite a few newly discovered punct-nonproj errors in Arabic, Bhojpuri, Bulgarian, Cantonese, Chinese, Chukchi, Coptic, Croatian, English, Erzya, Finnish, French, Galician, German, Icelandic, Irish, Italian, Karelian, Komi Permyak, Lithuanian, Livvi, Magahi, Maltese, Moksha, Old Russian, Polish, Romanian, Russian, Scottish Gaelic, Serbian, Slovenian, Spanish, Swedish, Turkish, Urdu and Welsh.

It is however dominated by 26 and I am not sure if some definitions of projectivity wouldn't be satisfied with that.

The standard definition of projectivity is satisfied with that, e.g. according to Sandra Kübler, Ryan McDonald, Joakim Nivre: "an arc in a tree is projective if there is a directed path from the head word to all the words between the two endpoints of the arc." (The book also includes a proper mathematical definition; the same definition can be found in many other papers on projectivity.)

So, in the example above the edge between nodes 4 and 26 is projective because all the nodes within its span (5..25) can be reached (by directed paths) from the head node 26. The fact that node 22 can be reached only via node 1, which is outside the span, is irrelevant.
However, the edge 22-1 is non-projective (with nodes 3..21 in the gap, i.e. these nodes are causing the non-projectivity of the edge 22-1). (Maybe this is what you wanted to write.)
Node 22 also causes a non-projectivity of two other edges: 23-4 and 24-12.
In Udapi, I've decided to report only those PUNCT nodes causing a non-projective gap such that their parent is not causing the gap (similarly to validate.py if I understand get_caused_nonprojectivities correctly).
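A minimal sketch of the standard definition quoted above, assuming a tree stored as a dict from node id to head id (0 is the artificial root). The function names and the five-node tree are hypothetical and are not taken from validate.py or Udapi. The toy tree mirrors the situation discussed here: node 4 lies inside the span of edge 5-2 and is reachable from 5 only via node 1, which is outside the span.

```python
def dominates(heads, ancestor, node):
    """True if there is a directed path from `ancestor` down to `node`."""
    while node != 0:
        node = heads[node]
        if node == ancestor:
            return True
    return False


def is_projective(heads, child):
    """An arc is projective if its head dominates every node in its span."""
    head = heads[child]
    lo, hi = sorted((head, child))
    return all(dominates(heads, head, n) for n in range(lo + 1, hi))


# Hypothetical toy tree: node id -> head id, 0 = artificial root.
heads = {1: 5, 2: 5, 3: 2, 4: 1, 5: 0}

print(is_projective(heads, 2))  # edge 5-2: True, node 4 is reached via 1
print(is_projective(heads, 4))  # edge 1-4: False, nodes 2 and 3 are in the gap
```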

is not considered projective because node 22 lies between 4 and 26 and is neither dominated by 4 nor directly dependent on 26

Such a definition of non-projectivity would be completely wrong: it would consider e.g. edge 18-12 non-projective because node 14 lies between 12 and 18 and is neither dominated by 18 nor directly dependent on 12.

@dan-zeman Fixed in Maltese.

@martinpopel : You are right, my formulation of the reason why edge 26-4 is considered non-projective does not work. It is quite possible that the real reason why the validator reports node 4 as being in the gap is simply another bug, as I normally operate with exactly the definition of non-projectivity that you cite as standard (i.e., all nodes between parent and child must be dominated by parent, period).

However, I don't think I've encountered an example like this in the past, and it bothers me that the standard definition does not consider node 22 as a gap in 26-4. Of course 1-22 is non-projective (and for the sake of validation, we don't need to know more) but since the dependencies are crossing, I would want to mark both of them as non-projective.

In general, you cannot assume that two crossing dependencies are both non-projective. In fact, this happens only in structures that are not well-nested. Otherwise, it is typically only one of the arcs that is non-projective, while the other is an innocent bystander being hit by the crossing non-projective arc. I have not studied the particular example, so I cannot say whether there is a bug, but the fact that only one of two crossing dependencies is marked as non-projective is as it should be.
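The point about crossing arcs can be checked on two small trees. This is only a sketch: both trees are invented for illustration, the helper names are hypothetical, and arcs from the artificial root 0 are ignored when looking for crossings. In the first tree only one of the two crossing arcs is non-projective; in the second, where two disjoint subtrees interleave, both are.

```python
from itertools import combinations


def dominates(heads, ancestor, node):
    """True if there is a directed path from `ancestor` down to `node`."""
    while node != 0:
        node = heads[node]
        if node == ancestor:
            return True
    return False


def is_projective(heads, child):
    head = heads[child]
    lo, hi = sorted((head, child))
    return all(dominates(heads, head, n) for n in range(lo + 1, hi))


def crossing_pairs(heads):
    """Pairs of dependents whose arcs cross (arcs from root 0 are skipped)."""
    spans = {c: tuple(sorted((h, c))) for c, h in heads.items() if h != 0}
    pairs = []
    for a, b in combinations(sorted(spans), 2):
        (l1, r1), (l2, r2) = spans[a], spans[b]
        if l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1:
            pairs.append((a, b))
    return pairs


def report(heads):
    """For each crossing pair, say whether each of the two arcs is projective."""
    return [(a, is_projective(heads, a), b, is_projective(heads, b))
            for a, b in crossing_pairs(heads)]


# Toy tree 1: arc 5-2 is crossed by arc 1-4.
one_culprit = {1: 5, 2: 5, 3: 2, 4: 1, 5: 0}
# Toy tree 2: subtrees {2, 4} and {3, 5} interleave.
interleaved = {1: 0, 2: 1, 3: 5, 4: 2, 5: 1}

print(report(one_culprit))  # [(2, True, 4, False)]
print(report(interleaved))  # [(3, False, 4, False)]
```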

@dan-zeman Fixed in Bulgarian.

In UD_Portuguese-Bosque, we have 162 cases reported by our tool that validate.py does not report as cases of non-projective punctuation or of non-projectivity caused by punctuation.

One particular example is CF2-2: our tool reports that tokens 11 and 7 cause non-projectivity of token 4.

WORKING> (cl-conllu::validate-punct (cadr (read-conllu "documents/CF0002.conllu")))
((#<TOKEN "»" PUNCT #11-punct-8> CL-CONLLU::PUNCT-CAUSES-NONPROJ-OF 4)
 (#<TOKEN "«" PUNCT #7-punct-8> CL-CONLLU::PUNCT-CAUSES-NONPROJ-OF 4))
─┮
 │                                         ╭─╼ Desde ADP case 1 4
 │                                         ├─╼ o DET det 2 4
 │                                         ├─╼ último ADJ amod 3 4
 │                                       ╭─┾ dia NOUN obl 4 14
 │                                       │ ├─╼ 13 NUM nummod 5 4
 │                                       │ ╰─╼ , PUNCT punct 6 4
 │   ╭─╼ « PUNCT punct 7 8               │
 │ ╭─┾ Confissões PROPN nsubj:pass 8 12  │
 │ │ │ ╭─╼ de ADP case 9 10              │
 │ │ ├─┶ Adolescente PROPN nmod 10 8     │
 │ │ ╰─╼ » PUNCT punct 11 8              │
 ╰─┾ pode VERB root 12 0                 │
   │                                     ├─╼ ser AUX aux:pass 13 14
   ╰─────────────────────────────────────┾ vista VERB xcomp 14 12
                                         │ ╭─╼ por ADP case 15 17
                                         │ ├─╼ os DET det 16 17
                                         ├─┾ teens NOUN obl:agent 17 14
                                         │ ╰─╼ portugueses ADJ amod 18 17
                                         ╰─╼ . PUNCT punct 19 14

Maybe we need a better definition of what it means to cause non-projectivity of nodes...

@dan-zeman

It is quite possible that the real reason why the validator reports node 4 as being in the gap is simply another bug

Yes, I fixed it in #67

it bothers me that the standard definition does not consider node 22 as a gap in 26-4.

Why does it bother you? Do you know any example where a node (22) would "cause a non-projectivity according to your definition" (but not according to the standard definition), but would be still projective? I think it is impossible. So these cases will be always reported as errors by validate.py.

@arademaker

One particular example is CF2-2: our tool reports that tokens 11 and 7 cause non-projectivity of token 4.
Maybe we need a better definition of what it means to cause non-projectivity of nodes...

Tokens 11 and 7 in your example are attached correctly, following the guidelines.
Both validate.py and ud.MarkBugs report no errors in your example, so you just need to fix your tool.

Explanation:
The edges from tokens 11 and 7 to their parents are projective.
Tokens 11 and 7 are in a non-projective gap of the edge 4-14, but they are not causing this gap. It is their parent 8 (with all its descendants) that is causing the gap (because it is in the gap, but its parent 12 is not). However, token 8 is PROPN, not PUNCT, so it can cause a non-projective gap without breaking the guidelines.
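The explanation above can be replayed on the CF2-2 tree from this thread. The sketch below is an assumption-laden illustration, not the logic of validate.py or ud.MarkBugs: the function names are hypothetical, the gap is taken to be simply the nodes inside an edge's span that its head does not dominate, and a punctuation node is blamed only if its own parent is outside that gap. The real tools may refine both notions.

```python
def subtree(heads, root):
    """All nodes dominated by `root`, including `root` itself."""
    nodes = {root}
    frontier = [root]
    while frontier:
        parent = frontier.pop()
        for node, head in heads.items():
            if head == parent:
                nodes.add(node)
                frontier.append(node)
    return nodes


def gap(heads, child):
    """Nodes strictly between `child` and its head that the head does not dominate."""
    head = heads[child]
    lo, hi = sorted((head, child))
    return set(range(lo + 1, hi)) - subtree(heads, head)


def blamed_punct(heads, punct, child):
    """PUNCT nodes in the gap whose parent is not in the same gap (assumed rule)."""
    g = gap(heads, child)
    return sorted(p for p in punct & g if heads[p] not in g)


# CF2-2 as shown in the tree above: node id -> head id, 0 = root.
heads = {1: 4, 2: 4, 3: 4, 4: 14, 5: 4, 6: 4, 7: 8, 8: 12, 9: 10, 10: 8,
         11: 8, 12: 0, 13: 14, 14: 12, 15: 17, 16: 17, 17: 14, 18: 17, 19: 14}
punct = {6, 7, 11, 19}

print(sorted(punct & gap(heads, 4)))   # [7, 11]: PUNCT nodes in the gap of 4-14
print(blamed_punct(heads, punct, 4))   # []: their parent 8 is in the gap too
print(gap(heads, 7), gap(heads, 11))   # set() set(): both punct edges are projective
```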

Thank you @martinpopel, yes, your analysis is what I was expecting. Actually, in this case I also think that our tool should not report the error it is reporting. I was just double-checking that we all have a precise (and the same) definition for the warnings produced by validate.py.