CoNLLRDFFormatter -conll: order of rows
Closed this issue · 9 comments
Requirement:
Enforce consistent numerical order of rows in CoNLL export
Description:
Under uncertain circumstances, -conll export resorts to lexicographic order of nif:Words (tbc. whether this uses URI or conll:ID), i.e., 1 10 11 ... 2 ... instead of numerical order 1 2 ... 9 10 11 ... This is reproducible, but it affects only some samples, even though they come from the same source corpus (i.e., have the same structure).
Samples:
- snippet-lex-order.zip
- snippet-num-order.zip
- Test with
cat $MY_FILE | ./run.sh CoNLLRDFFormatter -conll ID WORD
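The two orderings in the description can be reproduced outside CoNLL-RDF. A minimal Python sketch (not part of the formatter; the ID list is made up for illustration) that sorts the same IDs once as strings and once as integers:

```python
# Illustrative IDs only; they are not taken from the attached snippets.
ids = [str(i) for i in range(1, 13)]

# Sorting the strings compares character by character (lexicographic order).
lex_order = sorted(ids)

# Sorting by integer value gives the numerical order expected in CoNLL.
num_order = sorted(ids, key=int)

print(" ".join(lex_order))  # 1 10 11 12 2 3 4 5 6 7 8 9
print(" ".join(num_order))  # 1 2 3 4 5 6 7 8 9 10 11 12
```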
Comments:
- Also note that additional line breaks are introduced in CoNLL export. These should apply to the last row only.
- Note that the issue does not apply to -grammar or default (RDF) serializations (which seem to follow numerical order consistently).
-conll ID WORD internally results in the query:
PREFIX nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX conll: <http://ufal.mff.cuni.cz/conll2009-st/task-description.html#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?ID ?WORD {
  SELECT ?sid ?wid (group_concat(?IDs;separator='|') as ?ID) (group_concat(?WORDs;separator='|') as ?WORD)
  WHERE {
    ?word a nif:Word .
    {
      SELECT ?word (count(distinct ?preS) as ?sid) (count(distinct ?pre) as ?wid)
      WHERE {
        ?word a nif:Word .
        ?pre nif:nextWord* ?word .
        ?word conll:HEAD+ ?s .
        ?s a nif:Sentence .
        ?preS nif:nextSentence* ?s .
      }
      GROUP BY ?word
    }
    OPTIONAL {
      ?word conll:ID ?ID_raw .
      BIND (str(?ID_raw) as ?IDs)
    } .
    OPTIONAL {
      ?word conll:WORD ?WORD_raw .
      BIND (str(?WORD_raw) as ?WORDa)
    } .
    BIND (concat(if(bound(?WORDa),?WORDa,'_'),
                 IF (EXISTS { ?word nif:nextWord [] }, '', '\n')) as ?WORDs)
  }
  GROUP BY ?word ?sid ?wid
  ORDER BY ?sid ?wid
}
Appending ORDER BY xsd:integer(?ID) results in correct ordering, but this seems like it could create issues down the line.
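For reference, the inner sub-select derives ?wid by counting every node that reaches the word through zero or more nif:nextWord steps, so the first word gets 1, the second 2, and so on. A minimal Python sketch of that counting idea, assuming a linear chain in which each word has at most one predecessor; the helper count_predecessors and the names w1..w3 are hypothetical and only cover the ?wid half of the pattern:

```python
def count_predecessors(node, next_of):
    """Count nodes that reach `node` via zero or more `next_of` steps.

    This is the analogue of count(distinct ?pre) over ?pre nif:nextWord* ?word.
    """
    # Invert the chain; this assumes each word has at most one predecessor.
    prev_of = {nxt: cur for cur, nxt in next_of.items()}
    count = 1  # the zero-length path: the node itself
    seen = {node}
    while node in prev_of and prev_of[node] not in seen:
        node = prev_of[node]
        seen.add(node)
        count += 1
    return count

# Hypothetical three-word sentence: w1 -> w2 -> w3
next_word = {"w1": "w2", "w2": "w3"}
wids = {w: count_predecessors(w, next_word) for w in ("w1", "w2", "w3")}
print(wids)  # {'w1': 1, 'w2': 2, 'w3': 3}
```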
OK, apparently there is an issue with ?wid. Either the values are not correct (which should result in an unordered result, not a lexicographically ordered one), or they are implicitly cast to string.
Maybe test the following ORDER BY clause:
ORDER BY xsd:int(?sid) xsd:int(?wid) xsd:int(replace(?ID,'^[^|]*_([0-9]+)([^0-9].*)?$','$1'))
This should fix two possible sources of error: retyping of ?wid as string (how can that happen?) and falling back to (the first) ?ID if ?wid is undefined (which should not happen and is probably slow). However, I don't see how either of these errors can arise in the first place. Are you sure there is no post-ordering after the SPARQL query?
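To make the third sort key easier to check, here is a Python sketch of what the replace() pattern extracts. It reuses the regular expression from the clause above; Python writes the back-reference as \1 where SPARQL uses $1. The helper first_id_number and the sample values are hypothetical:

```python
import re

# Same pattern as in the suggested ORDER BY clause.
ID_PATTERN = re.compile(r"^[^|]*_([0-9]+)([^0-9].*)?$")

def first_id_number(id_value: str) -> int:
    """Return the number following an underscore in the first '|'-separated ID.

    Values the pattern does not match (e.g. a plain "7") are left unchanged by
    the substitution and converted as they are.
    """
    return int(ID_PATTERN.sub(r"\1", id_value))

# Hypothetical ID values; "s2_8" mimics the URI local names seen in this thread.
print(first_id_number("s2_8"))         # 8
print(first_id_number("s2_10|s2_11"))  # 10
print(first_id_number("7"))            # 7
print(sorted(["s2_10", "s2_2", "s2_1"], key=first_id_number))
```

A value with no digits at all would raise ValueError in this sketch, so it is only meant for inspecting the pattern, not as a replacement for the query.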
I generated the TSV with added sid and wid columns for both of the snippets to see what was happening.
snippet num order
# global.columns = ID WORD sid wid
1 suîðo 2^^http://www.w3.org/2001/XMLSchema#integer 1^^http://www.w3.org/2001/XMLSchema#integer
2 unuuanda 2^^http://www.w3.org/2001/XMLSchema#integer 2^^http://www.w3.org/2001/XMLSchema#integer
3 uuini 2^^http://www.w3.org/2001/XMLSchema#integer 3^^http://www.w3.org/2001/XMLSchema#integer
4 , 2^^http://www.w3.org/2001/XMLSchema#integer 4^^http://www.w3.org/2001/XMLSchema#integer
5 than 2^^http://www.w3.org/2001/XMLSchema#integer 5^^http://www.w3.org/2001/XMLSchema#integer
6 lang 2^^http://www.w3.org/2001/XMLSchema#integer 6^^http://www.w3.org/2001/XMLSchema#integer
7 hie 2^^http://www.w3.org/2001/XMLSchema#integer 7^^http://www.w3.org/2001/XMLSchema#integer
8 giuuald 2^^http://www.w3.org/2001/XMLSchema#integer 8^^http://www.w3.org/2001/XMLSchema#integer
9 êhta 2^^http://www.w3.org/2001/XMLSchema#integer 9^^http://www.w3.org/2001/XMLSchema#integer
10 , 2^^http://www.w3.org/2001/XMLSchema#integer 10^^http://www.w3.org/2001/XMLSchema#integer
11 Erodes 2^^http://www.w3.org/2001/XMLSchema#integer 11^^http://www.w3.org/2001/XMLSchema#integer
12 thes 2^^http://www.w3.org/2001/XMLSchema#integer 12^^http://www.w3.org/2001/XMLSchema#integer
13 rîkeas 2^^http://www.w3.org/2001/XMLSchema#integer 13^^http://www.w3.org/2001/XMLSchema#integer
14 endi 2^^http://www.w3.org/2001/XMLSchema#integer 14^^http://www.w3.org/2001/XMLSchema#integer
15 râdburdeon 2^^http://www.w3.org/2001/XMLSchema#integer 15^^http://www.w3.org/2001/XMLSchema#integer
16 held 2^^http://www.w3.org/2001/XMLSchema#integer 16^^http://www.w3.org/2001/XMLSchema#integer
17 Iudeo 2^^http://www.w3.org/2001/XMLSchema#integer 17^^http://www.w3.org/2001/XMLSchema#integer
18 liudi 2^^http://www.w3.org/2001/XMLSchema#integer 18^^http://www.w3.org/2001/XMLSchema#integer
19 .
2^^http://www.w3.org/2001/XMLSchema#integer 19^^http://www.w3.org/2001/XMLSchema#integer
snippet lex order
# global.columns = ID WORD sid wid
1 That 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
2 uuolda 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
3 thô 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
4 uuîsara 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
5 filo 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
6 liudo 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
7 barno 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
8 loƀon 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
9 , 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
10 lêra 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
11 Cristes 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
12 , 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
13 hêlag 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
14 uuord 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
15 godas 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
16 , 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
17 endi 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
18 mid 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
19 iro 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
20 handon 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
21 scrîƀan 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
22 berehtlîco 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
23 an 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
24 buok 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
25 , 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
26 huô 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
27 sia 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
28 is 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
29 gibodscip 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
30 scoldin 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
31 frummian 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
32 , 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
33 firiho 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
34 barn 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
35 .
0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
The same after applying the ORDER BY modification suggested above instead:
# global.columns = ID WORD sid wid
1 suîðo 2^^http://www.w3.org/2001/XMLSchema#integer 1^^http://www.w3.org/2001/XMLSchema#integer
2 unuuanda 2^^http://www.w3.org/2001/XMLSchema#integer 2^^http://www.w3.org/2001/XMLSchema#integer
3 uuini 2^^http://www.w3.org/2001/XMLSchema#integer 3^^http://www.w3.org/2001/XMLSchema#integer
4 , 2^^http://www.w3.org/2001/XMLSchema#integer 4^^http://www.w3.org/2001/XMLSchema#integer
5 than 2^^http://www.w3.org/2001/XMLSchema#integer 5^^http://www.w3.org/2001/XMLSchema#integer
6 lang 2^^http://www.w3.org/2001/XMLSchema#integer 6^^http://www.w3.org/2001/XMLSchema#integer
7 hie 2^^http://www.w3.org/2001/XMLSchema#integer 7^^http://www.w3.org/2001/XMLSchema#integer
8 giuuald 2^^http://www.w3.org/2001/XMLSchema#integer 8^^http://www.w3.org/2001/XMLSchema#integer
9 êhta 2^^http://www.w3.org/2001/XMLSchema#integer 9^^http://www.w3.org/2001/XMLSchema#integer
10 , 2^^http://www.w3.org/2001/XMLSchema#integer 10^^http://www.w3.org/2001/XMLSchema#integer
11 Erodes 2^^http://www.w3.org/2001/XMLSchema#integer 11^^http://www.w3.org/2001/XMLSchema#integer
12 thes 2^^http://www.w3.org/2001/XMLSchema#integer 12^^http://www.w3.org/2001/XMLSchema#integer
13 rîkeas 2^^http://www.w3.org/2001/XMLSchema#integer 13^^http://www.w3.org/2001/XMLSchema#integer
14 endi 2^^http://www.w3.org/2001/XMLSchema#integer 14^^http://www.w3.org/2001/XMLSchema#integer
15 râdburdeon 2^^http://www.w3.org/2001/XMLSchema#integer 15^^http://www.w3.org/2001/XMLSchema#integer
16 held 2^^http://www.w3.org/2001/XMLSchema#integer 16^^http://www.w3.org/2001/XMLSchema#integer
17 Iudeo 2^^http://www.w3.org/2001/XMLSchema#integer 17^^http://www.w3.org/2001/XMLSchema#integer
18 liudi 2^^http://www.w3.org/2001/XMLSchema#integer 18^^http://www.w3.org/2001/XMLSchema#integer
19 .
2^^http://www.w3.org/2001/XMLSchema#integer 19^^http://www.w3.org/2001/XMLSchema#integer
# global.columns = ID WORD sid wid
1 That 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
2 uuolda 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
3 thô 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
4 uuîsara 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
5 filo 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
6 liudo 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
7 barno 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
8 loƀon 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
9 , 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
10 lêra 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
11 Cristes 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
12 , 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
13 hêlag 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
14 uuord 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
15 godas 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
16 , 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
17 endi 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
18 mid 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
19 iro 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
20 handon 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
21 scrîƀan 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
22 berehtlîco 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
23 an 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
24 buok 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
25 , 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
26 huô 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
27 sia 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
28 is 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
29 gibodscip 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
30 scoldin 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
31 frummian 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
32 , 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
33 firiho 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
34 barn 0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
35 .
0^^http://www.w3.org/2001/XMLSchema#integer 0^^http://www.w3.org/2001/XMLSchema#integer
It appears the ?wid value is still always the same for every word in the lex order example; however, the ordering is correct even with the outer ORDER BY ?ID removed (and I don't yet understand how that would be).
Additionally, the "." token seems to cause a newline to be inserted, which it probably shouldn't?
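One possible explanation, based only on the query quoted earlier in this thread: the BIND appends '\n' to the WORD value of any word that has no nif:nextWord, so the break may be tied to the word being last rather than to "." itself. As soon as further columns (here sid and wid) follow WORD, that line break lands in the middle of the row. A minimal Python sketch of the effect, with made-up rows and a hypothetical helper format_row:

```python
def format_row(word_id, word, has_next, extra_columns=()):
    """Mimic the query's BIND: append a line break to WORD when no word follows."""
    word_cell = word + ("" if has_next else "\n")
    return "\t".join([word_id, word_cell, *extra_columns])

# Only ID and WORD: the appended newline ends up at the end of the row.
print(repr(format_row("19", ".", has_next=False)))
# With sid/wid appended, the same newline splits the row in two.
print(repr(format_row("19", ".", has_next=False, extra_columns=("2", "19"))))
```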
Ok, so apparently, the following sub-query fails:
?word a nif:Word .
?pre nif:nextWord* ?word .
?word conll:HEAD+ ?s.
?s a nif:Sentence .
?preS nif:nextSentence* ?s .
Maybe put this into two separate sub-selects, one for ?preS (last three lines) and one for ?pre (first two lines), so that they are executed independently. If that doesn't work, one of the properties isn't there and we must explore how it is possible that it got lost. If the order is preserved despite these queries failing, this is basically by chance; there is no guarantee of reproducibility.
A "." creating a newline is not the desired behaviour. If that isn't from the data, this may be an artifact of the default (RDF) serialization, because there it is desired.
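To illustrate why an order that survives identical sort keys cannot be relied on: when every row carries the same (?sid, ?wid) pair, as in the lex-order table above where both are 0, sorting on those keys decides nothing, and the result simply reflects the order in which the rows arrived. A Python sketch with made-up rows (Python's sort is stable; as far as I know, SPARQL does not prescribe how ties under ORDER BY are ordered, so an engine may return them differently):

```python
# Rows as (sid, wid, id); sid and wid are 0 everywhere, as in the failing case.
rows_a = [(0, 0, "1"), (0, 0, "2"), (0, 0, "10")]
rows_b = [(0, 0, "1"), (0, 0, "10"), (0, 0, "2")]  # same rows, other arrival order

def sort_key(row):
    sid, wid, _ = row
    return (sid, wid)

def order_by_sid_wid(rows):
    """Sort on (sid, wid) only and return the ID column."""
    return [row[2] for row in sorted(rows, key=sort_key)]

print(order_by_sid_wid(rows_a))  # ['1', '2', '10']
print(order_by_sid_wid(rows_b))  # ['1', '10', '2']
```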
The ordering is failing for the snippet-lex-order.ttl file because the root element has the relation :s2_8 conll:HEAD "0" . (string literal 0) instead of :s2_8 conll:HEAD :s2_0 .
I'm unsure about the intended behaviour in cases like this one, but wrapping the sentence-ID triples in an OPTIONAL seems to fix the word ordering (unless it should fail if the sentence node and the words are disjoint?). Alternatively, would rewriting the query to rely on (or fall back to) ?word conll:HEAD*/conll:EDGE "root" fix the word ordering in this case?
Intended behavior here is a warning ("conll:HEAD containing literal, not word URI" or the like) and wrapping the sentence IDs in OPTIONAL. CoNLL-RDF should process one sentence at a time, so, normally, this should not create any problems. It is a bug, though, and if somebody uses SPARQL to split sentences, this will mess up sentence order.
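The intended check can be sketched independently of the formatter. The following Python is an assumption-laden illustration, not CoNLL-RDF code: the helper sentence_for, the dict-based graph, and the word :s2_1 are hypothetical, while :s2_8, :s2_0 and the literal "0" come from the report above. It follows conll:HEAD links, returns None instead of failing when no sentence node is reached, and emits a warning:

```python
import warnings

def sentence_for(word, head_of, sentences):
    """Follow conll:HEAD links from `word` until a sentence node is reached.

    Returns the sentence node, or None (with a warning) when the chain ends in
    something that is not a known node, e.g. the string literal "0".
    """
    seen = set()
    node = word
    while node in head_of and node not in seen:
        seen.add(node)
        node = head_of[node]
        if node in sentences:
            return node
    warnings.warn(
        f"conll:HEAD chain of {word} does not reach a nif:Sentence "
        f"(stopped at {node!r}); is conll:HEAD a literal instead of a URI?"
    )
    return None

sentences = {":s2_0"}
expected = {":s2_1": ":s2_8", ":s2_8": ":s2_0"}  # root points at the sentence
broken = {":s2_1": ":s2_8", ":s2_8": '"0"'}      # root points at a literal

print(sentence_for(":s2_1", expected, sentences))  # :s2_0
print(sentence_for(":s2_1", broken, sentences))    # None, plus a warning
```

Returning None rather than raising mirrors the OPTIONAL idea: word order can still be computed when the sentence link is missing.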
The insertion of newlines does not happen with the current release. It's probably an artifact of how I set up my test.
(No action necessary)
Closed for inactivity. Please revive if reported again.