acoli-repo/conll-rdf

CoNLLRDFFormatter -conll: order of rows

chiarcos opened this issue · 9 comments

Requirement:
Enforce consistent numerical order of rows in CoNLL export

Description:
Under circumstances that are not yet understood, -conll export resorts to lexicographic order of nif:Words (to be confirmed whether this uses the URI or conll:ID), i.e., 1 10 11 ... 2 ... instead of numerical order 1 2 ... 9 10 11 .... This is reproducible, but it affects only some samples, even though all of them come from the same source corpus (i.e., have the same structure).
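
The difference between the two orderings can be reproduced outside of SPARQL. A minimal Python sketch, assuming the IDs are compared as strings in the failing case (which value is actually compared is still to be confirmed at this point):

```python
# Illustration only: the same IDs sorted as strings and as integers.
ids = [str(i) for i in range(1, 13)]   # "1" .. "12", standing in for conll:ID values

lexicographic = sorted(ids)            # string comparison, character by character
numerical = sorted(ids, key=int)       # integer comparison

print(lexicographic)
print(numerical)
```

The first list starts 1, 10, 11, 12, 2, which is the pattern reported above; the second is the expected order.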

Samples:

Comments:

  • Also note that additional line breaks are introduced in CoNLL export. These should apply to the last row only.
  • Note that the issue does not apply to -grammar or default (RDF) serializations (which seem to follow numerical order consistently).

-conll ID WORD internally results in the following query:

PREFIX nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX conll: <http://ufal.mff.cuni.cz/conll2009-st/task-description.html#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?ID ?WORD {
	SELECT ?sid ?wid (group_concat(?IDs;separator='|') as ?ID) (group_concat(?WORDs;separator='|') as ?WORD)
	WHERE {
		?word a nif:Word .
		{
			SELECT ?word (count(distinct ?preS) as ?sid) (count(distinct ?pre) as ?wid)
			WHERE {
				?word a nif:Word .
				?pre nif:nextWord* ?word .
				?word conll:HEAD+ ?s.
				?s a nif:Sentence .
				?preS nif:nextSentence* ?s .
			}
			GROUP BY ?word
		}
		OPTIONAL {
			?word conll:ID ?ID_raw .
			BIND (str(?ID_raw) as ?IDs)
		} .
		OPTIONAL {
			?word conll:WORD ?WORD_raw .
			BIND (str(?WORD_raw) as ?WORDa)
		} .
		BIND (concat(if(bound(?WORDa),?WORDa,'_'),
		IF (EXISTS { ?word nif:nextWord [] }, '', '\n')) as ?WORDs)
	}
	GROUP BY ?word ?sid ?wid
	ORDER BY ?sid ?wid
}

Appending ORDER BY xsd:integer(?ID) results in correct ordering, but this seems like it could create issues down the line.

OK, apparently there is an issue with ?wid. Either the values are not correct (which should result in an unordered result, not a lexicographically ordered one) or they are implicitly cast to string.

Maybe test the following ORDER BY clause:

ORDER BY xsd:int(?sid) xsd:int(?wid) xsd:int(replace(?ID,'^[^|]*_([0-9]+)([^0-9].*)?$','$1'))

This should fix two possible sources of errors: retyping of ?wid as string (how can that happen?) and resorting to (the first) ?ID if ?wid is undefined (should not happen and is probably slow). However, I don't see how either of these errors can arise in the first place. Are you sure there is no post-ordering after the SPARQL query?
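
To see what the replace() call in that clause would extract, here is a rough Python equivalent. It is only a sketch: Python's re module is not the regex dialect that SPARQL uses, and the sample values such as s2_10 are hypothetical.

```python
import re

# Same pattern as in the ORDER BY clause; '$1' in SPARQL corresponds to r'\1' here.
PATTERN = re.compile(r'^[^|]*_([0-9]+)([^0-9].*)?$')

def sort_key(raw_id: str) -> int:
    """Reduce an ID (or a '|'-joined group of IDs) to an integer sort key."""
    return int(PATTERN.sub(r'\1', raw_id))

print(sort_key("s2_10"))         # URI-like local name (hypothetical value)
print(sort_key("s2_10|s2_10"))   # group_concat-style value; only the first ID is used
print(sort_key("7"))             # no underscore: pattern does not match, value is cast as-is
```

In this sketch, a plain numeric ID passes through the replacement unchanged, so the cast still applies to it.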

I generated the TSV with added sid and wid columns for both of the snippets to see what was happening.

snippet num order

# global.columns = ID WORD sid wid
1	suîðo	2^^http://www.w3.org/2001/XMLSchema#integer	1^^http://www.w3.org/2001/XMLSchema#integer
2	unuuanda	2^^http://www.w3.org/2001/XMLSchema#integer	2^^http://www.w3.org/2001/XMLSchema#integer
3	uuini	2^^http://www.w3.org/2001/XMLSchema#integer	3^^http://www.w3.org/2001/XMLSchema#integer
4	,	2^^http://www.w3.org/2001/XMLSchema#integer	4^^http://www.w3.org/2001/XMLSchema#integer
5	than	2^^http://www.w3.org/2001/XMLSchema#integer	5^^http://www.w3.org/2001/XMLSchema#integer
6	lang	2^^http://www.w3.org/2001/XMLSchema#integer	6^^http://www.w3.org/2001/XMLSchema#integer
7	hie	2^^http://www.w3.org/2001/XMLSchema#integer	7^^http://www.w3.org/2001/XMLSchema#integer
8	giuuald	2^^http://www.w3.org/2001/XMLSchema#integer	8^^http://www.w3.org/2001/XMLSchema#integer
9	êhta	2^^http://www.w3.org/2001/XMLSchema#integer	9^^http://www.w3.org/2001/XMLSchema#integer
10	,	2^^http://www.w3.org/2001/XMLSchema#integer	10^^http://www.w3.org/2001/XMLSchema#integer
11	Erodes	2^^http://www.w3.org/2001/XMLSchema#integer	11^^http://www.w3.org/2001/XMLSchema#integer
12	thes	2^^http://www.w3.org/2001/XMLSchema#integer	12^^http://www.w3.org/2001/XMLSchema#integer
13	rîkeas	2^^http://www.w3.org/2001/XMLSchema#integer	13^^http://www.w3.org/2001/XMLSchema#integer
14	endi	2^^http://www.w3.org/2001/XMLSchema#integer	14^^http://www.w3.org/2001/XMLSchema#integer
15	râdburdeon	2^^http://www.w3.org/2001/XMLSchema#integer	15^^http://www.w3.org/2001/XMLSchema#integer
16	held	2^^http://www.w3.org/2001/XMLSchema#integer	16^^http://www.w3.org/2001/XMLSchema#integer
17	Iudeo	2^^http://www.w3.org/2001/XMLSchema#integer	17^^http://www.w3.org/2001/XMLSchema#integer
18	liudi	2^^http://www.w3.org/2001/XMLSchema#integer	18^^http://www.w3.org/2001/XMLSchema#integer
19	.
	2^^http://www.w3.org/2001/XMLSchema#integer	19^^http://www.w3.org/2001/XMLSchema#integer

snippet lex order

# global.columns = ID WORD sid wid
1	That	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
2	uuolda	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
3	thô	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
4	uuîsara	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
5	filo	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
6	liudo	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
7	barno	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
8	loƀon	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
9	,	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
10	lêra	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
11	Cristes	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
12	,	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
13	hêlag	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
14	uuord	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
15	godas	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
16	,	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
17	endi	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
18	mid	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
19	iro	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
20	handon	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
21	scrîƀan	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
22	berehtlîco	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
23	an	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
24	buok	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
25	,	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
26	huô	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
27	sia	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
28	is	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
29	gibodscip	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
30	scoldin	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
31	frummian	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
32	,	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
33	firiho	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
34	barn	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
35	.
	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer

The same after applying the modification suggested above instead:

# global.columns = ID WORD sid wid
1	suîðo	2^^http://www.w3.org/2001/XMLSchema#integer	1^^http://www.w3.org/2001/XMLSchema#integer
2	unuuanda	2^^http://www.w3.org/2001/XMLSchema#integer	2^^http://www.w3.org/2001/XMLSchema#integer
3	uuini	2^^http://www.w3.org/2001/XMLSchema#integer	3^^http://www.w3.org/2001/XMLSchema#integer
4	,	2^^http://www.w3.org/2001/XMLSchema#integer	4^^http://www.w3.org/2001/XMLSchema#integer
5	than	2^^http://www.w3.org/2001/XMLSchema#integer	5^^http://www.w3.org/2001/XMLSchema#integer
6	lang	2^^http://www.w3.org/2001/XMLSchema#integer	6^^http://www.w3.org/2001/XMLSchema#integer
7	hie	2^^http://www.w3.org/2001/XMLSchema#integer	7^^http://www.w3.org/2001/XMLSchema#integer
8	giuuald	2^^http://www.w3.org/2001/XMLSchema#integer	8^^http://www.w3.org/2001/XMLSchema#integer
9	êhta	2^^http://www.w3.org/2001/XMLSchema#integer	9^^http://www.w3.org/2001/XMLSchema#integer
10	,	2^^http://www.w3.org/2001/XMLSchema#integer	10^^http://www.w3.org/2001/XMLSchema#integer
11	Erodes	2^^http://www.w3.org/2001/XMLSchema#integer	11^^http://www.w3.org/2001/XMLSchema#integer
12	thes	2^^http://www.w3.org/2001/XMLSchema#integer	12^^http://www.w3.org/2001/XMLSchema#integer
13	rîkeas	2^^http://www.w3.org/2001/XMLSchema#integer	13^^http://www.w3.org/2001/XMLSchema#integer
14	endi	2^^http://www.w3.org/2001/XMLSchema#integer	14^^http://www.w3.org/2001/XMLSchema#integer
15	râdburdeon	2^^http://www.w3.org/2001/XMLSchema#integer	15^^http://www.w3.org/2001/XMLSchema#integer
16	held	2^^http://www.w3.org/2001/XMLSchema#integer	16^^http://www.w3.org/2001/XMLSchema#integer
17	Iudeo	2^^http://www.w3.org/2001/XMLSchema#integer	17^^http://www.w3.org/2001/XMLSchema#integer
18	liudi	2^^http://www.w3.org/2001/XMLSchema#integer	18^^http://www.w3.org/2001/XMLSchema#integer
19	.
	2^^http://www.w3.org/2001/XMLSchema#integer	19^^http://www.w3.org/2001/XMLSchema#integer
# global.columns = ID WORD sid wid
1	That	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
2	uuolda	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
3	thô	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
4	uuîsara	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
5	filo	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
6	liudo	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
7	barno	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
8	loƀon	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
9	,	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
10	lêra	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
11	Cristes	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
12	,	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
13	hêlag	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
14	uuord	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
15	godas	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
16	,	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
17	endi	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
18	mid	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
19	iro	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
20	handon	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
21	scrîƀan	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
22	berehtlîco	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
23	an	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
24	buok	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
25	,	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
26	huô	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
27	sia	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
28	is	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
29	gibodscip	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
30	scoldin	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
31	frummian	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
32	,	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
33	firiho	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
34	barn	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer
35	.
	0^^http://www.w3.org/2001/XMLSchema#integer	0^^http://www.w3.org/2001/XMLSchema#integer

It appears the ?wid value is still the same for every word in the lex order example; however, the ordering is correct even with the outer ORDER BY ?ID removed (and I don't yet understand why that would be).
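
One possible reading of the tables above, sketched in Python: if every row has the same ?sid and ?wid, the ORDER BY key cannot distinguish rows, and the output order is whatever order the rows arrived in. The arrival order below is made up for illustration, and Python's stable sort is only an analogy; a SPARQL engine makes no promise about the order of tied rows.

```python
# Rows as (sid, wid, id, word); sid and wid are 0 for every row, as in the lex order table.
rows_in_arrival_order = [
    (0, 0, "1", "That"),
    (0, 0, "10", "lêra"),
    (0, 0, "11", "Cristes"),
    (0, 0, "2", "uuolda"),
]

# Sorting by (sid, wid) leaves tied rows in arrival order.
by_sid_wid = sorted(rows_in_arrival_order, key=lambda r: (r[0], r[1]))

# Adding the numeric ID as a tie-breaker restores numerical order.
with_tiebreak = sorted(rows_in_arrival_order, key=lambda r: (r[0], r[1], int(r[2])))

print([r[2] for r in by_sid_wid])
print([r[2] for r in with_tiebreak])
```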

Additionally, a "." token seems to cause a newline to be inserted, which it probably shouldn't?

Ok, so apparently, the following sub-query fails:

			?word a nif:Word .
			?pre nif:nextWord* ?word .
			?word conll:HEAD+ ?s.
			?s a nif:Sentence .
			?preS nif:nextSentence* ?s .

Maybe put this into two separate sub-selects, one for ?preS (last three lines) and one for ?pre (first two lines), so that they are executed independently. If that doesn't work, one of the properties isn't there and we must explore how it got lost. If the order is preserved despite these queries failing, this is basically by chance; there is no guarantee of reproducibility.

A "." creating a newline is not the desired behaviour. If that isn't from the data, this may be an artifact of the default (RDF) serialization, because there, it is desired.

The ordering is failing for the snippet-lex-order.ttl file because the root element has the relation :s2_8 conll:HEAD "0" . (string literal 0) instead of :s2_8 conll:HEAD :s2_0 .
I'm unsure about the intended behaviour in cases like this one, but rewriting the query to rely on (or fall back to) ?word conll:HEAD*/conll:EDGE "root" should fix the word ordering in this case? Wrapping the sentence-id triples in an OPTIONAL seems to fix the word ordering. (Unless it should fail if the sentence node and the words are disjoint?)
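
The failure can be pictured as a broken chain. A rough Python sketch, using a hypothetical in-memory map in place of the RDF graph (only :s2_8 and :s2_0 are named here; :s2_1 is invented for the example): following conll:HEAD from a word only reaches the sentence node if every link is a URI.

```python
def reaches_sentence(word, head_of, sentences):
    """Follow HEAD links from `word`; return True if a sentence node is reached."""
    seen = set()
    current = word
    while current in head_of and current not in seen:
        seen.add(current)
        current = head_of[current]
        if current in sentences:
            return True
    return False

sentences = {":s2_0"}

# Intended data: the root word points at the sentence node.
good = {":s2_1": ":s2_8", ":s2_8": ":s2_0"}
# Failing sample: the root word points at the string literal "0".
bad = {":s2_1": ":s2_8", ":s2_8": '"0"'}

print(reaches_sentence(":s2_1", good, sentences))
print(reaches_sentence(":s2_1", bad, sentences))
```

With the literal in place, no word in the sentence can reach a nif:Sentence node, which is consistent with ?sid and ?wid coming out as 0 for every row.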

Intended behavior here is a warning ("conll:HEAD containing literal, not word URI" or the like) and wrapping the sentence IDs in OPTIONAL. CoNLL-RDF should process one sentence at a time, so, normally, this should not create any problems. It is a bug, though, and if somebody uses SPARQL to split sentences, this will mess up sentence order.
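
A minimal sketch of that behaviour in Python, not the project's actual implementation: the triple representation and both function names are hypothetical, and the dictionary lookup only mimics what OPTIONAL does in the query (the word stays in the result even when no sentence ID can be computed).

```python
import warnings

def check_heads(triples):
    """Warn about conll:HEAD values that are literals; return the affected words.

    `triples` is a hypothetical list of (subject, predicate, object, object_is_literal).
    """
    offenders = [s for (s, p, o, is_literal) in triples
                 if p == "conll:HEAD" and is_literal]
    for word in offenders:
        warnings.warn(f"conll:HEAD containing literal, not word URI: {word}")
    return offenders

def sentence_index(word, sid_of):
    """Mimic OPTIONAL: a missing sentence ID yields None instead of dropping the word."""
    return sid_of.get(word)

triples = [
    (":s2_1", "conll:HEAD", ":s2_8", False),
    (":s2_8", "conll:HEAD", "0", True),
]
print(check_heads(triples))
```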

The insertion of newlines does not happen with the current release. It's probably an artifact of how I set up my test.
(No action necessary)

Closed for inactivity. Please revive if this is reported again.