FoLiA nodes with 'mixed' structure
kosloot opened this issue · 7 comments
Consider this example:
<?xml version='1.0' encoding='utf-8'?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" version="1.5.1" xml:id="page" generator="pynlpl.formats.folia-v1.5.1.88">
<metadata type="native">
<annotations>
<token-annotation annotator="ucto" annotatortype="auto" datetime="2017-10-01T17:33:00" set="tokconfig-nld"/>
</annotations>
<meta id="language">nld</meta>
</metadata>
<text xml:id="text">
<s xml:id="s.1"><t>test twee</t></s>
<p xml:id="p1">
<w xml:id="w.1">
<t>test</t>
</w>
<w xml:id="w.2">
<t>aha</t>
</w>
<s xml:id="s.2">
<t>Een brief voor de koning.</t>
</s>
</p>
</text>
</FoLiA>
At the moment Frog will ignore the two words in the paragraph and only handle the sentence within.
This is questionable.
But if we do want to handle those two loose words, what is the desired behaviour? Should we create a sentence out of them, or leave them separate?
This also involves Ucto, as that is used to create the sentences (though not in the new Frog implementation we are working on).
I just tested this, and the "problem" still exists. Frog will ignore the words test and aha.
@proycon can we decide on this? Or just leave it as an oddity, due to "someone" creating stupid FoLiA?
Technically, ignoring the words is wrong. They are part of the text, just not grouped in a sentence. It may be weird and inconsistent, but it's not invalid FoLiA. It's perfectly okay, though, if Frog decides not to support this; I'd suggest exiting with an error if it encounters this pattern. (Not really a priority, though.)
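To make the suggested check concrete, here is a minimal sketch of detecting the pattern with Python's standard `xml.etree.ElementTree`. It is not Frog's code (Frog is not written in Python), and `MixedStructureError`, `find_loose_words` and `check_document` are hypothetical names made up for this illustration. It assumes the pattern of interest is a `<p>` that has both `<w>` and `<s>` elements as direct children.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Trimmed-down version of the example document from this issue.
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="page">
  <text xml:id="text">
    <p xml:id="p1">
      <w xml:id="w.1"><t>test</t></w>
      <w xml:id="w.2"><t>aha</t></w>
      <s xml:id="s.2"><t>Een brief voor de koning.</t></s>
    </p>
  </text>
</FoLiA>"""


class MixedStructureError(ValueError):
    """Hypothetical error for paragraphs mixing loose words and sentences."""


def find_loose_words(root):
    """Return (paragraph id, word id) pairs for every <w> that is a direct
    child of a <p> which also has <s> children."""
    loose = []
    for p in root.iter(f"{{{FOLIA_NS}}}p"):
        words = p.findall(f"{{{FOLIA_NS}}}w")      # direct children only
        sentences = p.findall(f"{{{FOLIA_NS}}}s")  # direct children only
        if words and sentences:
            loose.extend((p.get(XML_ID), w.get(XML_ID)) for w in words)
    return loose


def check_document(xml_text):
    """Parse the document and refuse it if the mixed pattern occurs."""
    root = ET.fromstring(xml_text)
    loose = find_loose_words(root)
    if loose:
        raise MixedStructureError(f"loose words next to sentences: {loose}")
    return root
```

Calling `check_document(SAMPLE)` raises `MixedStructureError`, naming `w.1` and `w.2` in paragraph `p1`.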
Yes, it is valid, though weird, FoLiA.
Detecting this and generating an error is probably the best option indeed.
Actually processing this is really cumbersome: it would imply inserting a new Sentence BEFORE the current Sentence in the paragraph, with id naming problems and such. It MUST be possible, but it is not worthwhile, I suppose.
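For illustration only, this is roughly what "inserting a new Sentence before the current one" could look like, again using Python's standard `xml.etree.ElementTree` rather than Frog's actual code. The function name `wrap_loose_words` and the id scheme `<paragraph id>.s.auto.<n>` are assumptions invented for this sketch; the id scheme is exactly the kind of naming decision that makes this awkward. The sketch does not generate a `<t>` text element for the new sentence.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
P, S, W = (f"{{{FOLIA_NS}}}{name}" for name in ("p", "s", "w"))

# Trimmed-down version of the example document from this issue.
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="page">
  <text xml:id="text">
    <p xml:id="p1">
      <w xml:id="w.1"><t>test</t></w>
      <w xml:id="w.2"><t>aha</t></w>
      <s xml:id="s.2"><t>Een brief voor de koning.</t></s>
    </p>
  </text>
</FoLiA>"""


def wrap_loose_words(root):
    """Wrap each run of consecutive loose <w> children of a mixed <p> in a
    newly created <s>. Returns the ids generated for the new sentences."""
    used = {el.get(XML_ID) for el in root.iter() if el.get(XML_ID) is not None}
    generated = []
    for p in list(root.iter(P)):  # snapshot: we modify the tree below
        children = list(p)
        has_words = any(c.tag == W for c in children)
        has_sentences = any(c.tag == S for c in children)
        if not (has_words and has_sentences):
            continue  # only touch paragraphs with the mixed pattern
        rebuilt, current, n = [], None, 0
        for child in children:
            if child.tag == W:
                if current is None:
                    # Hypothetical id scheme; skip ids that already exist.
                    n += 1
                    new_id = f"{p.get(XML_ID)}.s.auto.{n}"
                    while new_id in used:
                        n += 1
                        new_id = f"{p.get(XML_ID)}.s.auto.{n}"
                    used.add(new_id)
                    current = ET.Element(S, {XML_ID: new_id})
                    rebuilt.append(current)
                    generated.append(new_id)
                current.append(child)
            else:
                current = None  # a non-word child ends the current run
                rebuilt.append(child)
        p[:] = rebuilt  # replace the paragraph's children in document order
    return generated
```

On the sample, the two words end up inside one new sentence placed before `s.2`. The generated id has no relation to the existing `s.1` / `s.2` numbering, which is the surprising part.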
We had code to ignore this silently. But from now on we will throw an exception.
OK, I solved it. But the extra generated Sentence gets an xml:id that may be surprising.
I still have to look into that.
That solution was way too naive.
Reverted to the throw-it-in-your-face solution.
So we leave it for now.