Value of TextContent disappears (empty string) upon add?
proycon opened this issue · 7 comments
Something goes wrong when I add TextContent with the value eologico*phijsico*metaphijsicum: libfolia adds an empty text content element instead! I've no idea what triggers this (special meaning for the asterisk, perhaps?); other words process fine.
I add TextContent as follows:
https://github.com/LanguageMachines/foliautils/blob/wordtranslate/src/FoLiA-wordtranslate.cxx#L134
Debug output; I explicitly check that I'm not passing an empty string (even after trimming):
$ FoLiA-wordtranslate --outputclass contemporary -d lexicon.1637-2010.250.lexserv.vandale.tsv -p preservation2010.txt -r rules.machine aa__001biog01_01.tok.folia.xml
Loading dictionary...
Loading preserve lexicon...
Loading rules...
DEBUG: target before sanity check 'eologico*phijsico*metaphijsicum'
DEBUG: target after sanity check 'eologico*phijsico*metaphijsicum'
DEBUG: text after adding textcontent ''
finished aa__001biog01_01.tok.folia.xml
Now I can't reproduce the above debug output anymore (the text content shows fine), but the serialisation to XML still has an empty text:
DEBUG: -- BEFORE APPEND --
DEBUG: target: 'eologico*phijsico*metaphijsicum'
DEBUG: after unicode encoding and decoding 'eologico*phijsico*metaphijsicum'
DEBUG: from textcontent 'eologico*phijsico*metaphijsicum'
DEBUG: length from textcontent 33
DEBUG: -- AFTER APPEND --
DEBUG: from textcontent 'eologico*phijsico*metaphijsicum'
DEBUG: length from textcontent 33
DEBUG: from word 'eologico*phijsico*metaphijsicum'
<w xml:id="aa__001biog01_01.TEI.2.text.body.div.p.10802.s.1.w.2" class="WORD-COMPOUND">
<t>theologico-physico-metaphysicum</t>
<t class="contemporary"></t>
<metric class="modernisationsource" value="rules"/>
Ok, the following debug shows the problem, still no idea why though:
DEBUG: target: 'eologico*phijsico*metaphijsicum'
DEBUG: from textcontent 'eologico*phijsico*metaphijsicum'
DEBUG: length from textcontent 33
DEBUG: from word 'eologico*phijsico*metaphijsicum'
DEBUG: XML serialisation '<w xmlns="http://ilk.uvt.nl/folia" xml:id="aa__001biog01_01.TEI.2.text.body.div.p.10802.s.1.w.2" class="WORD-COMPOUND"><t>theologico-physico-metaphysicum</t><t class="contemporary"></t></w>'
Debug code is committed: https://github.com/LanguageMachines/foliautils/blob/wordtranslate/src/FoLiA-wordtranslate.cxx#L143
It also fails on the following input words, which my tool probably mangles into asterisks as well (incorrectly, but that's not the issue here):
- Onderwierum-en-Westerdijkshorn
- Hollandsch-Hoogduitsch-Israelitische
- Arrondissements-kiescollegie
- Kollumerland-en-Nieuw-Kruisland
Hah, there seem to be two 0x00 bytes in front of the string! That would explain things. I should have counted the characters better :)
Conclusion: this seems to happen when the string contains invalid characters. I think it would be helpful if libfolia could catch this and output a warning when text is appended, provided the check is not too expensive.
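For the specific case found here (0x00 bytes), a caller-side guard is cheap: one linear scan of the string. A minimal sketch, assuming the text is held in a std::string; `has_nul_byte` and `strip_nul_bytes` are hypothetical helper names and not part of the libfolia API:

```cpp
#include <algorithm>
#include <iostream>
#include <string>

// Hypothetical helper (not libfolia API): true if s contains a 0x00 byte.
bool has_nul_byte(const std::string& s) {
    return s.find('\0') != std::string::npos;
}

// Hypothetical guard to call before appending text: warns on stderr and
// returns a copy of s with every 0x00 byte removed.
std::string strip_nul_bytes(std::string s) {
    if (has_nul_byte(s)) {
        std::cerr << "WARNING: text contains 0x00 bytes; removing them\n";
        // erase-remove idiom: drop all NUL bytes, keep everything else in order
        s.erase(std::remove(s.begin(), s.end(), '\0'), s.end());
    }
    return s;
}
```

This only catches NUL bytes; it does not validate the string as UTF-8 or detect other problematic characters.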
Checking whether a string is valid UTF-8 is quite expensive.
A 0 is not even 'that invalid'. It is a C string terminator, yielding 'empty' strings when it comes first.
I didn't find an easy way to check validity.
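The terminator behaviour can be reproduced in isolation. The sketch below rebuilds the word from the report with two leading 0x00 bytes and compares the length a std::string reports with the length a NUL-terminated C-string API sees. The function names are made up for illustration, and whether libfolia's serialisation actually goes through such a C-string path is an assumption here, not something confirmed in this thread:

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Rebuild the string from the report: two 0x00 bytes, then the 31 visible characters.
std::string make_target_with_leading_nuls() {
    const std::string word = "eologico*phijsico*metaphijsicum";
    return std::string(2, '\0') + word;
}

// Length as a length-aware std::string sees it (embedded NULs are counted).
std::size_t cxx_length(const std::string& s) {
    return s.size();
}

// Length as any API taking a NUL-terminated const char* sees it.
std::size_t c_length(const std::string& s) {
    return std::strlen(s.c_str());
}
```

For the string above, `cxx_length` gives 33, matching the "length from textcontent 33" debug line, while `c_length` gives 0, which would match the empty `<t class="contemporary"></t>` in the output.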
OK, the problem occurred in a program that used the libicu API incorrectly, yielding invalid Unicode strings.
That is a quality-of-implementation problem in libicu, not in libfolia.