LanguageMachines/libfolia

value of TextContent disappears (empty string) upon add?

proycon opened this issue · 7 comments

Something goes wrong when I add TextContent with the value eologico*phijsico*metaphijsicum: libfolia adds an empty text content element instead! I have no idea what triggers this (a special meaning for the asterisk, perhaps?); other words are processed fine.

I add TextContent as follows:
https://github.com/LanguageMachines/foliautils/blob/wordtranslate/src/FoLiA-wordtranslate.cxx#L134

Debug output; I explicitly check that I'm not passing an empty string (even after trimming):

$ FoLiA-wordtranslate --outputclass contemporary -d lexicon.1637-2010.250.lexserv.vandale.tsv -p preservation2010.txt -r rules.machine aa__001biog01_01.tok.folia.xml

Loading dictionary...
Loading preserve lexicon...
Loading rules...
DEBUG: target before sanity check 'eologico*phijsico*metaphijsicum'
DEBUG: target after sanity check 'eologico*phijsico*metaphijsicum'
DEBUG: text after adding textcontent ''
finished aa__001biog01_01.tok.folia.xml 

Now I can't reproduce the above debug output anymore (the text content shows fine), but the serialisation to XML still has an empty text.

DEBUG: -- BEFORE APPEND --
DEBUG: target: 'eologico*phijsico*metaphijsicum'
DEBUG: after unicode encoding and decoding 'eologico*phijsico*metaphijsicum'
DEBUG: from textcontent 'eologico*phijsico*metaphijsicum'
DEBUG: length from textcontent  33
DEBUG: -- AFTER APPEND -- 
DEBUG: from textcontent 'eologico*phijsico*metaphijsicum'
DEBUG: length from textcontent  33
DEBUG: from word 'eologico*phijsico*metaphijsicum'
          <w xml:id="aa__001biog01_01.TEI.2.text.body.div.p.10802.s.1.w.2" class="WORD-COMPOUND">
            <t>theologico-physico-metaphysicum</t>
            <t class="contemporary"></t>
            <metric class="modernisationsource" value="rules"/>

OK, the following debug output shows the problem, though I still have no idea why:

DEBUG: target: 'eologico*phijsico*metaphijsicum'
DEBUG: from textcontent 'eologico*phijsico*metaphijsicum'
DEBUG: length from textcontent  33
DEBUG: from word 'eologico*phijsico*metaphijsicum'
DEBUG: XML serialisation '<w xmlns="http://ilk.uvt.nl/folia" xml:id="aa__001biog01_01.TEI.2.text.body.div.p.10802.s.1.w.2" class="WORD-COMPOUND"><t>theologico-physico-metaphysicum</t><t class="contemporary"></t></w>'  

Debug code is committed: https://github.com/LanguageMachines/foliautils/blob/wordtranslate/src/FoLiA-wordtranslate.cxx#L143

It also fails on the following other input words, which my tool probably also mangles into strings with asterisks (incorrectly, but that's not the issue here):

  • Onderwierum-en-Westerdijkshorn
  • Hollandsch-Hoogduitsch-Israelitische
  • Arrondissements-kiescollegie
  • Kollumerland-en-Nieuw-Kruisland

Hah, there seem to be two 0x00 bytes in front of the string! That would explain things. I should have counted the characters better :)
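For anyone hitting the same thing, here is a minimal sketch of the effect, assuming the text ends up being read as a C string somewhere along the way (I have not checked where libfolia or libxml2 does that). Both helper names are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: rebuilds a string like the one in this report,
// i.e. two 0x00 bytes followed by the visible text.
std::string with_leading_nuls(const std::string& text) {
    // Note the parentheses: std::string(count, ch), not an initializer list.
    return std::string(2, '\0') + text;
}

// Hypothetical helper: the length that any code treating the data as a
// NUL-terminated C string would see.
std::size_t c_string_length(const std::string& s) {
    return std::strlen(s.c_str());
}
```

With the word from this report, `with_leading_nuls("eologico*phijsico*metaphijsicum").size()` is 33 (31 visible characters plus the two 0x00 bytes), which matches the length in the debug output, while `c_string_length` on the same string is 0. Printing the std::string to a terminal hides the 0x00 bytes, which is why the debug lines looked fine.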

Conclusion: this seems to happen when there are invalid characters in the string. I think it would be helpful if this could be caught and a warning printed when appending text, provided the check is not too expensive.

Checking whether a string is valid UTF-8 is quite expensive.
A 0 byte is not even 'that invalid'. It is the C string terminator, so a string with a leading 0 reads as 'empty'.
I didn't find an easy way to check validity.
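Leaving full UTF-8 validation aside, a check for 0x00 bytes alone is a single linear scan. The sketch below shows what a caller-side guard could look like; `contains_nul` and `warn_if_nul` are hypothetical names, not part of libfolia, and whether such a check belongs in the library at all is a separate question.

```cpp
#include <iostream>
#include <string>

// Hypothetical helper, not part of libfolia: true if the string
// contains a 0x00 byte anywhere.
bool contains_nul(const std::string& s) {
    return s.find('\0') != std::string::npos;
}

// Hypothetical guard a caller could run before appending text.
// Writes a warning to `err` and returns true if a 0x00 byte is found.
bool warn_if_nul(const std::string& s, std::ostream& err = std::cerr) {
    const std::string::size_type pos = s.find('\0');
    if (pos == std::string::npos) {
        return false;
    }
    err << "warning: 0x00 byte at offset " << pos
        << " in text of " << s.size() << " bytes\n";
    return true;
}
```

This only catches the specific case from this issue. It says nothing about other malformed byte sequences, which would still need a real UTF-8 validator.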

OK. The problem occurred in a program that used the libicu API incorrectly, yielding invalid Unicode strings.
That is a quality-of-implementation problem in libicu, not in libfolia.