A few Terms bugs in our huge corpus
jan-niestadt commented
In chn-intern, running the TermSerialization tool finds terms that don't "round-trip" correctly (i.e. get the id for the term, then get the term for that id again). There are not many of them for a 3B-word corpus. All of them include either a dash or an unusual Unicode character. The dash problem may actually involve another "dash-like" character, since there are several of those (e.g. en dash, em dash, soft hyphen).
This should be investigated further. It may be a bug in the Terms code, a problem during indexing, or something else.
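For reference, a minimal sketch of what a round-trip check of this kind might look like. This is not the actual TermSerialization code: `find_round_trip_failures`, `term_at` and `index_of` are hypothetical names, and the toy lookup table is deliberately missing one entry to simulate a failure. It assumes that `termId2 == -1` in the output means the lookup by term string returned "not found".

```python
from typing import Callable, List


def find_round_trip_failures(
    num_terms: int,
    term_at: Callable[[int], str],   # hypothetical: term id -> term string
    index_of: Callable[[str], int],  # hypothetical: term string -> term id, -1 if not found
) -> List[str]:
    """Report every term whose id does not survive id -> term -> id."""
    failures = []
    for term_id in range(num_terms):
        term = term_at(term_id)
        term_id2 = index_of(term)
        if term_id2 != term_id:
            failures.append(f"termId2 == {term_id2}: '{term}'")
    return failures


# Toy data: the lookup table lacks "-teken" on purpose, to simulate the bug.
terms = ["teken", "-teken", "mail"]
lookup = {"teken": 0, "mail": 2}

failures = find_round_trip_failures(
    len(terms),
    lambda i: terms[i],
    lambda t: lookup.get(t, -1),
)
for line in failures:
    print(line)
```

Running the same check separately against the in-memory structure and against the serialized form could help narrow down whether the problem is in the Terms code or in indexing.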
Unusual characters that appear in the output:
<0xfeff> ZERO WIDTH NO-BREAK SPACE
<0x200e> LEFT-TO-RIGHT MARK
termId2 == -1: '-teken'
termId2 == -1: '-jes'
termId2 == -1: '-mail'
termId2 == -1: '-uitgave'
termId2 == -1: 'NVP-directeur'
termId2 == -1: 'mai-tres'
termId2 == -1: '-teken'
termId2 == -1: 'Bene-decreten'
termId2 == -1: '-day'
termId2 == -1: '-de'
termId2 == -1: '-of'
termId2 == -1: 'frai-che'
termId2 == -1: '<0xfeff>Uiteindelijk'
termId2 == -1: 'DNA<0x200e>-sporen<0x200e>'
termId2 == -1: 'Media<0x200e>-aandacht'
termId2 == -1: 'KGB<0x200e>-agente'
termId2 == -1: 'KGB<0x200e>-zaken<0x200e>'
termId2 == -1: 'Luitenant<0x200e>-kolonel'
termId2 == -1: 'NAVO<0x200e>-wiendelijk<0x200e>'
termId2 == -1: 'Play<0x200e>-off'
termId2 == -1: 'Vol'<0x200e>-licht'
termId2 == -1: 'je<0x200e>-weet<0x200e>-wel<0x200e>-wie<0x200e>'
termId2 == -1: 'make<0x200e>-up'
termId2 == -1: 'try<0x200e>-out'
termId2 == -1: 'Dunlap-mokkel'
termId2 == -1: 'pocus-gezeik'
termId2 == -1: 'priv-leven'
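To check whether the plain-looking dashes in the list above are really ASCII hyphens, the terms could be dumped with every invisible or dash-like character made explicit. A minimal sketch using Python's standard `unicodedata` module follows. The `reveal` and `describe` helpers are hypothetical, and the choice of which characters to escape (Unicode category `Cf`, plus category `Pd` other than the ASCII hyphen) is an assumption, not something the tool is known to do.

```python
import unicodedata


def is_suspicious(ch: str) -> bool:
    """Format characters (Cf: soft hyphen, LRM, U+FEFF) and non-ASCII dashes (Pd)."""
    category = unicodedata.category(ch)
    return category == "Cf" or (category == "Pd" and ch != "-")


def reveal(term: str) -> str:
    """Render suspicious characters as <0x....>, leaving everything else as-is."""
    return "".join(
        f"<0x{ord(ch):04x}>" if is_suspicious(ch) else ch
        for ch in term
    )


def describe(term: str) -> list:
    """List the code point and Unicode name of each suspicious character."""
    return [
        f"<0x{ord(ch):04x}> {unicodedata.name(ch, 'UNKNOWN')}"
        for ch in term
        if is_suspicious(ch)
    ]


print(reveal("KGB\u200e-agente"))
print(describe("\ufeffUiteindelijk"))
print(reveal("NVP\u2013directeur"))  # en dash instead of ASCII hyphen
```

If a term such as 'NVP-directeur' comes back from `reveal` unchanged, its dash is a plain ASCII hyphen and the cause lies elsewhere.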