UTF-16 character issues.
kalaspuffar opened this issue · 3 comments
Hi @bertfrees
I found an issue with one book and the dotify library when we tried to translate an English book that contains an alpha character. This character is a single code point but is stored as multiple chars, so asking for the string length returns more than the number of code points.
I will submit a test case that showcases the issue.
https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF
Best regards
Daniel
More specifically, in this case we had two 16-bit values that encoded a single character and codepoint: \uD835\uDEFC
Following the instructions on the Wikipedia page, I get that the two values decode as:
0xD835 - 0xD800 = 0x0035 = 53 in decimal
0xDEFC - 0xDC00 = 0x02FC = 764 in decimal
And the codepoint is:
2^16 + 53 * 2^10 + 764 = 120 572 = 0x1D6FC
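The same arithmetic can be checked in Java. This is only a minimal sketch: the class name `SurrogateDemo` and the helper `decode` are made up for illustration, while `codePointAt`, `length` and `codePointCount` are standard `java.lang.String` methods.

```java
class SurrogateDemo {
    // Hypothetical helper: combine a high and a low surrogate by hand,
    // using the same steps as the calculation above.
    static int decode(char high, char low) {
        int highBits = high - 0xD800; // 0x0035 = 53 for \uD835
        int lowBits = low - 0xDC00;   // 0x02FC = 764 for \uDEFC
        return 0x10000 + (highBits << 10) + lowBits;
    }

    public static void main(String[] args) {
        String alpha = "\uD835\uDEFC";
        int manual = decode(alpha.charAt(0), alpha.charAt(1));

        System.out.println(Integer.toHexString(manual));             // 1d6fc
        System.out.println(manual == alpha.codePointAt(0));          // true
        System.out.println(alpha.length());                          // 2 (UTF-16 chars)
        System.out.println(alpha.codePointCount(0, alpha.length())); // 1 (code point)
    }
}
```

The last two lines show the mismatch: the string holds one code point but reports a length of two.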
Thanks.
Actually, liblouis-java expects the length of the "characterAttributes" argument to match the length of the Java string (its char array), not the number of code points. But I have now found out that I was doing it all wrong: the "typeform" and "characterAttributes" arguments simply did not work when the input had Unicode characters above U+FFFF.
Will be fixed in the next release.
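For callers who track attributes per code point, a rough sketch of expanding them to one entry per Java char might look like the following. This is not the liblouis-java fix and not its API: the class `AttributeExpander`, the method name and the `int[]` element type are all assumptions made for illustration.

```java
class AttributeExpander {
    // Hypothetical helper: repeat each per-code-point value once for every
    // UTF-16 char that the code point occupies in the string.
    static int[] perCodePointToPerChar(String text, int[] perCodePoint) {
        int codePoints = text.codePointCount(0, text.length());
        if (perCodePoint.length != codePoints) {
            throw new IllegalArgumentException(
                "expected " + codePoints + " values, got " + perCodePoint.length);
        }
        int[] perChar = new int[text.length()];
        int cp = 0;
        for (int i = 0; i < text.length(); cp++) {
            // 1 for characters up to U+FFFF, 2 for characters above U+FFFF
            int width = Character.charCount(text.codePointAt(i));
            for (int j = 0; j < width; j++) {
                perChar[i + j] = perCodePoint[cp];
            }
            i += width;
        }
        return perChar;
    }

    public static void main(String[] args) {
        int[] expanded = perCodePointToPerChar("a\uD835\uDEFCb", new int[] {1, 2, 3});
        System.out.println(java.util.Arrays.toString(expanded)); // [1, 2, 2, 3]
    }
}
```

The resulting array has the same length as the Java string, which is the length the comment above says liblouis-java expects.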