UTF-16 character issues.

Question

kalaspuffar opened this issue 3 years ago · 3 comments

We found an issue with one book and the Dotify library when we tried to translate an English book that contains an alpha character. This character lies above U+FFFF, so UTF-16 encodes it as two 16-bit values: it is a single code point, but the string length reports multiple characters.
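A minimal sketch of the mismatch, assuming the character is U+1D6FC (the value discussed in the first answer below). The class and method names are made up for illustration; only `String.length()` and `String.codePointCount()` are standard Java calls.

```java
public class SurrogateLengthDemo {
    // One alpha-like character above U+FFFF, written as its UTF-16 surrogate pair.
    static final String ALPHA = "\uD835\uDEFC";

    // Number of UTF-16 code units (Java chars) in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println("chars:       " + charCount(ALPHA));
        System.out.println("code points: " + codePointCount(ALPHA));
    }
}
```

Any code that sizes a per-character array from one of these counts and indexes it with the other will go out of step as soon as such a character appears.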

I will submit a test case that showcases the issue.

https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF

Best regards
Daniel

Answer 1 · 2021-06-02T13:10:24.000Z

More specifically, in this case we had two 16-bit values that encoded a single character and codepoint: \uD835\uDEFC

Following the instructions on the Wikipedia page, I get that the values encode:

0xD835 - 0xD800 = 0x0035 = 53 in decimal
0xDEFC - 0xDC00 = 0x02FC = 764 in decimal

And the codepoint is:
2^16 + 53 * 2^10 + 764 = 120 572 = 0x1D6FC

https://unicode-table.com/en/#1D6FC
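The same arithmetic can be checked in Java. This is only a sketch: `SurrogateDecodeDemo` and `decode` are illustrative names, while `Character.toCodePoint` is the standard library call that performs the equivalent conversion.

```java
public class SurrogateDecodeDemo {
    // Combine a high and a low surrogate into one code point,
    // following the steps worked through above.
    static int decode(char high, char low) {
        if (!Character.isHighSurrogate(high) || !Character.isLowSurrogate(low)) {
            throw new IllegalArgumentException("not a surrogate pair");
        }
        int highBits = high - 0xD800;  // 0x0035 = 53 for the pair in this issue
        int lowBits = low - 0xDC00;    // 0x02FC = 764 for the pair in this issue
        return 0x10000 + (highBits << 10) + lowBits;
    }

    public static void main(String[] args) {
        String alpha = "\uD835\uDEFC";
        int manual = decode(alpha.charAt(0), alpha.charAt(1));
        int builtin = Character.toCodePoint(alpha.charAt(0), alpha.charAt(1));
        System.out.println(Integer.toHexString(manual));
        System.out.println(manual == builtin);
    }
}
```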

Answer 2 · 2021-06-02T17:42:19.000Z

Thanks.

Actually liblouis-java expects that the length of the "characterAttributes" argument is the same as the length of the Java string (char array), not the number of code points. But I found out now that I was doing it all wrong and the "typeform" and "characterAttributes" arguments were just not working when the input had Unicode characters above U+FFFF.

Will be fixed in the next release.
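For callers that track one attribute per code point, a rough sketch of widening that array to one entry per Java char is shown here. It only illustrates the length expectation described above and is not the fix that went into liblouis-java; `AttributeExpansionDemo` and `expandToCharLength` are hypothetical names, and the `int` attribute type is an assumption.

```java
import java.util.Arrays;

public class AttributeExpansionDemo {
    // Hypothetical helper: takes one attribute per code point and returns
    // one attribute per UTF-16 char, repeating the value for surrogate pairs.
    static int[] expandToCharLength(String text, int[] perCodePoint) {
        int codePoints = text.codePointCount(0, text.length());
        if (perCodePoint.length != codePoints) {
            throw new IllegalArgumentException(
                "expected " + codePoints + " attributes, got " + perCodePoint.length);
        }
        int[] perChar = new int[text.length()];
        int charIndex = 0;
        int cpIndex = 0;
        while (charIndex < text.length()) {
            int codePoint = text.codePointAt(charIndex);
            int width = Character.charCount(codePoint);  // 1 or 2 chars
            Arrays.fill(perChar, charIndex, charIndex + width, perCodePoint[cpIndex]);
            charIndex += width;
            cpIndex++;
        }
        return perChar;
    }

    public static void main(String[] args) {
        // Three code points, four chars: 'a', U+1D6FC, 'b'.
        String text = "a\uD835\uDEFCb";
        int[] expanded = expandToCharLength(text, new int[] {1, 2, 3});
        System.out.println(Arrays.toString(expanded));
    }
}
```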

Answer 3 · 2021-08-18T18:51:17.000Z

Fixed by commits aa08131 and 4b9cc74.