Properly measure Unicode beyond ASCII
JakeWharton opened this issue · 7 comments
SimpleTextLayout naively assumes char count is width.
I did some tests here and it mostly works quite well, but making it any better will be really hard. The problem is that not all characters are the same width, even in monospace fonts. This is especially common with CJK fonts, and it means you can never assume single characters will align across all of Unicode. Note that the Unicode spec does include width information that could help, but it's not part of the standard Java or Kotlin APIs.
Examples of text width: depending on your font, some, all, or none of each block should line up, since not all monospace fonts are made equal (hoping GitHub won't mess this up):
latin      | mmmmm |
Half-kana  | ﾈﾈﾈﾈﾈ |
Full-latin | ｍｍｍｍｍ |
Full-kana  | ネネネネネ |
Emoji      | 😃😃😃😃😃 |
CJK        | 北北北北北 |
I've made a PR for a fix that will measure all characters consistently, but this may make alignment worse for emoji, which are generally closer to full width (although often not exactly, and this is font-dependent), and we shouldn't rely on this given that there are many full-width characters in the BMP.
The better (but potentially horrible to implement) fix would be to use the native font and rendering APIs to measure the text. These are all platform-specific (Windows, Android, iOS...) and would require consumers to specify the output font.
But even if you have a pixel size for text, it's not clear how it should be aligned using only other characters. Unicode defines a whole bunch of space characters with different widths, but support and size will again depend on the font. It's probably best to accept this as a known limitation for now.
Bonus: just to make measuring even more impossible, emoji can be modified, meaning up to 7 Unicode characters can turn into a single glyph (depending on font and OS support).
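To illustrate the point about modified emoji, here is a small Java sketch that compares the UTF-16 char count with the code point count for a "family" emoji built from four emoji joined by zero-width joiners. The class and method names (`EmojiCounts`, `chars`, `codePoints`) are made up for this example and are not part of the library.

```java
final class EmojiCounts {
    private EmojiCounts() {}

    // man, ZWJ, woman, ZWJ, girl, ZWJ, boy: seven code points that a
    // supporting font/OS may render as a single glyph.
    static final String FAMILY =
        "\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC67\u200D\uD83D\uDC66";

    // Number of UTF-16 code units, which is what String.length() reports.
    static int chars(String s) {
        return s.length();
    }

    // Number of Unicode code points (a surrogate pair counts once).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }
}
```

Here `EmojiCounts.chars(EmojiCounts.FAMILY)` is 11 and `EmojiCounts.codePoints(EmojiCounts.FAMILY)` is 7, yet the rendered result may occupy a single glyph, so neither count matches the visible width.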
Update: character width data is available in icu4j: https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/lang/UProperty.html#EAST_ASIAN_WIDTH. This won't fix any of the above issues with fonts but will be a good indication whether to measure a char as one or two "units".
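As a rough idea of what "one or two units" could look like, here is a sketch that uses only the standard Java library. It is an approximation, not the real East Asian Width data: it treats Han, Hiragana, Katakana, and Hangul code points plus the fullwidth forms as two units and everything else (including emoji) as one. The names `CellWidth`, `unitsOf`, and `measure` are hypothetical; with icu4j you would look up the `EAST_ASIAN_WIDTH` property instead of these hand-written rules.

```java
final class CellWidth {
    private CellWidth() {}

    // Approximate width in "units": 2 for wide/fullwidth code points, else 1.
    // Assumption: script and block checks stand in for real East Asian Width data.
    static int unitsOf(int codePoint) {
        // Halfwidth and Fullwidth Forms block (U+FF00..U+FFEF):
        // U+FF01..U+FF60 and U+FFE0..U+FFE6 are fullwidth, the rest
        // (such as halfwidth katakana) are narrow.
        if (codePoint >= 0xFF00 && codePoint <= 0xFFEF) {
            boolean fullwidth = (codePoint >= 0xFF01 && codePoint <= 0xFF60)
                || (codePoint >= 0xFFE0 && codePoint <= 0xFFE6);
            return fullwidth ? 2 : 1;
        }
        switch (Character.UnicodeScript.of(codePoint)) {
            case HAN:
            case HIRAGANA:
            case KATAKANA:
            case HANGUL:
                return 2;
            default:
                return 1; // emoji and everything else fall through as one unit here
        }
    }

    // Sum the approximate width of every code point in the text.
    static int measure(String text) {
        return text.codePoints().map(CellWidth::unitsOf).sum();
    }
}
```

With this sketch, five Latin letters or five halfwidth katakana measure 5 units, while five fullwidth Latin letters, fullwidth katakana, or CJK ideographs measure 10. Whether that matches what the terminal or font actually draws is still outside the code's control.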
Came here to report this in a Kotlin script.
Unfortunately even with proper measurement, emoji rarely conform to monospace properly so the border characters and subsequent columns will always be misaligned.
The library now handles ANSI escape sequences (which measure to zero) and multi-char codepoints (which measure to one). Going to close for want of specific issues which are not dealing with monospace fonts and their lack of support for emoji and non-Western scripts.
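For readers wondering what that measurement rule amounts to, here is a minimal Java sketch of the idea, not the library's actual code. It assumes ANSI escapes are limited to CSI sequences of the form ESC `[` parameters, then a letter, and the names `AnsiAwareWidth` and `measure` are invented for this example.

```java
import java.util.regex.Pattern;

final class AnsiAwareWidth {
    private AnsiAwareWidth() {}

    // Assumption: only CSI sequences such as ESC[31m or ESC[0m are handled.
    private static final Pattern ANSI_CSI = Pattern.compile("\u001B\\[[0-9;]*[A-Za-z]");

    // Escape sequences contribute zero; every remaining code point counts as one,
    // so a surrogate pair (two chars) still measures as a single unit.
    static int measure(String text) {
        String visible = ANSI_CSI.matcher(text).replaceAll("");
        return visible.codePointCount(0, visible.length());
    }
}
```

For example, `AnsiAwareWidth.measure("\u001B[31mred\u001B[0m")` is 3, and a single emoji encoded as a surrogate pair measures 1.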