Utf-8 support for 4 byte chars
avrecko opened this issue · 2 comments
Looking at the Utf8 code, I love the simplicity, but it looks like there is no support for 4-byte UTF-8 characters. I came across a situation where a 4-byte UTF-8 character was encoded incorrectly. The character in question: https://www.compart.com/en/unicode/U+1F3A9.
Any chance of adding 4-byte UTF-8 support to https://github.com/odnoklassniki/one-nio/blob/ab2e0a6adbc3017540dc7e6691a11331fb1ed942/src/one/nio/util/Utf8.java?
The encoding used by the Utf8 class is not real UTF-8, but rather Modified UTF-8 as specified by the DataInput/DataOutput API. It handles 4-byte characters, but in a different way: the high surrogate and low surrogate characters are encoded separately.
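To illustrate the difference, here is a minimal sketch that encodes U+1F3A9 both ways using only the JDK. It uses DataOutputStream.writeUTF as a stand-in for the Modified UTF-8 scheme and does not call one-nio's Utf8 class, so treat it as an approximation of the behaviour described above. The class and method names (ModifiedUtf8Demo, standardUtf8, modifiedUtf8, hex) are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Demo {
    // Standard UTF-8: a supplementary code point becomes one 4-byte sequence.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 as written by DataOutput.writeUTF: each UTF-16 surrogate
    // is encoded on its own as a 3-byte sequence. writeUTF also emits a
    // 2-byte length prefix, which is stripped here.
    static byte[] modifiedUtf8(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Hypothetical helper: space-separated upper-case hex dump.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // U+1F3A9 is the surrogate pair D83C DFA9 in UTF-16.
        String topHat = new String(Character.toChars(0x1F3A9));
        System.out.println(hex(standardUtf8(topHat))); // F0 9F 8E A9
        System.out.println(hex(modifiedUtf8(topHat))); // ED A0 BC ED BE A9
    }
}
```

A strict UTF-8 decoder rejects the six-byte form, because ED A0 BC and ED BE A9 are encoded surrogates, which standard UTF-8 does not allow. That is likely why the character looked incorrectly encoded when read by something expecting real UTF-8.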
The Utf8 class was originally made for the one-nio serialization framework, and the encoding rules were somewhat similar to the DataInput/DataOutput ones. DataStream uses big endian for the same reason. It probably wasn't the best choice, but it already works in many places, and there are no plans to make breaking changes in this area.
For true UTF-8 conversion, the standard CharsetEncoder is just fine.
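For reference, a minimal sketch of that route, assuming the input is a String and the whole result fits in memory. The class name StandardUtf8Encoder and the encode helper are made up for illustration. Only the java.nio.charset calls are real JDK API.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StandardUtf8Encoder {
    // Encodes to standard UTF-8. REPORT makes the encoder throw on malformed
    // input such as an unpaired surrogate instead of silently substituting '?'.
    static byte[] encode(String s) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer encoded = encoder.encode(CharBuffer.wrap(s));
        byte[] out = new byte[encoded.remaining()];
        encoded.get(out);
        return out;
    }

    public static void main(String[] args) throws CharacterCodingException {
        String topHat = new String(Character.toChars(0x1F3A9));
        System.out.println(encode(topHat).length); // 4
    }
}
```

A CharsetEncoder instance is not thread-safe, so this sketch creates a new one per call. Reusing one encoder per thread is an option if allocation shows up in a benchmark.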
Makes sense. At first glance I thought it was done for performance reasons, as in my benchmarking Utf8 outperforms CharsetEncoder.
Will just go with CharsetEncoder. Thank you for the clarification.