odnoklassniki/one-nio

Utf-8 support for 4 byte chars

avrecko opened this issue · 2 comments

I have been looking at the Utf8 code, and I love the simplicity. But it looks like there is no support for 4-byte UTF-8 characters. I came across a situation where a 4-byte UTF-8 character was encoded incorrectly. The character in question: https://www.compart.com/en/unicode/U+1F3A9.

Any chance to add 4 byte utf8 support in
https://github.com/odnoklassniki/one-nio/blob/ab2e0a6adbc3017540dc7e6691a11331fb1ed942/src/one/nio/util/Utf8.java?

The encoding used by the Utf8 class is not real UTF-8, but rather the Modified UTF-8 specified by the DataInput/DataOutput API. It handles 4-byte characters, but in a different way: the high surrogate and low surrogate characters are encoded separately, as 3 bytes each.
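To illustrate the difference, here is a minimal sketch that uses only JDK classes (not one-nio itself). It assumes `DataOutputStream.writeUTF` is a fair stand-in for the Modified UTF-8 rules described above; the class and method names are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class: compares standard UTF-8 with the JDK's Modified UTF-8.
public class ModifiedUtf8Demo {

    // Encodes with DataOutputStream.writeUTF and drops its 2-byte length prefix.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = bytes.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Encodes with the standard UTF-8 charset.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Formats bytes as space-separated uppercase hex.
    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F3A9 is stored in a Java String as the surrogate pair D83C DFA9.
        String topHat = new String(Character.toChars(0x1F3A9));
        System.out.println("standard UTF-8: " + hex(standardUtf8(topHat))); // F0 9F 8E A9
        System.out.println("modified UTF-8: " + hex(modifiedUtf8(topHat))); // ED A0 BC ED BE A9
    }
}
```

A strict UTF-8 decoder will reject the 6-byte form, because encoded surrogates are not valid UTF-8. That is likely why the character looked incorrectly encoded on the receiving side.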

The Utf8 class was originally made for the one-nio serialization framework, and its encoding rules were somewhat similar to the DataInput/DataOutput ones. DataStream uses big endian for the same reason. It probably wasn't the best choice, but it already works in many places, and there are no plans to make breaking changes in this area.

For true UTF-8 conversion, the standard CharsetEncoder works just fine.
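A minimal sketch of that approach, assuming the input is an ordinary Java String; the class and method names here are illustrative only and are not part of one-nio.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: strict standard UTF-8 encoding via CharsetEncoder.
public class StandardUtf8Encode {

    static byte[] encode(String s) {
        // REPORT makes the encoder fail on unpaired surrogates
        // instead of silently substituting a replacement byte.
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            ByteBuffer buf = encoder.encode(CharBuffer.wrap(s));
            byte[] out = new byte[buf.remaining()];
            buf.get(out);
            return out;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Input is not well-formed UTF-16", e);
        }
    }

    public static void main(String[] args) {
        String topHat = new String(Character.toChars(0x1F3A9));
        // The surrogate pair is combined into a single 4-byte sequence.
        System.out.println(encode(topHat).length); // 4
    }
}
```

A CharsetEncoder instance is not safe for concurrent use, so this sketch creates a new one per call. Reusing one per thread may be worth considering if encoding is on a hot path.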

Makes sense. At first glance I thought it was done for performance reasons, since in my benchmarks Utf8 outperforms CharsetEncoder.

I will just go with CharsetEncoder. Thank you for the clarification.