Unicode Character Normalization
feO2x opened this issue
feO2x commented
According to Daniel Lemire's blog post, Unicode strings should be normalized because the same glyph can be represented by different combinations of Unicode code points. For example, the glyph é can be represented either by the single code point `\u00e9` or by the sequence `\u0065\u0301` (e followed by a combining acute accent). This is especially problematic when comparing two strings that render the same glyph: because they consist of different code points, they are not considered equal.
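A quick C# sketch (not from the issue, just an illustration of the problem) shows that an ordinal comparison treats the two representations as different, while normalizing one of them to Form C via `string.Normalize` makes them compare equal:

```csharp
using System;
using System.Text;

// "é" as a single precomposed code point vs. "e" followed by a combining acute accent
var precomposed = "\u00e9";
var decomposed = "\u0065\u0301";

// The == operator performs an ordinal comparison of the code points,
// so the two strings are not considered equal.
Console.WriteLine(precomposed == decomposed); // False

// After normalizing to Normalization Form C, the decomposed string
// also consists of the single code point U+00E9.
Console.WriteLine(precomposed == decomposed.Normalize(NormalizationForm.FormC)); // True
```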
As we already normalize strings when calling `context.Check(dto.SomeStringValue)`, we should allow end users to normalize Unicode code points, too. We have to measure how much this costs in terms of performance and maybe make it an opt-in feature.
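A minimal sketch of what such an opt-in helper could look like, assuming a hypothetical `NormalizeUnicode` extension method (the name and the integration point are assumptions, not part of the existing library API):

```csharp
using System.Text;

// Hypothetical extension - how this plugs into context.Check would need to
// follow the library's existing check/extension pattern.
public static class UnicodeNormalizationExtensions
{
    // Converts the value to Normalization Form C so that canonically
    // equivalent strings become binary-equal. Skips the conversion (and the
    // associated allocation) when the string is already normalized.
    public static string NormalizeUnicode(this string value) =>
        string.IsNullOrEmpty(value) || value.IsNormalized(NormalizationForm.FormC)
            ? value
            : value.Normalize(NormalizationForm.FormC);
}

// Hypothetical usage:
// var normalized = dto.SomeStringValue.NormalizeUnicode();
```

Checking `IsNormalized` first avoids allocating a new string for input that is already in Form C, which should keep the overhead small for typical ASCII-only values; an actual benchmark would still be needed before deciding whether this should be opt-in or on by default.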