Unicode Character Normalization
feO2x opened this issue
feO2x commented
According to Daniel Lemire's blog post, Unicode strings should be normalized because the same glyph can be represented by different combinations of Unicode code points. For example, the glyph é can be represented either by the single code point `\u00e9` or by the sequence `\u0065\u0301` (e followed by a combining acute accent). This is especially problematic when comparing two strings that render the same glyph: because they consist of different code points, they are not considered equal.
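A quick C# sketch (not from the issue, just an illustration of the problem) shows that an ordinal comparison treats the two representations as different, while normalizing one of them to Form C via `string.Normalize` makes them compare equal:

```csharp
using System;
using System.Text;

// "é" as a single precomposed code point vs. "e" followed by a combining acute accent
var precomposed = "\u00e9";
var decomposed = "\u0065\u0301";

// The == operator performs an ordinal comparison of the code points,
// so the two strings are not considered equal.
Console.WriteLine(precomposed == decomposed); // False

// After normalizing to Normalization Form C, the decomposed string
// also consists of the single code point U+00E9.
Console.WriteLine(precomposed == decomposed.Normalize(NormalizationForm.FormC)); // True
```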
As we already normalize strings when calling `context.Check(dto.SomeStringValue)`, we should allow end users to normalize Unicode code points, too. We have to measure how much this costs in terms of performance and maybe make it an opt-in feature.
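A minimal sketch of what such an opt-in helper could look like, assuming a hypothetical `NormalizeUnicode` extension method (the name and the integration point are assumptions, not part of the existing library API):

```csharp
using System.Text;

// Hypothetical extension - how this plugs into context.Check would need to
// follow the library's existing check/extension pattern.
public static class UnicodeNormalizationExtensions
{
    // Converts the value to Normalization Form C so that canonically
    // equivalent strings become binary-equal. Skips the conversion (and the
    // associated allocation) when the string is already normalized.
    public static string NormalizeUnicode(this string value) =>
        string.IsNullOrEmpty(value) || value.IsNormalized(NormalizationForm.FormC)
            ? value
            : value.Normalize(NormalizationForm.FormC);
}

// Hypothetical usage:
// var normalized = dto.SomeStringValue.NormalizeUnicode();
```

Checking `IsNormalized` first avoids allocating a new string for input that is already in Form C, which should keep the overhead small for typical ASCII-only values; an actual benchmark would still be needed before deciding whether this should be opt-in or on by default.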