Trimming on multivalue breaks special characters
We call `trim` after `explode`. Both are naive byte-based functions that don't understand UTF-8. `trim` is particularly problematic because it takes the set of characters to strip as a string and treats every byte of that string as a separate character.
We specifically pass `trim` an argument meant to add non-breaking spaces (U+00A0 and U+202F) to the characters it strips, and this just doesn't work: both are multi-byte characters in UTF-8. `trim` interprets the argument not as 2 Unicode characters to trim off, but as 5 separate bytes: `C2 A0 E2 80 AF`. Any of those bytes appearing at the beginning or end of a string will be trimmed. This corrupts legitimate UTF-8 characters: for example, `à` is UTF-8 encoded as `C3 A0`, so our `trim` strips the `A0` and leaves us with `C3`, which is invalid by itself.
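The behaviour described above can be reproduced in a few lines. This is a minimal sketch, not the project's actual code: the exact character list is an assumption reconstructed from the bytes quoted in this issue, and `utf8_trim` is a hypothetical helper name showing one possible UTF-8-aware alternative based on `preg_replace` with the `/u` modifier.

```php
<?php
// Assumed character list: trim()'s default whitespace plus U+00A0 and U+202F.
// trim() sees the last two as the bytes C2 A0 E2 80 AF, not as 2 characters.
$chars = " \t\n\r\0\x0B\u{00A0}\u{202F}";

$value  = "voil\u{00E0}";        // "voilà", ending in the bytes C3 A0
$broken = trim($value, $chars);  // the trailing A0 byte is stripped

echo bin2hex(substr($value, -2)), "\n";   // c3a0
echo bin2hex(substr($broken, -1)), "\n";  // c3

// preg_match() with /u returns false when the subject is not valid UTF-8.
$brokenIsValid = preg_match('//u', $broken) === 1;
var_dump($brokenIsValid);                 // bool(false)

// Hypothetical replacement: with /u the pattern matches code points,
// so the two non-breaking spaces are handled as whole characters.
function utf8_trim(string $s): string
{
    $class = '[\s\x{00A0}\x{202F}]+';
    $out = preg_replace('/^' . $class . '|' . $class . '$/u', '', $s);

    // preg_replace() returns null on error (for example invalid UTF-8 input);
    // in that case the input is returned unchanged.
    return $out ?? $s;
}

$fixed = utf8_trim("\u{00A0} voil\u{00E0} \u{202F}");
echo bin2hex($fixed), "\n";               // 766f696cc3a0
```

The regex approach keeps the trailing `à` intact because the character class is matched per code point rather than per byte. Whether a regex or an `mb_*`-based approach fits better here depends on the surrounding code.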