omeka-s-modules/CSVImport

Trimming on multivalue breaks special characters


We use trim after explode. Both trim and explode are naive byte-based functions that don't understand UTF-8. trim is particularly problematic because it takes the set of characters to trim as a string and treats every byte of that string as a separate character.
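As a minimal illustration of that byte-wise behavior (not the module's actual code; the characters are chosen only for the example), asking trim to strip è (bytes C3 A8) from a string consisting of é (bytes C3 A9) removes the shared lead byte and leaves a stray continuation byte:

```php
<?php
// Illustrative values only: "é" is U+00E9 (C3 A9), "è" is U+00E8 (C3 A8).
$value = "\u{00E9}";
$mask  = "\u{00E8}";

// trim() sees the mask as the two bytes C3 and A8, not as one character.
// The leading C3 of "é" is in that set, so it is stripped; A9 is not.
$result = trim($value, $mask);

echo bin2hex($value), "\n";  // c3a9
echo bin2hex($result), "\n"; // a9
```

The result is a lone A9 byte, which is not valid UTF-8, even though the mask never contained é.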

We specifically pass trim an argument that tries to add non-breaking spaces (U+00A0 and the narrow no-break space U+202F) to the set of things it trims, and this just doesn't work: both are multi-byte characters in UTF-8. trim interprets that argument not as 2 Unicode characters to trim off, but as 5 separate bytes: C2 A0 E2 80 AF. Any of those bytes appearing at the beginning or end of a string will be trimmed. This corrupts legitimate UTF-8 text: for example, à is encoded as C3 A0, so for a value ending in à our trim strips the A0 and leaves a trailing C3, which is invalid by itself.
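The failure can be reproduced with a short script. The mask below is an assumption that mirrors the description above (ASCII whitespace plus U+00A0 and U+202F), and trimUnicode is a hypothetical helper, not something the module provides. It sketches one possible UTF-8-aware alternative using preg_replace with the u modifier:

```php
<?php
// Assumed mask: trim()'s default ASCII whitespace plus U+00A0 and U+202F.
$mask = " \t\n\r\0\x0B\u{00A0}\u{202F}";

// "voilà": the final "à" is encoded as C3 A0.
$value = "voil\u{00E0}";

// Byte-wise trim strips the trailing A0 and leaves a dangling C3.
$broken = trim($value, $mask);
echo bin2hex($broken), "\n"; // 766f696cc3

// Hypothetical helper: strip whitespace and both no-break spaces as whole
// code points. The u modifier makes PCRE treat pattern and subject as UTF-8.
function trimUnicode(string $s): string
{
    $class = '[\s\x{00A0}\x{202F}]+';
    $result = preg_replace('/\A' . $class . '|' . $class . '\z/u', '', $s);
    // preg_replace returns null on failure (e.g. invalid UTF-8 input);
    // fall back to the untouched string in that case.
    return $result ?? $s;
}

$fixed = trimUnicode("\u{00A0} voil\u{00E0}\u{202F}");
echo bin2hex($fixed), "\n"; // 766f696cc3a0
```

With the regex approach the surrounding no-break spaces are removed while the trailing à keeps both of its bytes.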