davedelong/CHCSVParser

Problem reading little-endian Unicode

samalone opened this issue · 1 comments

I have an NSString that was converted from little-endian Unicode NSData. The first character of the string is the Unicode byte-order mark (BOM), which in little-endian order is the byte sequence 0xFF 0xFE.

The first time _loadMoreIfNecessary calls initWithBytes:length:encoding:, the BOM is in the buffer and the buffer is read correctly. However, when the second buffer is converted, there is no BOM, and the data is treated as big-endian. This means that the second and all subsequent buffers of data are corrupted.

In one sense, the bug is that _loadMoreIfNecessary converts each buffer of text independently rather than maintaining conversion state from one buffer to the next. In general, text encodings require context to handle multi-byte characters, byte-order marks, and the like. A more robust version of this function would use the lower-level Text Encoding Converter, which preserves that context across buffers.

But an easier fix might be to change initWithCSVString: to use a fixed encoding such as NSUTF16BigEndianStringEncoding rather than calling [csv fastestEncoding], which evaluates to the ambiguous NSUnicodeStringEncoding. I believe that using an unambiguous encoding would prevent the error, even if it's not as general a solution as the Text Encoding Converter.

The convenience initializers now use a fixed encoding (NSUTF8StringEncoding), but this would still be an issue for an NSInputStream provided to the designated initializer.