utf-8 file with BOM passes BOM as part of first header NAME

Question

If my file/stream starts with the UTF-8 BOM (the 3 bytes \xef\xbb\xbf), it is passed through as part of the first header name (or of the first data value on the first row).

Should unicodecsv handle this (remove it), or should the user sniff for and skip over it before instantiating the UnicodeReader class?

What do you think the best way to handle this is? FWIW, I'm using Python 2.7.
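
For reference, the "sniff for and skip over it" option could look roughly like the sketch below. It is not part of unicodecsv. `skip_utf8_bom` is a hypothetical helper name, and the sketch assumes the stream is opened in binary mode and is seekable.

```python
import codecs
import io


def skip_utf8_bom(stream):
    """Consume a leading UTF-8 BOM from a seekable binary stream, if present.

    Hypothetical helper for illustration; returns the same stream object,
    positioned just past the BOM (or at its original position if no BOM).
    """
    start = stream.tell()
    head = stream.read(len(codecs.BOM_UTF8))
    if head != codecs.BOM_UTF8:
        # No BOM: rewind so the caller sees the data unchanged.
        stream.seek(start)
    return stream


# In-memory stand-ins for a CSV file with and without a BOM.
with_bom = io.BytesIO(codecs.BOM_UTF8 + b"name,city\r\n")
without_bom = io.BytesIO(b"name,city\r\n")

first_line_with_bom = skip_utf8_bom(with_bom).readline()
first_line_without_bom = skip_utf8_bom(without_bom).readline()
```

The stream returned by the helper can then be handed to the reader as usual. A non-seekable stream (a pipe, for example) would need a different approach, such as buffering the first three bytes.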

Answer 1 · 2016-04-11T17:53:56.000Z

I'm pretty sure you can construct your reader with 'utf-8-sig' rather than 'utf-8', and the codec will strip the BOM for you.
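
Here is a minimal sketch of the difference between the two codecs. It uses the Python 3 standard-library `csv` module rather than unicodecsv so that it is self-contained. The sample data and the `read_header` helper are made up for illustration. With unicodecsv on Python 2.7, the equivalent would presumably be passing 'utf-8-sig' as the reader's encoding.

```python
import codecs
import csv
import io

# Made-up sample: a UTF-8 CSV that starts with a BOM, as Windows tools often write.
raw = codecs.BOM_UTF8 + b"name,city\r\nZo\xc3\xab,Oslo\r\n"


def read_header(data, encoding):
    """Decode the bytes with the given codec and return the first CSV row."""
    text = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline="")
    return next(csv.reader(text))


# Plain 'utf-8' keeps the BOM as U+FEFF at the start of the first field.
header_plain = read_header(raw, "utf-8")

# 'utf-8-sig' strips a leading BOM if present and otherwise behaves like 'utf-8'.
header_sig = read_header(raw, "utf-8-sig")
header_no_bom = read_header(raw[len(codecs.BOM_UTF8):], "utf-8-sig")
```

Because 'utf-8-sig' also decodes input that has no BOM, it should be safe to use for a mix of files from Windows and other sources.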

Answer 2 · 2016-04-11T19:04:05.000Z

Thanks. That seems to have fixed it. Most of my CSVs are generated on Windows even though I'm processing them on Linux.