tilo/smarter_csv

Check input file for BOM / Byte Order Mark (REGRESSION?)

tilo opened this issue · 2 comments

tilo commented

Some CSV files start with a Byte Order Mark (BOM):
https://en.wikipedia.org/wiki/Byte_order_mark

e.g.

$ hexdump -C /tmp/sample.csv
00000000  ef bb bf 75 73 65 72 5f  69 64 2c 74 79 70 65 2c  |...user_id,type,|
00000010  6d 65 74 61 6c 5f 70 69  64 0d 0a 34 33 32 31 30  |metal_pid..43210|
00000020  38 30 35 2c 72 65 69 73  73 75 65 2c 31 32 33 34  |805,reissue,1234|

The first 3 bytes (ef bb bf) are the UTF-8 BOM, not part of the first header name, and should be ignored when reading the file.

BOM byte sequences by encoding:

* UTF-8 with BOM: EF BB BF
* UTF-16BE (big-endian): FE FF
* UTF-16LE (little-endian): FF FE
* UTF-32BE (big-endian): 00 00 FE FF
* UTF-32LE (little-endian): FF FE 00 00
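
The list above can be turned into a small detection routine. The following is only a sketch: `BOMS`, `detect_bom` and `strip_bom` are hypothetical names for illustration, not part of smarter_csv, and this is not necessarily how the gem handles it. Note that the UTF-32LE signature begins with the UTF-16LE one, so the longer signatures have to be checked first.

```ruby
# Hypothetical helpers (not smarter_csv API): BOM signatures from the list above.
# Longer signatures come first because FF FE 00 00 (UTF-32LE) starts with
# FF FE (UTF-16LE).
BOMS = [
  ["UTF-32BE", "\x00\x00\xFE\xFF".b],
  ["UTF-32LE", "\xFF\xFE\x00\x00".b],
  ["UTF-8",    "\xEF\xBB\xBF".b],
  ["UTF-16BE", "\xFE\xFF".b],
  ["UTF-16LE", "\xFF\xFE".b],
].freeze

# Returns [encoding_name, bom_length_in_bytes], or nil if no BOM is present.
def detect_bom(bytes)
  head = bytes.b[0, 4] # compare raw bytes, regardless of the string's encoding
  BOMS.each do |name, signature|
    return [name, signature.bytesize] if head.start_with?(signature)
  end
  nil
end

# Returns the input as binary data with any leading BOM removed.
def strip_bom(bytes)
  _name, length = detect_bom(bytes)
  length ? bytes.b[length..-1] : bytes
end

sample = "\xEF\xBB\xBFuser_id,type,metal_pid\r\n".b
p detect_bom(sample) # => ["UTF-8", 3]
p strip_bom(sample)  # => "user_id,type,metal_pid\r\n"
```

One caveat of this approach: a UTF-16LE file whose first character is U+0000 would be reported as UTF-32LE, since the leading bytes are identical.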

tilo commented

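HINT: a leading BOM is typically written by some Microsoft tools.
A workaround is to run dos2unix filename on the file; current versions remove the BOM by default and also convert CRLF line endings to LF.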
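
As an alternative to preprocessing the file, Ruby itself can skip a UTF-8 BOM when the file is opened with the "r:bom|utf-8" mode. The sketch below recreates the sample from the hexdump in a temp file and compares both modes. `first_line` is a hypothetical helper, and this only illustrates the Ruby feature; it is not a claim about how the gem handles it.

```ruby
require "tempfile"

# Hypothetical helper: return the first line of a file opened with the given mode.
def first_line(path, mode)
  File.open(path, mode) { |io| io.gets }
end

# Write the start of the sample from the hexdump (BOM + header + CRLF).
sample = Tempfile.new(["sample", ".csv"])
sample.binmode
sample.write("\xEF\xBB\xBFuser_id,type,metal_pid\r\n".b)
sample.close

raw   = first_line(sample.path, "r:utf-8")     # BOM stays glued to the first header
clean = first_line(sample.path, "r:bom|utf-8") # BOM is consumed when the file is opened

puts raw.start_with?("\uFEFF")    # true: the first header would be "\uFEFFuser_id"
puts clean.start_with?("user_id") # true
```

With the plain "r:utf-8" mode the first column name silently carries an invisible U+FEFF character, which is why lookups by "user_id" fail on such files.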

tilo commented

Fixed in #220.