Remove byte-order markers from CSV files?
robinhouston opened this issue · 9 comments
I’ve noticed that Excel now saves UTF-8 CSV files with a BOM. (I’m using Microsoft Excel for Mac version 15.33, saving in “CSV UTF-8” format.)
When such files are parsed with csvParse, the key corresponding to the first column has a zero-width non-breaking space as its first character, which leads to a situation where d["keyName"] is undefined even though keyName appears when you print out d!
I’m not sure whether you think this should be addressed in the parser; if not, I think it should at least be documented.
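A minimal sketch of the symptom, for anyone trying to reproduce it without the attachment. It does not call d3-dsv; the header splitting below is a simplified stand-in for what a CSV parser does with the first line, and the sample data (`keyName`, `value`, `foo`) is made up for illustration.

```javascript
// Simplified stand-in for a CSV parser's header handling (NOT d3-dsv itself):
// split the first line into column names and build one row object from the second.
const BOM = "\uFEFF"; // zero-width non-breaking space, used as the byte-order mark
const text = BOM + "keyName,value\nfoo,1\n";

const [headerLine, firstRow] = text.split("\n");
const columns = headerLine.split(",");
const cells = firstRow.split(",");
const d = Object.fromEntries(columns.map((name, i) => [name, cells[i]]));

console.log(d["keyName"]);       // undefined: the real key has U+FEFF in front
console.log(d[BOM + "keyName"]); // "foo"
console.log(columns[0].length);  // 8, one more than "keyName".length
```

Because U+FEFF is invisible in most terminals, the first key looks correct when printed even though it is one character longer than expected.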
Can you attach an example file I can use for testing purposes?
Sure! GitHub won’t let me attach a .csv file, so I’ve zipped it.
Workbook1.csv.zip
As TXT: Workbook1.txt
Interestingly if you use FileReader.readAsText, it automatically strips the BOM bytes for you, per the Encoding specification.
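The same Encoding specification defines TextDecoder, which makes the behavior easy to see outside a browser file picker. This is only a sketch using TextDecoder rather than FileReader itself, with a hand-written byte array standing in for the file contents.

```javascript
// UTF-8 bytes for a BOM (EF BB BF) followed by "a,b".
const bytes = new Uint8Array([0xef, 0xbb, 0xbf, 0x61, 0x2c, 0x62]);

// Default (ignoreBOM: false): the decoder consumes the BOM.
const stripped = new TextDecoder("utf-8").decode(bytes);

// ignoreBOM: true keeps U+FEFF as the first character of the result.
const kept = new TextDecoder("utf-8", { ignoreBOM: true }).decode(bytes);

console.log(stripped.length);               // 3
console.log(kept.length);                   // 4
console.log(kept.charCodeAt(0) === 0xfeff); // true
```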
Seems like XMLHttpRequest and Fetch also automatically strip the BOM. Here’s a CORS-accessible URL I tested:
So my question is how are you getting a string with the BOM still in it? It seems like the BOM stripping should happen earlier, before it gets to d3-dsv.
Sorry, I should have included a complete repro. I’m getting this in node, by fs.readFile(filename, "utf8", …). It looks as though the node developers have decided against stripping BOMs by default.
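A possible app-side workaround, sketched under some assumptions: `stripBom` is a hypothetical helper name, and the Buffer below simulates the file bytes instead of reading a real file. Converting it with `toString("utf8")` should be the same Buffer-to-string decoding that happens when an encoding is passed to fs.readFile.

```javascript
// Hypothetical helper (the name is an assumption): drop one leading U+FEFF if present.
function stripBom(text) {
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

// Simulated file contents: the UTF-8 BOM bytes followed by made-up CSV data.
const fileBytes = Buffer.concat([
  Buffer.from([0xef, 0xbb, 0xbf]),
  Buffer.from("keyName,value\nfoo,1\n", "utf8"),
]);

// Node's UTF-8 decoding leaves the BOM in the resulting string.
const raw = fileBytes.toString("utf8");
const clean = stripBom(raw);

console.log(raw.charCodeAt(0) === 0xfeff); // true
console.log(clean.startsWith("keyName"));  // true
```

The cleaned string can then be handed to the CSV parser, so the first column name comes out without the invisible prefix.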
It’s okay if I should handle this in the app: I just thought I should flag it.
Okay. I’m going to close this issue. If you want to submit a pull request with an edit to the README suggesting that Node users use strip-bom, that would be 💯.
Great, will do!
Gah this just got me too. Could we consider adding it directly to d3-dsv? I think the code is considerably shorter than the comment in the README, plus I wasted a good ten minutes, thanks Excel!