Encoding issue when input is non-ASCII
Does it work if you copy-and-paste the data in?
Well, my data is comma-separated, and the paste operation pastes the entire line into the first cell (without splitting at the commas). But yes, the ö I tried pasting came through correctly. It must be related to reading the file; I spent five minutes looking for the code but didn't find it. The original Mail Merge has a charset selector. It's possible that the file reader assumed ANSI encoding, i.e. windows-1252 in my case.
(OT: From your name I thought you were from a Nordic/Scandinavian country, where they have plenty of non-ASCII characters in their names, but it looks like you're from Canada.)
Spreadsheets are processed here:
I'm not familiar with any of the code. However, lines 17 and 18 read:
```js
// parse an array containing raw bytes from a spreadsheet of some
// format. XLSX will auto-detect the format
```
Reading "raw" bytes from a file into a "binary" JS string doesn't give a valid JS (UTF-16) string. Apparently that's exactly that happened. The accented characters î ö and ú are represented as two-byte codepoints in UTF-8. If you copy those two bytes into UTF-16 characters in the JS string, you get exactly the 6 characters you can see in the initial comment.
Where does the reading from the file happen, and where is `parseSpreadsheet(data)` called? Or how is `data` filled?
If your raw bytes contained CJK (Chinese, Japanese or Korean), you'd get total mojibake. So any "raw" bytes you read from a file need to be interpreted as some encoding. Given that UTF-8 is very common, you could just assume everything is UTF-8 and convert the UTF-8 bytes to the JS string (UTF-16). If that fails, well, the input wasn't UTF-8. There are two more options: 1) automatic charset detection (which might yield the wrong result, since it applies heuristics), or 2) a selector like the original Mail Merge has.
Have you tried with some CJK input at all? Here's some: テストテストテスト
Let me know if you need code snippets for charset conversion.
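For example, something along these lines (a sketch, assuming the raw bytes are available as a `Uint8Array`; `bytesToString` is just an illustrative name):

```js
function bytesToString(bytes) {
  try {
    // fatal: true makes decode() throw on invalid UTF-8 instead of
    // silently substituting replacement characters
    return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
  } catch (e) {
    // Input wasn't valid UTF-8 -- fall back to a legacy encoding
    // (or run charset detection / ask the user at this point)
    return new TextDecoder('windows-1252').decode(bytes);
  }
}
```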
Could you please attach a CSV I could test on?
Okay. I just released version 2.6.0, which hopefully fixes the issue!
However, if you open the attached file, which is encoded in ANSI/Windows-1252, you get this:
The high-bit characters, which are not valid UTF-8, are displayed as the Unicode replacement character (�), see here:
https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder/fatal
IMHO it would be good to show some sort of error to the user, like: "Input is not encoded in UTF-8."
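Something like this could do it (a sketch; `showError` stands in for whatever the UI actually uses to report problems):

```js
let text;
try {
  text = new TextDecoder('utf-8', { fatal: true }).decode(bytes);
} catch (e) {
  // decode() throws a TypeError on invalid UTF-8 when fatal is set
  showError('Input is not encoded in UTF-8.');
}
```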
Okay. I have added https://www.npmjs.com/package/jschardet to auto-detect the encoding. Unfortunately, it detects your Windows-1252 as ISO-8859-7, and so some characters are not correct. I think it's an improvement though.
I don't know how to detect if there is an "error" in the encoding.
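One option is jschardet's confidence score: treat a low-confidence detection as a likely error and fall back (or warn the user). A sketch, assuming `bytes` is a `Uint8Array`; the threshold value is a guess that would need tuning:

```js
const jschardet = require('jschardet');

// jschardet expects a "binary string" (one char per byte)
const binary = Array.from(bytes, (b) => String.fromCharCode(b)).join('');
const { encoding, confidence } = jschardet.detect(binary);

// Trust the detector only above a threshold; otherwise assume UTF-8.
// The 0.9 cut-off is illustrative, not something I've tested.
const label = encoding && confidence > 0.9 ? encoding : 'utf-8';
const text = new TextDecoder(label).decode(bytes);
```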
Thanks. A lot could be written about charset detection. There are various libraries; they all use heuristics, i.e. the likelihood of certain characters occurring together in a text. That naturally works better on a larger text than on a small snippet or a single name. `Móüse` makes no sense, since it mixes a character used in Spanish (ó) with one used in German (ü, which also exists in Spanish, but not in that combination). So the heuristic gets tripped up. I'm sure the confidence wasn't very high.
Works OK with this file, also in Windows-1252:
May I suggest that you display the detected charset somewhere in the UI?