ReadableStreamStreamer sometimes breaks UTF-8 characters into ��
Issue
If I pass a CSV stream that contains special characters to Papa.parse, it sometimes breaks those characters so they show up as e.g. ��.
How to reproduce
See example at:
https://repl.it/repls/ArcticTreasuredLine
Press "Run" at the top of the page
What should happen?
The output should only contain the ä character.
What happened?
There are random occurrences of ��.
Root cause
These two lines are responsible for this issue:
Line 863 in ae73d2a
Line 506 in ae73d2a
When papaparse reads a chunk, it directly calls .toString() on that chunk.
However, a chunk consists of bytes, and some UTF-8 characters are two bytes long:
ä consists of two bytes: 11000011 and 10100100
a (and other "regular" ASCII characters) is just one byte: 01100001
Now if the chunk boundary falls in the middle of a multi-byte character like ä, papaparse calls toString on each part of the character separately and produces two replacement characters:
11000011 (from the end of the first chunk) becomes �
10100100 (from the start of the second chunk) becomes �
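For illustration, here is a minimal Node.js snippet (not papaparse code) that reproduces the effect of decoding the two halves separately versus concatenating the bytes first:

// "ä" is encoded as two bytes in UTF-8
const bytes = Buffer.from('ä', 'utf8');   // <Buffer c3 a4>
const first = bytes.subarray(0, 1);       // end of the first chunk
const second = bytes.subarray(1);         // start of the second chunk

console.log(first.toString('utf8') + second.toString('utf8')); // '��'
console.log(Buffer.concat([first, second]).toString('utf8'));  // 'ä'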
How to fix this issue?
If the received chunks are bytes, the concatenation should also be done in bytes, e.g. using Buffer.concat. Papaparse should not call toString before it has split the stream into lines, so that _partialLine remains a buffer rather than a string.
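A rough sketch of this idea (not the actual papaparse internals; names are illustrative): keep the leftover bytes in a buffer and only decode up to the last complete line.

let _partialLine = Buffer.alloc(0);

// Returns the decoded complete lines and keeps the trailing partial line
// (which may end in the middle of a multi-byte character) as raw bytes.
function decodeChunk(chunk) {
  const data = Buffer.concat([_partialLine, chunk]);
  const lastNewline = data.lastIndexOf(0x0a); // '\n'
  if (lastNewline === -1) {
    _partialLine = data; // no complete line yet
    return '';
  }
  _partialLine = data.subarray(lastNewline + 1);
  return data.subarray(0, lastNewline + 1).toString('utf8');
}

Alternatively, Node's built-in string_decoder module (StringDecoder) already buffers incomplete multi-byte sequences across chunk boundaries and could be used instead of tracking the bytes manually.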
Hi @jehna,
Thanks for reporting the issue.
What you propose makes sense to me. It will be great if you can provide a PR that adds a test case for this issue and fixes it.
Workaround
I implemented a workaround for this issue: use another library to pre-split the stream into lines.
I'm using the delimiter-stream NPM package, which seems to have a correct implementation of splitting a byte stream into lines.
Using this library, you can write a simple wrapper around your stream:
const DelimiterStream = require('delimiter-stream');

const toLineDelimitedStream = input => {
  // Two-byte UTF-8 characters (such as "ä") can break because a chunk might get
  // split in the middle of the character, and papaparse decodes the byte stream
  // incorrectly. DelimiterStream fixes this, as it splits the chunks into lines
  // correctly before passing the data to papaparse.
  const output = new DelimiterStream();
  input.pipe(output);
  return output;
};
Using this helper function, you can wrap the stream before passing it to Papa.parse:
Papa.parse(toLineDelimitedStream(stream), {
...
})
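For completeness, a fuller usage sketch (assuming a Node.js file read stream; the file name data.csv and the config options are just illustrative):

const fs = require('fs');
const Papa = require('papaparse');

const stream = fs.createReadStream('data.csv');
Papa.parse(toLineDelimitedStream(stream), {
  step: row => console.log(row.data),   // called once per parsed row
  complete: () => console.log('done'),
});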
Hey @pokoli
It will be great if you can provide a PR that adds a test case for this issue and fixes it.
I'll have to see if I can find some time to spend on this. Not making any promises yet 😄
@jehna What would this workaround look like for a browser implementation, where instead of the stream I'm passing a file from a file input field to the Papa.parse function?
@nichgalea if you don't mind the extra memory usage, I believe you can call .text() on the file and pass it to Papa.parse as a string.
So something like:
Papa.parse(await file.text());
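For example, a minimal sketch of wiring this up to a file input in the browser (assuming the file is UTF-8 encoded; the selector and parse options are illustrative):

document.querySelector('input[type=file]').addEventListener('change', async event => {
  const file = event.target.files[0];
  const text = await file.text();               // reads the whole file into memory as a string
  const result = Papa.parse(text, { header: true });
  console.log(result.data);
});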
I believe the issue still exists: the code responsible for the issue has remained untouched.
I am handling CSV and Excel files with large chunks. What approach should I use?
@jehna I used the logic below:
const chardet = require('chardet');     // character-set detection
const fs = require('fs');
const iconv = require('iconv-lite');    // assuming iconv here is the iconv-lite package
const encoding = chardet.detectFileSync(myFile);
const fileContent = fs.readFileSync(myFile);
const utf8Content = iconv.decode(fileContent, encoding);
Now passing this to Papa.parse solves the issue. I think this is the best and fastest solution.
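For example (the header option is just illustrative):

const results = Papa.parse(utf8Content, { header: true });
console.log(results.data);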
Yep! As long as your CSV is small enough to fit in memory, you should treat it as one big string (like you did).
If you need to support huge CSV files on a machine with limited memory, then streaming is the only option and you'll need the DelimiterStream workaround.
Hi,
ran into the same issue. It only occurs with big files, and only if you process them in a browser (tried Firefox and Chromium). If you need a test case, you can try the HTML code attached. To create the input file, simply execute the following in a Unix shell:
echo "Bereich;ID;Kuerzel;Indikator;Raumbezug;Kennziffer;Name;Zeitbezug;Wert" > testData.csv
for i in `seq 1 40000`; do echo "ÄÄÄÄÄÄÄÄÄÄÄ;ÖÖÖÖÖ;ÜÜÜÜÜÜÜÜÜ;ääääääääääääääääääääääääääääääääääää;öööööööööööööööööööööööööööööööööööööööööö;üüüüüüüüü;ßßßßßßßßßßß;2010;6,66" >> testData.csv; done
Then try to import testData.csv using the simple HTML page in a browser. For me it reports an error in line 39870. I'm not sure if this is the same for each OS and browser; maybe try more than 40000 lines in case no error is reported.
I think this is a no-go for a library claiming to be reliable and correct. I really like the small footprint and the lack of dependencies, and was very happy that I found it. I like the interface, good work! But you should fix this issue (can't believe it has been ignored for such a long time).
Cheers, Zwmann
I worked around it by using DelimiterStream: use case
Hi caoxing9,
I saw that there seems to be a solution using DelimiterStream, and it's good that you worked it out! Thank you for that!
Nevertheless, this bug is more than 4 years old and it's serious! So I think the fix should make its way into the code. I do not like patching libraries, and I wonder why there hasn't been a fix in such a long time.
Cheers, Zwmann