mholt/PapaParse

ReadableStreamStreamer sometimes breaks UTF8-characters to ��

Opened this issue · 13 comments

jehna commented

Issue

If I pass a CSV stream containing special characters to Papa.parse, it sometimes breaks those characters so they show up as e.g. ��.

How to reproduce

See example at:
https://repl.it/repls/ArcticTreasuredLine

Press "Run" at the top of the page

What should happen?

The output should only contain the ä character

What happened?

There are random occurrences of ��

Root cause

These two lines are responsible for this issue:

queue.push(typeof chunk === 'string' ? chunk : chunk.toString(this._config.encoding));

var aggregate = this._partialLine + chunk;

So when papaparse reads a chunk, it directly calls .toString on that chunk.

However, a chunk consists of bytes, and some UTF-8 characters are two bytes long:

  • ä consists of two bytes: 11000011 and 10100100
  • a (and other "regular" characters) is just one byte 01100001

Now if the chunk boundary falls in the middle of a multi-byte character like ä, papaparse calls toString on each part of the character separately and produces two garbage characters:

11000011 (from the end of the first chunk) transforms to �
10100100 (from the start of the second chunk) transforms to �
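
A quick Node snippet (my own, not from the original report) demonstrating the effect when the two bytes of ä end up in different chunks:

const buf = Buffer.from('aä', 'utf8');   // bytes: 61 c3 a4
const first = buf.subarray(0, 2);        // 61 c3 – chunk ends mid-character
const second = buf.subarray(2);          // a4    – next chunk starts mid-character

console.log(first.toString('utf8'));     // "a�" – the lone c3 byte is invalid UTF-8
console.log(second.toString('utf8'));    // "�"  – the lone a4 byte is invalid UTF-8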

How to fix this issue?

If the received chunks are byte buffers, the concatenation should be done on bytes too, e.g. using Buffer.concat. Papaparse should not call toString before it has split the stream into lines, so _partialLine remains a Buffer rather than a string.
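
A minimal sketch of that idea (my own code, not PapaParse's actual implementation; it assumes chunk is a Node Buffer, this._partialLine starts as an empty Buffer, the encoding is UTF-8, and lines end with \n):

// Sketch only – keep leftover bytes as a Buffer and decode whole lines only.
var aggregate = Buffer.concat([this._partialLine, chunk]);
var lastNewline = aggregate.lastIndexOf(0x0a);            // '\n' is never part of a multi-byte UTF-8 character
if (lastNewline === -1) {
  this._partialLine = aggregate;                          // no complete line yet – keep the raw bytes
} else {
  var completeLines = aggregate.slice(0, lastNewline + 1);
  this._partialLine = aggregate.slice(lastNewline + 1);   // may end mid-character, which is fine for bytes
  queue.push(completeLines.toString(this._config.encoding));
}

Node's built-in string_decoder module (StringDecoder) would be another way to achieve the same thing, since it buffers incomplete multi-byte sequences between write() calls.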

Hi @jehna,

Thanks for reporting the issue.

What you propose makes sense to me. It would be great if you could provide a PR that adds a test case for this issue and fixes it.

jehna commented

Workaround

I implemented a workaround for this issue: Use another library to pre-parse the lines as a stream.

I'm using the delimiter-stream NPM package, which seems to have a correct implementation of line parsing as a byte stream:

https://github.com/peterhaldbaek/delimiter-stream/blob/043346af778986d63a7ba0f87b94c3df0bc425d4/delimiter-stream.js#L46

Using this library you can write a simple wrapper around your stream:

const DelimiterStream = require('delimiter-stream')

const toLineDelimitedStream = input => {
  // Two-byte UTF-8 characters (such as "ä") can break because the chunk might get
  // split in the middle of the character, and papaparse parses the byte stream
  // incorrectly. We can use `DelimiterStream` to fix this, as it splits the
  // chunks into lines correctly before passing the data to papaparse.
  const output = new DelimiterStream()
  input.pipe(output)
  return output
}

Using this helper function you can wrap the stream before passing it to Papa.parse:

Papa.parse(toLineDelimitedStream(stream), {
   ...
})
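
For reference, a complete call could look like this (the file name and config options here are placeholders of mine, not from the original comment):

const fs = require('fs')

const stream = fs.createReadStream('data.csv')
Papa.parse(toLineDelimitedStream(stream), {
  delimiter: ',',
  step: row => console.log(row.data),
  complete: () => console.log('done')
})
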
jehna commented

Hey @pokoli

It would be great if you could provide a PR that adds a test case for this issue and fixes it.

I'll have to see if I can find some time to spend on this. Not making any promises yet 😄

@jehna How would this workaround look for a browser implementation, where instead of a stream I'm actually passing the file from a file input field to the Papa.parse function?

jehna commented

@nichgalea if you don't mind the extra memory usage, I believe you can call .text() on the file and pass it to Papa.parse as a string.

So something like:

Papa.parse(await file.text());
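
A slightly fuller example (the input id and the header option are just placeholders of mine):

document.querySelector('#csv-input').addEventListener('change', async event => {
  const file = event.target.files[0];
  const text = await file.text();              // decodes the whole file as UTF-8 in one go
  const result = Papa.parse(text, { header: true });
  console.log(result.data);
});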

Hi @jehna @pokoli
Has the buffer logic been added to the library?

I believe the issue still exists: the code responsible for the issue has remained untouched.


I am handling CSV and Excel files with large chunks. What approach should I use?

@jehna
I used the logic below:

const chardet = require('chardet');
const fs = require('fs');
const iconv = require('iconv-lite');

const encoding = chardet.detectFileSync(myFile);
const fileContent = fs.readFileSync(myFile);
const utf8Content = iconv.decode(fileContent, encoding);

Passing this to Papa.parse now solves the issue. I think this is the best and fastest solution.
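
For completeness, the decoded string can then be parsed directly (the options here are placeholders):

const result = Papa.parse(utf8Content, { header: true, skipEmptyLines: true });
console.log(result.data);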

Yep! As long as your CSV is small enough to fit in memory, you can treat it as one big string (like you did).
If you need to support huge CSV files on a machine with limited memory, then streaming is the only option and you'll need to use the DelimiterStream workaround.

Hi,

I ran into the same issue. It only occurs with big files, and only if you process them via the browser (tried Firefox and Chromium). If you need a test case, you can try the HTML code attached. To create the input file, simply execute the following in a Unix shell:

echo "Bereich;ID;Kuerzel;Indikator;Raumbezug;Kennziffer;Name;Zeitbezug;Wert" > testData.csv
for i in `seq 1 40000`; do echo "ÄÄÄÄÄÄÄÄÄÄÄ;ÖÖÖÖÖ;ÜÜÜÜÜÜÜÜÜ;ääääääääääääääääääääääääääääääääääää;öööööööööööööööööööööööööööööööööööööööööö;üüüüüüüüü;ßßßßßßßßßßß;2010;6,66" >> testData.csv; done

Then try to import the file testData.csv using the simple HTML page in a browser. For me it reports an error at line 39870. Not sure if this is the same for each OS and browser. Maybe you can try more than 40000 lines in case no error is reported.

I think this is a no-go for a library claiming to be reliable and correct. I really like the small footprint and the lack of dependencies and was very happy that I found it. I like the interface, good work! But you should fix this issue (I can't believe it has been ignored for such a long time).

Cheers, Zwmann


I have worked around it by using DelimiterStream, as in the use case above.

Hi caoxing9,

I saw that there seems to be a solution using DelimiterStream, and it's good that you worked it out! Thank you for that!
Nevertheless, this bug is more than four years old and it's serious! I think the fix should make its way into the code. I don't like to patch libraries, and I wonder why there hasn't been a fix in such a long time.

Cheers, Zwmann