Double quotes in double-quoted values not handled correctly
mhassan1 opened this issue · 10 comments
Double quotes inside double-quoted values cause the parser to stop parsing at the row containing the double quote. When I run this:
printf 'name,age\njoe,40\n"will"iam",30\n"sam",35\n"jan",25' | csv-parser
-------------------------------^ there's a double quote in here
I get this:
{"name":"joe","age":"40"}
{"name":"will\"iam","age":"30\n\"sam\""}
It escaped the double quote nicely, but the "age" field is wrong, and the third and fourth rows were never parsed at all.
Confirmed this is still an issue on the latest version. Will investigate, but a PR is welcome.
I've tried creating a fix for this: https://github.com/czaefferer/csv-parser/tree/patch-issue-70
With the files I've tested it seems to work, but it should be double-checked. I've changed the rules for starting and stopping quoted text. To start quoted text, an unescaped quote must be the very first character of the record or come directly after a separator. To stop quoted text, an unescaped quote must be the very last character of the record or be followed by a separator. All other quotes are handled like regular characters. This also means that a row like a; "b" would result in the values 'a' and ' \"b\"', since a space is the first character of the second cell and the quote after it no longer starts quoted text. So I'm not completely sure what the side effects of this change are, or how bad they might be.
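For readers who want to see those start/stop rules in isolation, here is a minimal sketch. It is not the code from the patch branch: `splitRecord` is a hypothetical helper, it handles a single record with no line breaks, and it assumes the escape character is the quote itself (a doubled quote inside quoted text).

```javascript
// Hypothetical helper, not part of csv-parser or the linked patch.
// Splits ONE record (no line breaks) into cells using the rules above.
function splitRecord(record, separator = ',', quote = '"') {
  const cells = [];
  let cell = '';
  let quoted = false;
  let i = 0;

  while (i < record.length) {
    const ch = record[i];
    const next = record[i + 1]; // undefined at the end of the record

    if (quoted) {
      if (ch === quote && next === quote) {
        // Assumption: a doubled quote is an escaped quote.
        cell += quote;
        i += 2;
      } else if (ch === quote && (next === undefined || next === separator)) {
        // A quote closes the cell only at end of record or before a separator.
        quoted = false;
        i += 1;
      } else {
        // Any other character, including a stray quote, is literal.
        cell += ch;
        i += 1;
      }
    } else if (ch === separator) {
      cells.push(cell);
      cell = '';
      i += 1;
    } else if (ch === quote && (i === 0 || record[i - 1] === separator)) {
      // A quote opens a cell only at record start or right after a separator.
      quoted = true;
      i += 1;
    } else {
      cell += ch;
      i += 1;
    }
  }

  cells.push(cell);
  return cells;
}

console.log(splitRecord('"will"iam",30')); // → [ 'will"iam', '30' ]
console.log(splitRecord('a; "b"', ';'));   // → [ 'a', ' "b"' ]
```

The second call reproduces the side effect described above: the leading space prevents the quote from opening quoted text, so the quotes stay in the value.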
This is probably related: here's another bug where parsing the same content twice (with some escaped quotes) surprisingly gives different results: https://repl.it/@caub/tsv
fast-csv works fine on this example
foo\tbar\tY\t"wat2\t""ok"""\tsomething\n
is first parsed correctly as
Row {
id: 'foo',
t: 'bar',
senderType: 'Y',
msg: 'wat2\t"ok"',
agentId: 'something' },
then the second time as
Row {
id: 'foo',
t: 'bar',
senderType: 'Y',
msg: '"wat2\t"ok""\tsomething',
agentId: undefined },
Even for the following data, it is not working as expected.
Input data:
email
"mathi@sf.com,mathi@ki.com,mathi@co.com"
mathi@sf.com
I expect three rows, but I get only two rows with the option {headers:false}.
Output data:
email
mathi@sf.com,mathi@ki.com,mathi@co.com\r\nmathi@sf.com
Similar issue here:
<tag attr="val"/>a</tag>,b,c
@bittlingmayer your situation is an easy fix: quote the field and escape the inner quotes.
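As a rough illustration of that advice, the sketch below quotes a field and doubles any quotes inside it before the row is written. `quoteField` is a hypothetical helper, and it assumes the reader treats a doubled quote as an escaped quote.

```javascript
// Hypothetical helper: wrap a value in quotes and double any inner quotes.
function quoteField(value, quote = '"') {
  return quote + String(value).split(quote).join(quote + quote) + quote;
}

// Build one CSV row from raw values.
const fields = ['<tag attr="val"/>a</tag>', 'b', 'c'];
const row = fields.map((field) => quoteField(field)).join(',');

console.log(row); // → "<tag attr=""val""/>a</tag>","b","c"
```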
Just for information, my fix doesn't work correctly in streaming mode. I guess sometimes the \r\n gets split into two chunks and the line break isn't detected, and two rows get merged into one again (or something like that). I won't investigate further though, I'm using "csv-parse" now, which is slightly slower, but appears to be much more resilient.
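If the guess about \r\n being split across chunks is right, the usual remedy is to hold back the unfinished tail of each chunk until the next one arrives. The sketch below shows only that idea. `LineSplitter` is a hypothetical class, not csv-parser code, and it deliberately ignores line breaks inside quoted cells.

```javascript
// Hypothetical helper: splits streamed text into lines and tolerates
// a \r\n pair that is split across two chunks.
class LineSplitter {
  constructor() {
    this.pending = ''; // unfinished tail of the previous chunk
  }

  // Feed one chunk; returns the lines completed by it.
  push(chunk) {
    const parts = (this.pending + chunk).split('\n');
    // The last part has no terminating \n yet, so keep it for later.
    this.pending = parts.pop();
    // Lines are only cut at \n, so a trailing \r is always still attached here.
    return parts.map((line) => (line.endsWith('\r') ? line.slice(0, -1) : line));
  }

  // Call once at end of input to get a final line without a line break.
  flush() {
    const rest = this.pending;
    this.pending = '';
    return rest === '' ? [] : [rest];
  }
}

const splitter = new LineSplitter();
console.log(splitter.push('joe,40\r'));       // → []
console.log(splitter.push('\nsam,35\r\njan')); // → [ 'joe,40', 'sam,35' ]
console.log(splitter.flush());                // → [ 'jan' ]
```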
The issue seems to be with this line... but if we change this condition we break other test cases. Will probably attempt to fix it
if (!this.state.quoted) { //Line no 239 }
After configuring the parser with the following parameters, it handles double quotes ("") correctly.
.pipe(csv({
separator: ',',
quote: '"',
escape: '\\',
newline: '\n',
strict: true,
skipLines: 0,
skipRows: 0,
}))
We have the following code snippet:
```js
const stream = require('stream');
const csv = require('csv-parser');

const HIGHEWATERMARK = 20 * 1024 * 1024;

// Pipe the input stream into a first parser and log every row.
async function streamFileOnDockerFS(inputStream, fileName) {
  inputStream.pipe(csv().on('data', (data) => {
    console.log('in file 1');
    console.log(data);
  }));
}

// Pipe the same input stream into a second parser and log every row.
async function streamFileOnDockerFS2(inputStream, fileName) {
  inputStream.pipe(csv().on('data', (data) => {
    console.log('in file 2');
    console.log(data);
  }));
}

(async () => {
  const currentCSVStream = new stream.PassThrough({ highWaterMark: HIGHEWATERMARK });
  const header = '"promoname"';
  const outRow = '"dummy""promo,name""dummy"';
  currentCSVStream.push(`${header}\n`);
  for (let i = 0; i <= 5; i += 1) {
    currentCSVStream.push(`${outRow}\n`);
  }
  currentCSVStream.push(null);
  const prm1 = streamFileOnDockerFS(currentCSVStream);
  const prm2 = streamFileOnDockerFS2(currentCSVStream);
  const promises = [prm1, prm2];
  await Promise.allSettled(promises);
})();
```
The output of the above snippet is as follows:
in file 1
{ promoname: 'dummy"promo,name"dummy' }
in file 2
{ promoname: 'dummy"promo,name"dummymy' }
in file 1
{ promoname: 'dummy"promo,name"dummy' }
in file 2
repetitions...
The output seems to be incorrect for file2
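One possible explanation, not confirmed anywhere in this thread, is that both parsers receive the same Buffer object and the first one rewrites it while removing the doubled quotes. The sketch below does not use csv-parser at all. `unescapeInPlace` is a hypothetical stand-in that compacts doubled quotes inside the buffer it is given, and whether the library really works this way is an assumption.

```javascript
// Hypothetical stand-in for a parser that unescapes doubled quotes IN PLACE.
// It overwrites the start of the buffer and returns only the compacted part.
function unescapeInPlace(buf) {
  const QUOTE = 0x22; // '"'
  let write = 0;
  for (let read = 0; read < buf.length; read++) {
    if (buf[read] === QUOTE && buf[read + 1] === QUOTE) {
      read++; // skip the first quote of a doubled pair
    }
    buf[write++] = buf[read];
  }
  return buf.toString('utf8', 0, write);
}

const original = Buffer.from('dummy""promo,name""dummy');

// Two consumers that share one Buffer, as two pipe() targets would.
const shared = Buffer.from(original);
const first = unescapeInPlace(shared);
const second = unescapeInPlace(shared); // sees the bytes the first call rewrote

// A consumer that gets its own copy is unaffected.
const isolated = unescapeInPlace(Buffer.from(original));

console.log(first);    // → dummy"promo,name"dummy
console.log(second);   // → dummy"promo,name"dummymy
console.log(isolated); // → dummy"promo,name"dummy
```

Under that assumption, the stray "my" is simply the leftover tail of the original bytes, and giving each parser its own copy of every chunk (for example with `Buffer.from(chunk)`) would avoid it.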