Lib does not work if cell contains \n characters
Opened this issue · 5 comments
Hello,
I have integrated your library into an application that I am working on and what I have found is that if a cell has text in it that is spread into multiple lines, then the application will crash. :(
Is there any hope for a fix?
A possible solution for this, that I have thought of is that, if a cell starts with a quote but it isn't found by the end of the line, then more lines should be read until the pair is found. I haven't managed to write a fix attempt yet, as I haven't looked into your code enough.
Kind regards,
Daniel
Hi
I have the same problem. Cells with double quoted strings that contain \n inside are parsed incorrectly.
id, text
1, "abc
efg"
I have fixed this localy and it seems to be working.
Problem is in function LineReader::next_line, line 465, link.
Replace original block, lines 465-468 with this:
int line_end = data_begin;
bool is_in_string = false;
bool has_quote = false;
while(line_end != data_end){
if (buffer[line_end] == '\"') {
if (is_in_string)
has_quote = !has_quote;
else
is_in_string = true;
}
else if (buffer[line_end] == '\n') {
if (!is_in_string)
break;
}
else {
if (is_in_string && has_quote) {
is_in_string = false;
has_quote = false;
}
}
++line_end;
}
This code does not consider other quote_policies. This needs to be done correctly.
I'm using csv reader with double_qoute_escape policy like this:
CSVReader<2, trim_chars<>, double_quote_escape<',', '\"'>> csv(file);
and everything works for me.
The issue with \n in quoted strings has been raised a lot in the past. This is known. The summary of the problems are:
- The newline handling is in LineReader, which is not parameterizable. This make it difficult to do changes in a clean and backward compatible way.
- It is unclear how to handle "\n" vs "\r\n" newlines. Do we want automatically translate these? Some people want to, others not. There a lot of corner cases that can lead to unexpected behavior. Not handling them at all does at least not silently do the wrong thing.
- Error message generation gets a lot more complex because of possibly run-away problems if a quote is missing.
- My personally opinion is that CSV is a text file and text files should be readable in a text editor. If a column spans multiple lines then this gets hard to read as records can no longer be copied line-wise. My opinion is thus that you are abusing the format, if you have literal newlines. The solution is to escape newlines.
Up to now I have not yet seen a good solution.
We are processing many csv files downloaded from various webservices and/or generated by tools.
The \n inside the csv string is absolutely common, with no escaping. I think, the only escape used in csv files, is double double-quote "" to escape double-quote.
The problem of runaway due to missing closing double-quote is not a problem of parser, it is problem of generator, this should not be corrected by the parser.
Solving this by "correcting" csv before reading by escaping all multi-line records with some escape and then after reading unescape all escaped new lines is fairly inefficient and complicated.
Muti-line cells are absolutely required for my use case. You cannot tell users that they can't put multiple lines in text boxes.