Lib does not work if cell contains \n characters

Question

reaver2oo3 opened this issue 6 years ago · 5 comments

Hello,

I have integrated your library into an application that I am working on and what I have found is that if a cell has text in it that is spread into multiple lines, then the application will crash. :(
Is there any hope for a fix?
One possible solution I have thought of: if a cell starts with a quote but the closing quote isn't found by the end of the line, then more lines should be read until the matching quote is found. I haven't managed to write a fix yet, as I haven't looked into your code enough.

Kind regards,
Daniel

Answer 1 · 2020-02-26T11:17:12.000Z

Hi

I have the same problem. Cells with double quoted strings that contain \n inside are parsed incorrectly.

id, text
1, "abc
efg"

I have fixed this locally and it seems to be working.
The problem is in the function LineReader::next_line, at line 465.
Replace the original block, lines 465-468, with this:

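int line_end = data_begin;
bool is_in_string = false; // inside a double-quoted cell
bool has_quote = false;    // the previous char in the cell was an unpaired quote
while (line_end != data_end) {
    if (buffer[line_end] == '\"') {
        if (is_in_string)
            has_quote = !has_quote; // closing quote, or first half of an escaped ""
        else
            is_in_string = true;
    }
    else if (buffer[line_end] == '\n') {
        // A newline ends the line unless it is inside a still-open quoted cell.
        // The has_quote check covers a closing quote directly followed by '\n'.
        if (!is_in_string || has_quote)
            break;
    }
    else {
        if (is_in_string && has_quote) {
            // The unpaired quote was the closing one, so the cell has ended.
            is_in_string = false;
            has_quote = false;
        }
    }
    ++line_end;
}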

This code does not consider other quote_policies. This needs to be done correctly.
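
As a rough illustration of what "considering other quote policies" might involve, the scan could take the quote character as a parameter and skip quote handling entirely when quoting is disabled. This is only a sketch: `find_record_end` and the use of `'\0'` to mean "no quoting" are assumptions made for the example, not how the library wires its policies into LineReader.

```cpp
#include <cstddef>

// Hypothetical sketch, not library code: returns the index of the newline that
// ends the record starting at `begin`, or `end` if no such newline is found.
// Pass quote == '\0' to disable quote handling (every '\n' ends the record).
std::size_t find_record_end(const char* buffer, std::size_t begin,
                            std::size_t end, char quote) {
    bool in_quotes = false;
    for (std::size_t i = begin; i != end; ++i) {
        if (quote != '\0' && buffer[i] == quote) {
            in_quotes = !in_quotes;  // a doubled quote toggles twice
        } else if (buffer[i] == '\n' && !in_quotes) {
            return i;
        }
    }
    return end;
}
```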

I'm using the CSV reader with the double_quote_escape policy like this:

CSVReader<2, trim_chars<>, double_quote_escape<',', '\"'>> csv(file);

and everything works for me.

Answer 2 · 2020-03-04T20:18:10.000Z

The issue with \n in quoted strings has been raised a lot in the past. This is known. A summary of the problems:

The newline handling is in LineReader, which is not parameterizable. This makes it difficult to make changes in a clean and backward-compatible way.
It is unclear how to handle "\n" vs "\r\n" newlines. Do we want to translate these automatically? Some people want to, others do not. There are a lot of corner cases that can lead to unexpected behavior. Not handling them at all at least does not silently do the wrong thing.
Error message generation gets a lot more complex because of possible runaway problems if a quote is missing.
My personal opinion is that CSV is a text file format, and text files should be readable in a text editor. If a column spans multiple lines, the file gets hard to read, as records can no longer be copied line-wise. My opinion is thus that you are abusing the format if you have literal newlines. The solution is to escape newlines.
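
For illustration, escaping could look roughly like the sketch below. The function names and the backslash convention (`\n`, `\r`, `\\`) are assumptions chosen for the example; the library does not prescribe any particular scheme. The writer would call `escape_newlines` on each cell before emitting it, and the reader would call `unescape_newlines` on each cell after parsing.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: replaces literal newlines with two-character escapes
// so that every record stays on one physical line.
std::string escape_newlines(const std::string& s) {
    std::string out;
    out.reserve(s.size());
    for (char c : s) {
        switch (c) {
            case '\\': out += "\\\\"; break;  // escape the escape character
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            default:   out += c;
        }
    }
    return out;
}

// Hypothetical helper: reverses escape_newlines.
std::string unescape_newlines(const std::string& s) {
    std::string out;
    out.reserve(s.size());
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] == '\\' && i + 1 < s.size()) {
            char next = s[++i];
            if (next == 'n')       out += '\n';
            else if (next == 'r')  out += '\r';
            else if (next == '\\') out += '\\';
            else { out += '\\'; out += next; }  // unknown escape: keep as-is
        } else {
            out += s[i];
        }
    }
    return out;
}
```

Because "\n" and "\r" are escaped separately, this scheme preserves whichever newline style the cell originally contained rather than translating it.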

Up to now I have not yet seen a good solution.

Answer 3 · 2020-03-05T10:42:38.000Z

We are processing many csv files downloaded from various webservices and/or generated by tools.
A literal \n inside a CSV string is absolutely common, with no escaping. I think the only escape used in CSV files is the doubled double quote "" to escape a double quote.

The runaway problem caused by a missing closing double quote is not a problem of the parser; it is a problem of the generator, and it should not be corrected by the parser.
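
Even without correcting anything, a parser could at least report such a file clearly. Here is a minimal sketch, assuming the whole input is in memory and that the caller tracks the line number where the record starts; `find_record_end_checked` and the `max_record_bytes` limit are made up for the example and are not part of the library.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical sketch: returns the index of the newline ending the record that
// starts at `begin` (or data.size() if the last record has no newline).
// Throws if the record grows past max_record_bytes or a quote is never closed.
std::size_t find_record_end_checked(const std::string& data, std::size_t begin,
                                    std::size_t start_line,
                                    std::size_t max_record_bytes) {
    bool in_quotes = false;
    for (std::size_t i = begin; i < data.size(); ++i) {
        if (i - begin >= max_record_bytes) {
            throw std::runtime_error(
                "record starting at line " + std::to_string(start_line) +
                " exceeds the size limit (missing closing quote?)");
        }
        if (data[i] == '"') {
            in_quotes = !in_quotes;  // a doubled "" toggles twice
        } else if (data[i] == '\n' && !in_quotes) {
            return i;
        }
    }
    if (in_quotes) {
        throw std::runtime_error(
            "unterminated quote in record starting at line " +
            std::to_string(start_line));
    }
    return data.size();
}
```

The size limit keeps a single missing quote from swallowing the rest of a large file before the error is reported.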

Solving this by "correcting" the CSV before reading (escaping the newlines in all multi-line records) and then unescaping them again after reading is fairly inefficient and complicated.

Answer 4 · 2022-06-30T15:59:45.000Z

Multi-line cells are absolutely required for my use case. You cannot tell users that they can't put multiple lines in text boxes.

Answer 5 · 2022-07-02T17:52:29.000Z

Then you have to either escape newlines by replacing them somehow, not use CSV, or look for a different library.

…

On 6/30/22 17:59, Exceter007 wrote: Multi-line cells are absolutely required for my use case. You cannot tell users that they can't put multiple lines in text boxes.