ben-strasser/fast-cpp-csv-parser

Lib does not work if cell contains \n characters

Opened this issue · 5 comments

Hello,

I have integrated your library into an application that I am working on, and I have found that if a cell contains text that spans multiple lines, the application crashes. :(
Is there any hope for a fix?
A possible solution I have thought of: if a cell starts with a quote but the closing quote isn't found by the end of the line, then more lines should be read until the matching quote is found. I haven't managed to write a fix attempt yet, as I haven't looked into your code enough.

Kind regards,
Daniel

Hi

I have the same problem. Cells with double quoted strings that contain \n inside are parsed incorrectly.

id, text
1, "abc
efg"

I have fixed this locally and it seems to be working.
The problem is in the function LineReader::next_line, at line 465 (link).
Replace the original block, lines 465-468, with this:

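int line_end = data_begin;
bool is_in_string = false;  // inside a double-quoted cell
bool has_quote = false;     // odd number of consecutive quotes seen inside the cell
while(line_end != data_end){
    if (buffer[line_end] == '"') {
        if (is_in_string)
            has_quote = !has_quote;  // closing quote, or one half of an escaped ""
        else
            is_in_string = true;
    }
    else if (buffer[line_end] == '\n') {
        // A newline ends the line unless it sits inside an open quoted cell.
        // The has_quote check covers a closing quote directly before '\n'.
        if (!is_in_string || has_quote)
            break;
    }
    else {
        if (is_in_string && has_quote) {
            // The pending quote was a closing quote, so the cell has ended.
            is_in_string = false;
            has_quote = false;
        }
    }
    ++line_end;
}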

This code does not take the other quote policies into account. A proper fix would need to handle them as well.

I'm using the CSV reader with the double_quote_escape policy like this:

CSVReader<2, trim_chars<>, double_quote_escape<',', '\"'>> csv(file);

and everything works for me.

The issue with \n in quoted strings has been raised many times in the past. This is known. In summary, the problems are:

  • The newline handling is in LineReader, which is not parameterizable. This makes it difficult to make changes in a clean and backward-compatible way.
  • It is unclear how to handle "\n" vs "\r\n" newlines. Do we want to translate these automatically? Some people want to, others do not. There are a lot of corner cases that can lead to unexpected behavior. Not handling them at all at least does not silently do the wrong thing.
  • Error message generation gets a lot more complex because of possible run-away problems if a quote is missing.
  • My personal opinion is that CSV is a text format and text files should be readable in a text editor. If a column spans multiple lines, the file gets hard to read because records can no longer be copied line-wise. My opinion is thus that you are abusing the format if you have literal newlines. The solution is to escape newlines.
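
For the last point, escaping could look like the minimal sketch below. The scheme (a backslash is doubled, a newline becomes a backslash followed by the letter n) and the names escape_newlines and unescape_newlines are assumptions made for illustration; nothing in this thread specifies them.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helpers, not part of fast-cpp-csv-parser.
// Assumed scheme: a backslash is doubled and a newline is written as
// backslash + 'n', so every cell stays on one physical line.
std::string escape_newlines(const std::string& cell) {
    std::string out;
    for (char c : cell) {
        if (c == '\\')      out += "\\\\";
        else if (c == '\n') out += "\\n";
        else                out += c;
    }
    return out;
}

// Reverses escape_newlines. An unknown escape or a trailing backslash
// is copied through unchanged.
std::string unescape_newlines(const std::string& cell) {
    std::string out;
    for (std::size_t i = 0; i < cell.size(); ++i) {
        if (cell[i] == '\\' && i + 1 < cell.size()) {
            char next = cell[i + 1];
            if (next == 'n')  { out += '\n'; ++i; continue; }
            if (next == '\\') { out += '\\'; ++i; continue; }
        }
        out += cell[i];
    }
    return out;
}
```

The writer would call escape_newlines on each cell before emitting it, and the reader would call unescape_newlines on each cell after parsing. Both sides have to agree on the scheme, which only works when you control the generator.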

Up to now I have not yet seen a good solution.

We are processing many CSV files downloaded from various web services and/or generated by tools.
A \n inside a CSV string is absolutely common, with no escaping. I think the only escape used in CSV files is the doubled double quote "" to escape a double quote.
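
As an illustration of that convention, here is a small sketch that strips the surrounding quotes from a cell and collapses each "" into a single ". The helper name unquote_cell is hypothetical, and this is not how the library implements its quote policies.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of fast-cpp-csv-parser.
// Strips the surrounding quotes from a cell and collapses each "" into
// a single ". Cells that are not quoted are returned unchanged.
std::string unquote_cell(const std::string& cell) {
    if (cell.size() < 2 || cell.front() != '"' || cell.back() != '"') {
        return cell;
    }
    std::string out;
    for (std::size_t i = 1; i + 1 < cell.size(); ++i) {
        out += cell[i];
        // Skip the second quote of a "" pair (never the closing quote).
        if (cell[i] == '"' && i + 2 < cell.size() && cell[i + 1] == '"') ++i;
    }
    return out;
}
```

Embedded newlines need no special treatment here: they are ordinary characters inside the quotes and are copied through as-is.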

The runaway caused by a missing closing double quote is not a problem of the parser; it is a problem of the generator, and the parser should not try to correct it.

Solving this by "correcting" the CSV before reading, that is, escaping all multi-line records with some escape sequence and then unescaping all escaped newlines after reading, is fairly inefficient and complicated.
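
To make the cost concrete, such a pre-pass would look something like the sketch below. The function name flatten_quoted_newlines and the choice of backslash + n as the replacement are assumptions for illustration. The whole input has to be scanned once with quote tracking before the real parser even starts, and every cell has to be unescaped again afterwards.

```cpp
#include <string>

// Hypothetical pre-pass, not part of fast-cpp-csv-parser.
// Rewrites each newline that sits inside a quoted cell as the two
// characters backslash + 'n', so every record fits on one physical line.
// Assumes '"' quoting with "" as the escaped quote. Backslashes already in
// the data are left alone, so the mapping is not fully reversible.
std::string flatten_quoted_newlines(const std::string& csv) {
    std::string out;
    out.reserve(csv.size());
    bool in_quotes = false;
    for (char c : csv) {
        if (c == '"') in_quotes = !in_quotes;
        if (c == '\n' && in_quotes) out += "\\n";
        else out += c;
    }
    return out;
}
```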

Multi-line cells are absolutely required for my use case. You cannot tell users that they can't put multiple lines in text boxes.