osiegmar/FastCSV

Support for coping with invalid quote chars

dankito opened this issue · 7 comments

First of all thanks a lot for your great work and this amazing library!

In a project i have to cope with some real world data from an external provider that contains quote chars within a cell like:

;"de:14612:735:1";"landwärts (Straße "Altkaitz")";"51,014313

"de:14627:4000:4";"am "Ross" (RV)";"51,163477"

"de:14628:1148:1";"Ri. "Am Windberg"";"51,002455"

To cope with these some little changes to RowReader would be required:

    boolean consume(final RowHandler rh, final char[] lBuf, final int lLen) {
        // ...

                        if (c == qChar) {
                            if (isInvalidQuoteChar(lPos, lLen, lBuf) == false) { // ignores invalid quote chars
                                lStatus &= ~STATUS_QUOTED_MODE;
                            }
                            continue mode_check;
                        }

        // ...
  }

    private boolean isInvalidQuoteChar(int lPos, int lLen, char[] lBuf) {
        if (lPos < lLen) {
            char nextChar = lBuf[lPos];
            if (nextChar != fsep && nextChar != CR && nextChar != LF) {
                return true;
            }
        }
        return false;
    }

So i would kindly ask you if it would be possible to integrate it?
If so i could add an option to CsvReader(Builder) and provide a Pull Request for this.

Thanks for your feedback!

I do have a few concerns about adding this functionality as adding support for "even more broken" CSV data often comes with:

  • unwanted side effects (bugs)
  • higher complexity / lower maintainability
  • degraded performance (a major goal of FastCSV)

Although, I'm not completely against it. It highly depends on the implementation and test cases – including performance checks. I'd be curious to see a PR but I can't promise to merge.

Your example would probably not work when the quote character sits exactly on the last buffer position (when you have no nextChar without re-fetching data). Your third example ("de:14628:1148:1";"Ri. "Am Windberg"";"51,002455") raises an interesting question – what to do if the fields "seem" to contain valid and invalid quote characters? The result would only be Ri. "Am Windberg" if ignoring a proper escaped quote character!

BTW: I did a quick check within the https://github.com/osiegmar/JavaCsvComparison project and found that none of the tested libraries seem to support this broken format (in the way you're expecting it).

You're right in all three points.
New features (almost always) lead to higher complexity.
I introduced a boolean flag so the performance degree should be negligible.
For tests see the PR.

"when the quote character sits exactly on the last buffer position":
That's a good point and here i need your help.
I made it that if lPos == lLen more data gets loaded:

if (c == qChar) {
    if (ignoreInvalidQuoteChars && lPos == lLen) { // end of buffer reached we first need to fetch more data
        // TODO: but on next iteration lPos is then lPos + 1 so quote char won't be checked again
        continue mode_check;
    }
    if (ignoreInvalidQuoteChars == false || isInvalidQuoteChar(lPos, lLen, lBuf) == false) {
        lStatus &= ~STATUS_QUOTED_MODE;
    }
    continue mode_check;
}

But then on next iteration lPos is incremented by one so that the quote char won't be checked again.
So invalid quote chars at the last position of the buffer work, see test case ignoreInvalidQuoteChars_InvalidQuoteCharAtEndOfBuffer().
But valid quote chars at the end of the buffer don't work this way, see test case ignoreInvalidQuoteChars_ValidQuoteCharAtEndOfBuffer().

Can you give me any hint how to solve this?
(I could introduce a property recheckLastCharacter if that's in your sense.)

This thing reminds me a bit of #67.

Probably you need to set the status to something you could check on in the next loop iteration – something like lStatus |= STATUS_LAST_CHAR_WAS_QUOTE before the continue mode_check.

That was the perfect hint, thank you! Now it works exactly as expected.

Thanks for your work on this. I studied the PR and thought quite some time about this. I also re-read the RFC and its upcoming revision.

Unfortunately I came to the conclusion, that I won't merge this into FastCSV.

I decided so, mainly because of the increased complexity, the fact that this ill-formed CSV format seems to be an extreme edge case and the main objective(s) of this project. Not even the corner cases in the RFC revision mentions unescaped quotes within quoted fields – only different quote escape characters.

Also I'm thinking about another CSV library for quite some time. A CSV library that main objectives could be clean code, maintainability and expandability. I'll keep this PR in mind for that.

I know it's licensed under MIT license, but would like to ask anyway: Would you mind me then forking your project and releasing it as Kotlin (Multiplatform) version including this fix?

Nope, feel free to do so.