tilo/smarter_csv

v1.8.5 introduces a behavior change when handling trailing backslashes in quoted columns

averycrespi-moz opened this issue · 6 comments

In version 1.8.5, SmarterCSV introduced logic to parse \" like an escaped double quote.

The justification for this change was that "some tools will generate an escaped value of \" instead of """.

However, this change breaks RFC4180-compliant CSV files which have trailing backslashes in quoted columns.

For example, trying to parse this CSV file:

first,second
"foo \",bar

with this code:

require 'smarter_csv'

SmarterCSV.process('test.csv') do |chunk|
  chunk.map do |row|
    puts row
  end
end

raises the following error:

/Users/averycrespi/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/smarter_csv-1.9.0/lib/smarter_csv.rb:74:in `readline': end of file reached (EOFError)
        from /Users/averycrespi/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/smarter_csv-1.9.0/lib/smarter_csv.rb:74:in `process'
        from repro.rb:3:in `<main>'

Can you please consider either reverting this change, or adding a new option to control this behaviour?

Let me know if you have any questions, or if I've overlooked something in the docs.

Thanks!


System information:

  • Ruby 3.2.2
  • SmarterCSV 1.9.0 (but this issues impacts >= 1.8.5)
  • macOS Ventura 13.4.1 (c)
tilo commented

@averycrespi-moz can you please provide a small sample file for testing?

Sure, here's a simple file adapted from one of your existing test fixtures:
quoted_trailing_backlash.csv

tilo commented

RFC4180 does not specifically talk about the use of escaped quote characters\", and also does not talk about if escaped quotes can be un-balanced like in your test file. ChatGPT agrees.

When doing the change in version 1.8.5 it was a conscious decision to only allow balanced quotes - in particular: balanced escaped quotes.

Data line 1 and line 2 do not have balanced quotes, and are not considered valid.

However the line 3,"Billy \\",Bob should be valid, and read as first name Billy \

RFC4180 does not specifically talk about the use of escaped quote characters", and also does not talk about if escaped quotes can be un-balanced like in your test file. ChatGPT agrees.

You're right, RFC4180 doesn't talk about using backslashes to escape quotes. The issue is that those backslashes should be interpreted literally (NOT as an escape character) according to RFC4180, but some CSV parsers choose to extend RFC4180 by redefining a backslash as an escape character.

When doing the change in version 1.8.5 it was a conscious decision to only allow balanced quotes - in particular: balanced escaped quotes.

Allowing only balanced escaped quotes makes sense. However, the issue here is whether or not a backslash should be treated as an escape character when :quote_char is set to " in the SmarterCSV options.

Data line 1 and line 2 do not have balanced quotes, and are not considered valid.

This is only true if the backslash is treated an as escape character. If the backslash is a literal, then 1,"John\",Doe is balanced.

TL;DR: RFC4180 only defines the double quote as an escape character, but some CSV parsers also defines the backslash as an escape character. As a user, it would be nice to be able to configure whether or not a backslash is treated as an escape character.

E.g. if there was a :escape_backslash option:

  • escape_backslash: true would mean 1,"John\",Doe is unbalanced
  • escape_backslash: false would mean 1,"John\",Doe is balanced

Thanks for your time, and hopefully that makes sense!

tilo commented

the issue here is whether or not a backslash should be treated as an escape character when :quote_char is set to " in the SmarterCSV options.

@averycrespi-moz The escape character \ is the standard escpape character used on UNIX systems, like LINUX, MacOS, and others, as well as on Windows systems. It is also used by many programming languages. It is built-in to those systems and languages.

I made the decision to keep with requiring balanced quotes, and will fix the issue with \\ at the end of a line. This will be fixed in version 1.9.2. The README will also call out these assumptions / limitations.

tilo commented

Fix for fields ending in // was released in version 1.9.2