beatrichartz/csv

Adding an option to set up the quote_char

jumichot opened this issue · 6 comments

Hi ! I'm currently porting ETL ruby code in elixir and I have a problem in some files.

I'm on the master branch and I use the | as separator. I get an Exception about CSV.EscapeSequenceError when I have only one " in the line which can happen because this character is not intended to be used for escaping in my case. Here a line that trigger the exception :
TEST|RES "LES PRES LE ROY||ZZ

I was expecting on the master branch to receive a tuple for this error : {:error, EscapeSequenceError...}, can I do something for getting that instead the exception ?
The error message in the exception was wrong about the line number of the problem, I'll look why but there is maybe a bug here.

In the csv ruby library I can set a different character (the quote_char option) for escaping and it solved my problem if I put a character that I'm sure is not in the file : https://ruby-doc.org/stdlib-2.0.0/libdoc/csv/rdoc/CSV.html#method-c-new

Do you think that can of option could be introduced in the library ? I can try to submit a PR for that if it's ok

Ok I looked a little bit in the code :

  • I now why I get the exception, there is a raise during decoding csv/lib/csv/decoding/preprocessing/codepoints.ex:31
  • and I found the variable @double_quote in lib/csv/defaults.ex, I think it's what I want to overidde with my quote_char option

Interesting, thanks for sharing - Your usecase is a "stray quote", which is considered illegal as per RFC4180 - I have seen them in quite a few files in the wild and I am currently considering an option to allow them - something like CSV.decode(allow_stray_quotes: true) - I think that might be nicer than overriding the double quote since you do not want to really replace the quote, just get rid of its function as a escape sequence marker.

As you rightly pointed out this also currently raises an error in every case in latest master. This would need some kind of track back mechanism disassemble lines collected into an escape sequence into "normal" lines.

The master branch is currently under active development, I hope I can get some updates in soon to start some of these enhancements.

I stumbled on this when trying to use \" as the escape sequence (because the data was formatted and escaped from a DB dump):

y,"<p>yellow, \"no, blue!</p>"

This produces:

 ** (CSV.LineAggregator.CorruptStreamError) Stream halted with unterminated escape sequence
     stacktrace:
       lib/csv/line_aggregator.ex:36: anonymous fn/1 in CSV.LineAggregator.aggregate/2

When using a different separator (?| as with jumichot), I get the error when the value isn't escaped (i.e., it shouldn't be a separator to escape):

y|yellow? no, "blue

The side effects of this can be confusing. I had a situation where a 644 line input file only parsed 511 of the lines, but silently continued on. I only discovered the data was missing when I queried the database. I've also seen the error reported as having the wrong number of columns (when using headers), presumably because there was a comma inside the sequence after a failed escape. (Sorry, I don't have an example of this one, as it came up while I was debugging a few different things.)

@defndaines you can escape quotes in csv documents by preceding them with another quote rather than a backslash, i.e. the proper escaping for your example according to RFC 4180 is:

y,"<p>yellow, ""no, blue!</p>"

This should get parsed correctly. Your second example is a stray quote which with the current implementation in master will produce a CSV.EscapeSequenceError when the stream is halted. If the stream is not halted CSV will wait for more input to collect into the escape sequence until it sees a single double quote marking the end of it.

Curious, what type of db are you using and what program is used to dump the data to the csv file you're reading in?

I got bitten by this as well. Is there a simple workaround that can be used inside an Elixir-program?

You can now use the :escape_character option in version 3.0.0. Feel free to reopen if there are issues using it for what you want to do.