beatrichartz/csv

Consider \r as a valid delimiter by default?

asterite opened this issue · 11 comments

Files coming from Windows users will just contain \r as the newline delimiter, and I think it would be good if the default decode method supported this out of the box. Otherwise there's no good/fast way to parse a CSV like that: we'd need to read each line, trim it, and then decode it. Using \r as a delimiter doesn't work, because if \r\n appears in the file then the \n will become part of the next line. \n could work, but then we'd have to trim each column's value.

What do you think?

Hi! Thanks for raising this. Could you outline your suggestion with an example?

The library currently takes a line-by-line input stream for decode, so it does no splitting of lines by itself. That stream can come from a file, but it does not have to. When encoding, the standard delimiter it uses is \r\n, the MS DOS CRLF.
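
For instance, encoding two rows gives output along these lines (a quick sketch; the exact output may vary with the version you're on):

iex> [~w(a b), ~w(c d)] |> CSV.encode |> Enum.to_list
["a,b\r\n", "c,d\r\n"]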

I mean that when reading from a stream, the CSV library should split lines itself and consider all of \r, \n and \r\n as valid line delimiters. Otherwise, how can you parse a CSV file where you don't know which line delimiter is used? I can't find an easy way to do it.

An example in Ruby:

$ irb
irb(main):001:0> require "csv"
=> true
irb(main):002:0> CSV.parse("one\ntwo\nthree")
=> [["one"], ["two"], ["three"]]
irb(main):003:0> CSV.parse("one\rtwo\rthree")
=> [["one"], ["two"], ["three"]]
irb(main):004:0> CSV.parse("one\r\ntwo\r\nthree")
=> [["one"], ["two"], ["three"]]

As you can see, CSV.parse worked well out of the box for any line delimiter, without me having to specify what delimiter to use. This is important as CSV is usually parsed from files uploaded by users where you don't know what delimiter is used, and you want the parsing to work in all cases.

Ok, I see where you're coming from. This library is designed to read lines as input from a stream, so it plays no role in splitting the lines up. To give an example, here's how we would decode a file with the code on latest master, given a test.csv file with the content:

one
two
three
iex> File.stream!("test.csv") |> CSV.decode! |> Enum.each(&IO.inspect/1)
["one"]
["two"]
["three"]
:ok

The File.stream! will split the file into lines, and CSV.decode! will then take in the lines and decode them into lists or maps depending on your header setting. Enum.each(&IO.inspect/1) runs the stream and prints the decoded values.

To construct an example similar to yours where strings are placed inline, we'd use StringIO to get an IO device, which we can then stream using IO.stream/2:

iex> {:ok, string_device1} = StringIO.open("one\ntwo\nthree")
{:ok, #PID<0.x.0>}
iex> {:ok, string_device2} = StringIO.open("one\rtwo\rthree")
{:ok, #PID<0.y.0>}
iex> {:ok, string_device3} = StringIO.open("one\r\ntwo\r\nthree")
{:ok, #PID<0.z.0>}

iex> string_device1 |> IO.stream(:line) |> CSV.decode! |> Enum.each(&IO.inspect/1)
["one"]
["two"]
["three"]
:ok

iex> string_device2 |> IO.stream(:line) |> CSV.decode! |> Enum.each(&IO.inspect/1)
["onetwothree"]
:ok

iex> string_device3 |> IO.stream(:line) |> CSV.decode! |> Enum.each(&IO.inspect/1)
["one"]
["two"]
["three"]
:ok

We can take CSV out of the pipeline to isolate the mechanism of splitting into lines:

iex> {:ok, string_device1} = StringIO.open("one\ntwo\nthree")
{:ok, #PID<0.x.0>}
iex> {:ok, string_device2} = StringIO.open("one\rtwo\rthree")
{:ok, #PID<0.y.0>}
iex> {:ok, string_device3} = StringIO.open("one\r\ntwo\r\nthree")
{:ok, #PID<0.z.0>}

iex> string_device1 |> IO.stream(:line) |> Enum.each(&IO.inspect/1)
"one\n"
"two\n"
"three"
:ok

iex> string_device2 |> IO.stream(:line) |> Enum.each(&IO.inspect/1)
"one\rtwo\rthree"
:ok

iex> string_device3 |> IO.stream(:line) |> Enum.each(&IO.inspect/1)
"one\n"
"two\n"
"three"
:ok

\r\n gets mapped to \n, \n stays, and \r is not accepted as a newline character. CSV then takes these lines and removes any \r or \n, assuming that valid newlines have already been split. So Erlang / Elixir do not treat \r as a valid newline, most likely because only classic Mac OS (pre-OSX) systems used it that way, and the last of those was released in 2001.

The CSV decoder is designed that way to make it agnostic to where the input is coming from, what size it has and where it is going: You can read from a file, from user input, from the network, and start streaming parsed output immediately to somewhere else. Splitting the input into lines is an external concern which is done better by the many libraries Elixir provides.
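
For completeness, here is a rough sketch of doing that splitting outside of CSV when you don't know the line ending in advance. It reads the whole file into memory, so it is only illustrative for small files, and the Enum.reject guards against a trailing newline producing an empty row:

iex> File.read!("test.csv") |> String.split(["\r\n", "\r", "\n"]) |> Enum.reject(&(&1 == "")) |> CSV.decode! |> Enum.each(&IO.inspect/1)
["one"]
["two"]
["three"]
:ok

Since the longest pattern wins when two patterns match at the same position, \r\n is consumed as a single break rather than leaving an empty line behind.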

Thank you for the very detailed reply!

There's one thing that might be wrong with what the CSV library assumes: that parsing can be done line by line. According to RFC 4180:

   6.  Fields containing line breaks (CRLF), double quotes, and commas
       should be enclosed in double-quotes.  For example:

       "aaa","b CRLF
       bb","ccc" CRLF
       zzz,yyy,xxx

So in the above example there's a newline that is part of the first row. We can see this in Ruby:

require "csv"

string = CSV.generate do |csv|
  csv << ["aaa", "bb\nb", "ccc"]
end
puts string

Output:

aaa,"bb
b",ccc

And Crystal:

require "csv"

string = CSV.build do |csv|
  csv.row ["aaa", "bb\nb", "ccc"]
end
puts string

Output:

aaa,"bb
b",ccc

As you can see, parsing line by line is not the correct way to do it. The CSV parser should go character by character: once an opening double quote is found, any newline up to the closing double quote is part of that field's value, not a row separator.

Now, given that a CSV parser must take care of newlines itself (that is, not receive pre-split lines from File.stream! or IO.stream), it would also be nice if it automatically considered all of \r, \n and \r\n as newlines.
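
To illustrate, a quote-aware splitter could look roughly like this (just a sketch; the module name RowSplitter is made up, and it ignores edge cases such as unterminated quotes):

defmodule RowSplitter do
  # Split a CSV string into rows, treating \r, \n and \r\n as row
  # separators only when outside of double quotes.
  def split(string), do: do_split(string, "", [], false)

  # a double quote toggles the in_quotes state
  defp do_split(<<?", rest::binary>>, acc, rows, in_quotes),
    do: do_split(rest, acc <> "\"", rows, not in_quotes)

  # outside quotes, \r\n, \r or \n all end the current row
  defp do_split(<<"\r\n", rest::binary>>, acc, rows, false),
    do: do_split(rest, "", [acc | rows], false)

  defp do_split(<<c, rest::binary>>, acc, rows, false) when c in [?\r, ?\n],
    do: do_split(rest, "", [acc | rows], false)

  # anything else, including newlines inside quotes, stays in the row
  defp do_split(<<c::utf8, rest::binary>>, acc, rows, in_quotes),
    do: do_split(rest, acc <> <<c::utf8>>, rows, in_quotes)

  defp do_split(<<>>, "", rows, _in_quotes), do: Enum.reverse(rows)
  defp do_split(<<>>, acc, rows, _in_quotes), do: Enum.reverse([acc | rows])
end

iex> RowSplitter.split("aaa,\"bb\nb\",ccc\r\nzzz,yyy,xxx")
["aaa,\"bb\nb\",ccc", "zzz,yyy,xxx"]

The resulting rows could then be piped into the decoder as before.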

In fact right now the encoder doesn't respect the RFC:

result = Encoder.encode([["aaa", "bb\nb", "ccc"]]) |> Enum.take(1)
IO.puts(result)

Output:

aaa,"bb\nb",ccc

It seems newlines are encoded as backslash+n, but that's not what the RFC says.

Just a quick reply, will follow up with more detail later: the library does parse multiline escapes correctly, but yes I have thought about parsing by byte :)

So CSV does parse multiline escapes in a preprocessing step by aggregating escaped lines as per the RFC - there are detailed tests in the test suite if you want to have a look.

Byte by byte parsing would be another way to do this, and I'd love to have it in the library as another valid preprocessing step so a user could plug it into their pipeline as needed. I'm working on this, but a PR would definitely be welcome. I don't think such a preprocessor should treat CR as a newline by default, given that Erlang and Elixir don't treat it that way, but it should be configurable.

The encoder currently does not replace LF with CRLF in escaped fields. Whether that's a violation of the RFC rules is debatable: given that an escaped field can contain any character, treating LF as an exception and replacing it with CRLF would be unwarranted imo since the user gave it to the encoder as LF.
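
If someone does need strict CRLF inside escaped fields, they can normalize before encoding, along these lines (a sketch; it assumes the fields don't already contain \r\n, and whether such fields end up quoted depends on the encoder version):

[["aaa", "bb\nb", "ccc"]]
|> Enum.map(fn row -> Enum.map(row, &String.replace(&1, "\n", "\r\n")) end)
|> CSV.encode
|> Enum.to_list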

I thought about this for quite a while now, and I've also gone through implementing a prototype preprocessor for CSV taking in byte by byte input.

There are two questions here: the more general problem that Erlang (and therefore Elixir) does not support a more agnostic way of reading files in, and the architectural decision of whether this library should take byte-by-byte stream input. Solving one won't solve the other and vice versa, so let's take them one by one.

To describe question 1 in more detail, File.Stream uses IO.each_stream, which uses :io.get_line. LF and CRLF are handled, but not CR. I can offer no solution so far, but I want to help. I've created a repo for legacy file reading where I will work on providing a File.stream!-like API for legacy files. I hope to be able to implement this soon.
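
To give an idea of the direction (just a sketch, not that repo's actual API; the module name AnyNewline is made up):

defmodule AnyNewline do
  # Stream a file in fixed-size binary chunks and re-split it into lines
  # on \r\n, \r or \n, carrying partial lines across chunk boundaries.
  def stream!(path, chunk_size \\ 2048) do
    path
    |> File.stream!([], chunk_size)
    # sentinel so the final unterminated line gets flushed
    |> Stream.concat([:eof])
    |> Stream.transform("", fn
      :eof, "" -> {[], ""}
      :eof, carry -> {[String.trim_trailing(carry, "\r")], ""}
      chunk, carry -> split_buffer(carry <> chunk)
    end)
  end

  defp split_buffer(buffer) do
    # hold back a trailing \r - it may be the first half of a \r\n
    # that continues in the next chunk
    {head, held} =
      if String.ends_with?(buffer, "\r") do
        {binary_part(buffer, 0, byte_size(buffer) - 1), "\r"}
      else
        {buffer, ""}
      end

    # the longest pattern wins at equal positions, so \r\n is one break
    {lines, [rest]} = head |> String.split(["\r\n", "\r", "\n"]) |> Enum.split(-1)
    {lines, rest <> held}
  end
end

iex> AnyNewline.stream!("legacy.csv") |> CSV.decode! |> Enum.each(&IO.inspect/1)

Unlike File.stream!, these lines come without their terminators, which is fine here since CSV strips trailing \r and \n anyway.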

About question 2, I have to say that I understand the desire for the convenience of something like a CSV.read_file method. However, the thing I like about Elixir is that we can follow the UNIX philosophy more closely: text streams in as discrete lines, CSV does one thing well, and out come the rows, which you can pipe into something else. Simple. We don't have to change this, as your issue and #62 can be solved by addressing question 1.

I will close this, but will update you as soon as LegacyFile is usable.

Btw, this is also the case with Excel 2011 (and probably later versions), which is the source of a lot of CSV files and just as much legacy :)
@beatrichartz even a temporary Unicode-safe solution would be great.

I have also run into this issue.

+1 on the Excel thing. I run into this constantly.

So far the only solution is to determine the delimiter manually.
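
For what it's worth, one rough way to do that is to peek at the beginning of the file and pick whichever line ending appears first (a sketch; the module name and the 4096-byte sample size are made up, and it assumes a non-empty file):

defmodule NewlineSniffer do
  def sniff(path) do
    # read a sample and find the first line ending; the longest match
    # wins at equal positions, so \r\n is preferred over a lone \r
    sample = File.open!(path, [:read, :binary], &IO.binread(&1, 4096))

    case :binary.match(sample, ["\r\n", "\r", "\n"]) do
      {_pos, 2} -> "\r\n"
      {pos, 1} -> binary_part(sample, pos, 1)
      :nomatch -> "\n"
    end
  end
end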