Make row length check optional
lessless opened this issue · 7 comments
Hello,
thank you for the wonderful library. It's a pleasure to use, except for one pain point: the hard check on row length.
If there are no strong objections, making that check optional would be a wonderful addition.
Interesting, thanks for raising this. If I understand correctly, that would be something like CSV.decode(expected_row_length: 5)?
Rather CSV.decode(fixed_length: false), so that the check isn't performed at all and dealing with incomplete or overlong lines becomes the application's responsibility.
As a result,
Bruce,Wayne,bruce@wayne.com
Peter,
James,Howlett,james@howlett.com,49
will be parsed into
[
["Bruce", "Wayne", "bruce@wayne.com"],
["Peter", nil],
["James", "Howlett", "james@howlett.com", "49"]
]
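A minimal sketch of the lenient behaviour requested above, written against plain Elixir rather than the library. The module name LenientRows is hypothetical, the splitter is deliberately naive (no quoting or escaping), and mapping empty fields to nil is an assumption made to match the expected output in the comment:

```elixir
defmodule LenientRows do
  # Hypothetical helper, not part of the CSV library.
  # Naive splitting: quoted fields and escaped separators are NOT handled.
  def parse(text, separator \\ ",") do
    text
    |> String.split(["\r\n", "\n"], trim: true)
    |> Enum.map(fn line ->
      line
      |> String.split(separator)
      # Assumption: an empty field is reported as nil.
      |> Enum.map(fn
        "" -> nil
        field -> field
      end)
    end)
  end
end

input = """
Bruce,Wayne,bruce@wayne.com
Peter,
James,Howlett,james@howlett.com,49
"""

# No row length check happens here; rows keep whatever length they had.
rows = LenientRows.parse(input)
```

With no length check, deciding what a short or long row means is left entirely to the caller.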
Interesting suggestion. What program encodes CSV files while omitting the separators? If this encoding is intentional, is there a reason to do it that way (e.g. disk space limitations)?
This is challenging and can lead to interesting situations, such as header rows being shorter than data rows, which would throw away data. I can see where you're coming from; however, I would be inclined to suggest properly encoding the files before feeding them in.
I think I have the same kind of problem. I have to deal with some weird CSV (pipe as separator, and no escape character) that I don't produce and can't fix. On the master branch you made it easy for me to identify these lines, which is great because now I can filter them and generate nice error reports by pattern matching on the lines with {:error, "..."}.
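The filtering described above can be sketched as follows. The results list is hand-written stand-in data, not output captured from the library, and the {:ok, row} shape for valid lines is an assumption; only the {:error, "..."} shape appears in this thread:

```elixir
# Stand-in for decoder output (assumed shapes, see the note above).
results = [
  {:ok, ["a", "b"]},
  {:error, "Row has length 3 - expected length 2 on line 2"},
  {:ok, ["c", "d"]}
]

# Separate valid rows from error messages by pattern matching on the tuples.
{rows, errors} =
  Enum.reduce(results, {[], []}, fn
    {:ok, row}, {rows, errors} -> {[row | rows], errors}
    {:error, message}, {rows, errors} -> {rows, [message | errors]}
  end)

# The reduce builds both lists in reverse, so restore the original order.
rows = Enum.reverse(rows)
errors = Enum.reverse(errors)
```

The errors list can then feed an error report while rows continues through the normal pipeline.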
In the Ruby program I tried to port to Elixir, the Ruby CSV library doesn't seem to check the length, so when I have extra columns that shouldn't be there, Ruby still gives me the row. These extra columns are at the end of lines and are export errors, I think. But without the extra column at the end, these lines are valid and contain valuable info that I could extract.
So maybe when you return the error {:error, "Row has length 30 - expected length 29 on line 45"}, you could include the row in the tuple? Then I would have a chance to do something with it and extract the info I want.
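To illustrate what the proposal would enable, here is a sketch that assumes a three-element error tuple {:error, message, row}. That shape is hypothetical (it is the suggestion being made, not something the library is confirmed to return), and the Salvage module is invented for the example:

```elixir
defmodule Salvage do
  # Hypothetical: assumes errors arrive as {:error, message, row}.
  # Valid rows pass through untouched.
  def recover({:ok, row}, _expected), do: {:ok, row}

  # Too many columns: keep the first `expected` fields and drop the
  # trailing extras, which the comment above attributes to export errors.
  def recover({:error, _message, row}, expected) when length(row) > expected do
    {:ok, Enum.take(row, expected)}
  end

  # Too few columns: nothing to salvage, keep the error message.
  def recover({:error, message, _row}, _expected), do: {:error, message}
end

long = {:error, "Row has length 3 - expected length 2 on line 2", ["a", "b", "junk"]}
short = {:error, "Row has length 1 - expected length 2 on line 3", ["a"]}

recovered = Salvage.recover(long, 2)
still_error = Salvage.recover(short, 2)
```

Whether truncation is safe depends on the data; it only makes sense when the extra columns are known to be junk.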
Actually, after doing some research I found that, according to RFC 4180 (https://www.ietf.org/rfc/rfc4180.txt), all rows should be the same length:
"Each line should contain the same number of fields throughout the file."
So a line that is shorter or longer than the others is a clear violation of the standard, and throwing an error is legitimate behavior.
Also, because that error can easily be caught with a rescue clause, and because changing the format of the error message is orthogonal to the original subject, I'm closing this issue.
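Catching a row length error with rescue might look roughly like this. The exception Sketch.RowLengthError and the checker Sketch.Strict are defined locally for the example and are assumptions; the name and shape of the exception the library actually raises are not stated in this thread:

```elixir
defmodule Sketch.RowLengthError do
  # Hypothetical exception used only in this sketch.
  defexception [:message]
end

defmodule Sketch.Strict do
  # Raises when a row's length differs from the first row's length.
  def check!(rows) do
    expected = rows |> hd() |> length()

    rows
    |> Enum.with_index(1)
    |> Enum.each(fn {row, line} ->
      if length(row) != expected do
        raise Sketch.RowLengthError,
          message: "Row has length #{length(row)} - expected length #{expected} on line #{line}"
      end
    end)

    rows
  end
end

# Convert the raised exception into a tagged tuple the caller can match on.
result =
  try do
    {:ok, Sketch.Strict.check!([["a", "b"], ["c"]])}
  rescue
    e in Sketch.RowLengthError -> {:error, Exception.message(e)}
  end
```

Note that rescuing this way stops processing at the first bad row; the rows after it are not returned.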
Just looking at the spec is not very flexible. I am also getting files from a supplier that are missing separators at the end of the line, so now I have to manually go through each line and add them myself.
An option that simply skips the line length check would be easy to add and, for a user of the lib, a good way to get around the issue.
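Until such an option exists, the manual fix described above can be automated as a preprocessing step. This is a sketch only: PadRows is a hypothetical helper, and it counts raw separators, so it would miscount quoted fields that contain the separator:

```elixir
defmodule PadRows do
  # Hypothetical helper: appends the trailing separators a line is missing
  # so that it has `expected` fields. Lines that are already long enough
  # are returned unchanged.
  def pad_line(line, expected, separator \\ ",") do
    present = length(String.split(line, separator))
    missing = max(expected - present, 0)
    line <> String.duplicate(separator, missing)
  end
end

lines = ["Bruce,Wayne,bruce@wayne.com", "Peter", "James,Howlett"]

# Pad every line to three fields before handing the data to a strict decoder.
padded = Enum.map(lines, &PadRows.pad_line(&1, 3))
```

The padded lines can then be passed to a decoder that enforces a fixed row length.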
@MarkNijhof is validate_row_length: false working for your case? If not, can you post an example of the data you're dealing with?