Extend support of CSV with CSV Dialect
nichtich opened this issue · 8 comments
First, thanks for this great work! The CSV format can be troublesome because of its many dialects. If support of CSV is going to be extended, I recommend using the CSV Dialect Specification or a compatible subset of it. So far, fq supports two CSV Dialect properties, with differing names and defaults. An alternative is the CSVW dialect description (thanks xkcd #972!). Here is a comparison:
| fq | default | csvddf | default | csvw | default |
|---|---|---|---|---|---|
| comma | , | delimiter | , | delimiter | , |
| comment | # | commentChar | not set | commentPrefix | null or # (spec is ambiguous) |
Names can be adjusted by aliases; I prefer short names anyway. The remaining properties found in csvddf and csvw are:
| csvddf | default | csvw | default |
|---|---|---|---|
| quoteChar | " | quoteChar | " |
| skipInitialSpace | false | skipInitialSpace | false |
| header | true | header | true |
| being discussed | | headerRowCount | 1 if header set else 0 |
| lineTerminator | \r\n | lineTerminators | ["\r\n", "\n"] |
| doubleQuote | true | doubleQuote | true |
| escapeChar | not set | | |
| nullSequence | not set | | |
| | | skipBlankRows | false |
| | | skipColumns | 0 |
| | | skipRows | 0 |
| | | encoding | utf-8 |
| | | trim | true |
Property doubleQuote differs in meaning between the two (csvw uses it to also set escapeChar). csvddf further has property caseSensitiveHeader (default false) but there are discussions to remove it.
More specifically, I'd first:
- rename comma to delimiter and optionally keep comma as alias
- change default of comment to null (most CSV in practice does not allow comments by default)
- add quoteChar and skipInitialSpace as csvd and csvw agree on those
- add header to automatically convert rows to objects when set (default false although csvd and csvw both have true as default but this can be discussed).
More support of CSV dialect requires at least someone with experience in actually working with messy CSV data (e.g. users of mr) because authors of standards tend to add features without common use cases.
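The first steps proposed in this comment could be modelled roughly as follows. This is only a sketch under assumptions: the `Dialect` type, its JSON field names and `parseDialect` are hypothetical and not part of fq. The defaults follow the proposal (no comment character, `header` false), and `comma` is accepted as an alias of `delimiter`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Dialect is a hypothetical option set covering the first proposed steps.
type Dialect struct {
	Delimiter        string  `json:"delimiter"`
	Comment          *string `json:"comment"` // nil means comments are disabled
	QuoteChar        string  `json:"quoteChar"`
	SkipInitialSpace bool    `json:"skipInitialSpace"`
	Header           bool    `json:"header"`
}

// parseDialect starts from the proposed defaults and overrides only the
// properties present in the JSON object. "comma" is treated as an alias
// that applies when "delimiter" itself is not given.
func parseDialect(data []byte) (Dialect, error) {
	d := Dialect{Delimiter: ",", QuoteChar: `"`}
	if err := json.Unmarshal(data, &d); err != nil {
		return d, err
	}
	var alias struct {
		Delimiter *string `json:"delimiter"`
		Comma     *string `json:"comma"`
	}
	if err := json.Unmarshal(data, &alias); err != nil {
		return d, err
	}
	if alias.Delimiter == nil && alias.Comma != nil {
		d.Delimiter = *alias.Comma
	}
	return d, nil
}

func main() {
	d, err := parseDialect([]byte(`{"comma": ";", "header": true}`))
	if err != nil {
		panic(err)
	}
	// prints: ";" true true
	fmt.Printf("%q %v %v\n", d.Delimiter, d.Comment == nil, d.Header)
}
```

Using a pointer for `Comment` keeps "not set" distinct from any actual character, which matches the proposed null default.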
Hey, that is great and very helpful research, didn't know about any of those CSV dialect standards. I think your suggestions make sense to do. Is it something you would like to help out with coding-wise? Might speed things up a bit.
> rename comma to delimiter and optionally keep comma as alias

Yep, good idea. The only (not great) reason it's called comma now is that this is what it's called in the CSV parser used at the moment: https://pkg.go.dev/encoding/csv#Reader
> change default of comment to null (most CSV in practice does not allow comments by default)

Ok, so all lines will be treated as data?
> add quoteChar and skipInitialSpace as csvd and csvw agree on those
>
> add header to automatically [convert rows to objects](https://github.com/wader/fq/blob/master/doc/formats.md#convert-rows-to-objects-based-on-header-row) when set (default false although csvd and csvw both have true as default but this can be discussed).

👍 Could possibly also move the convert-to-object code into Go if doing it in jq is slow.
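For illustration, a conversion of rows to objects in Go might look roughly like this. It is a minimal sketch, not fq code: `rowsToObjects` is a hypothetical helper, it assumes the first row is the header, and it simply omits keys for rows that are shorter than the header.

```go
package main

import "fmt"

// rowsToObjects treats the first row as header and turns every following
// row into a map keyed by the header names. Hypothetical helper, not fq code.
func rowsToObjects(rows [][]string) []map[string]string {
	if len(rows) == 0 {
		return nil
	}
	header := rows[0]
	objects := make([]map[string]string, 0, len(rows)-1)
	for _, row := range rows[1:] {
		obj := make(map[string]string, len(header))
		for i, name := range header {
			// Rows shorter than the header leave the missing keys unset.
			if i < len(row) {
				obj[name] = row[i]
			}
		}
		objects = append(objects, obj)
	}
	return objects
}

func main() {
	rows := [][]string{{"a", "b"}, {"1", "2"}, {"3", "4"}}
	for _, obj := range rowsToObjects(rows) {
		// prints "1 2" and then "3 4"
		fmt.Println(obj["a"], obj["b"])
	}
}
```

Duplicate header names would overwrite each other in this sketch; how to handle them is left open here.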
Maybe the csv decoder could have a "dialect" option that is either a string naming a dialect or an object with settings?
One thing is to figure out if we could still use the csv parser in the Go standard library or need to find another existing one or write one ourselves.
> Is it something you would like to help out with coding-wise? Might speed things up a bit.

I'm very motivated but have not coded in Go yet (should be doable and happy to learn) so the "might speed things up" does not apply. So it depends :-) But data formats are my research topic and I heavily use jq so sooner or later I need to dig deeper into fq anyway.
> Ok, so all lines will be treated as data?

Yes, most CSV parsers don't enable comments by default.
> Maybe the csv decoder could have a "dialect" option that is either a string naming a dialect or an object with settings?

Yes, but then you need to manage names of dialects. The only commonly agreed names I know are RFC 4180 and TSV (probably better as rfc4180 and tsv). Most people don't document their data formats on this level with names but just assume CSV as supported by the software library they happen to use (and end up with incompatible edge cases).
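A "dialect" option that is either a name or an object with settings could be resolved along these lines. This is a sketch under assumptions: `Settings`, `named` and `resolveDialect` are hypothetical, only the two names mentioned above are registered, and the tsv entry (tab delimiter, no quote character) is an assumption rather than something agreed in this thread.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Settings is a deliberately small, hypothetical subset of dialect properties.
type Settings struct {
	Delimiter string `json:"delimiter"`
	QuoteChar string `json:"quoteChar"`
}

// named holds the only two names mentioned in the discussion. The tsv entry
// is an assumption: tab-delimited without a quote character.
var named = map[string]Settings{
	"rfc4180": {Delimiter: ",", QuoteChar: `"`},
	"tsv":     {Delimiter: "\t", QuoteChar: ""},
}

// resolveDialect accepts either a JSON string naming a dialect or a JSON
// object with explicit settings. Settings missing from the object fall
// back to the rfc4180 values.
func resolveDialect(option string) (Settings, error) {
	var name string
	if err := json.Unmarshal([]byte(option), &name); err == nil {
		s, ok := named[name]
		if !ok {
			return Settings{}, fmt.Errorf("unknown dialect %q", name)
		}
		return s, nil
	}
	s := named["rfc4180"]
	if err := json.Unmarshal([]byte(option), &s); err != nil {
		return Settings{}, err
	}
	return s, nil
}

func main() {
	for _, opt := range []string{`"tsv"`, `{"delimiter": ";"}`} {
		s, err := resolveDialect(opt)
		fmt.Printf("%q %q %v\n", s.Delimiter, s.QuoteChar, err)
	}
}
```

Keeping the registry this small avoids having to invent and maintain dialect names that nobody else uses.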
> One thing is to figure out if we could still use the csv parser in the Go standard library or need to find another existing one or write one ourselves.

The more dialect aspects are supported, the more the danger of having to write your own CSV library. That's why I'd first limit implementation to compatibility with a subset of CSVD and CSVW.
> I'm very motivated but have not coded in Go yet (should be doable and happy to learn) so the "might speed things up" does not apply. So it depends :-) But data formats are my research topic and I heavily use jq so sooner or later I need to dig deeper into fq anyway.

Great, there is no hurry, it was more in case you wanted something fast :) I can help out with both Go and jq stuff. Maybe a possible route is that I start looking at it and see how much work it seems to be, possibly with some initial PR etc., and then we figure something out?
What kind of research are you doing? As a student, PhD etc.? Curious. And I'm of course happy to help out with other fq or format related things.
> Yes, but then you need to manage names of dialects. The only commonly agreed names I know are RFC 4180 and TSV (probably better as rfc4180 and tsv). Most people don't document their data formats on this level with names but just assume CSV as supported by the software library they happen to use (and end up with incompatible edge cases).

Aha, I see. But it's nice that both csvddf and csvw have default values, so a fq decoder could always use those as a quite safe fallback for properties that are not set?
Had no idea there were even efforts to standardize CSV like this, seems like a good idea, as it is quite confusing. I've had to explain at least a couple of times that "export it as CSV" is sadly not that straightforward :) I've also run into issues with numbers in CSV, such as which decimal symbol to use; that seems to be out of scope for csvddf and csvw?
> The more dialect aspects are supported, the more the danger of having to write your own CSV library. That's why I'd first limit implementation to compatibility with a subset of CSVD and CSVW.

Yes, true, good point. So maybe try to stick with the standard library csv reader/writer and see how far it can go?
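To get a feeling for how far the standard library reader goes, the properties discussed so far can be mapped onto `encoding/csv` roughly like this. The helper `readWithDialect` and its parameter list are hypothetical. `Reader.Comma`, `Reader.Comment` and `Reader.TrimLeadingSpace` are real fields; the quote character is not configurable in `encoding/csv`, so this sketch rejects anything other than `"`.

```go
package main

import (
	"encoding/csv"
	"errors"
	"fmt"
	"strings"
)

// readWithDialect parses input with the standard library reader.
// delimiter and comment map to Reader.Comma and Reader.Comment (0 disables
// comments), skipInitialSpace maps to Reader.TrimLeadingSpace. The quote
// character is fixed to '"' in encoding/csv, so anything else is rejected.
func readWithDialect(input string, delimiter, comment, quoteChar rune, skipInitialSpace bool) ([][]string, error) {
	if quoteChar != '"' {
		return nil, errors.New("encoding/csv only supports \" as quote character")
	}
	r := csv.NewReader(strings.NewReader(input))
	r.Comma = delimiter
	r.Comment = comment
	r.TrimLeadingSpace = skipInitialSpace
	return r.ReadAll()
}

func main() {
	// With comments disabled, the line starting with # is ordinary data.
	rows, err := readWithDialect("#a;b\n1; 2\n", ';', 0, '"', true)
	// prints: [[#a b] [1 2]] <nil>
	fmt.Println(rows, err)
}
```

Properties such as a custom quoteChar or escapeChar are where a different parser would be needed, which supports limiting the first implementation to a subset.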
> What kind of research are you doing? As a student, PhD etc.? Curious.

I did my PhD thesis on patterns in data formats some years ago and I manage a structured register of data formats (in German, with focus on bibliographic data).
> I did my PhD thesis on patterns in data formats some years ago and I manage a structured register of data formats (in German, with focus on bibliographic data).

Interesting, and the thesis looks like something I would like to have a look at.
As you might have noticed, fq currently does not support much when it comes to schemas or generic format description languages like Kaitai Struct. But I think it should be possible to add in some form, at least for decoding; encoding is a different kind of beast, at least for complex formats like mp4.
Did some research about good test suites, csvw seems to have one in a nice format: https://github.com/w3c/csvw/tree/gh-pages/tests