wader/fq

Extend support of CSV with CSV Dialect

nichtich opened this issue · 8 comments

First, thanks for this great work! CSV format can be troublesome because of its many dialects. If support of CSV is going to be extended, I recommend using the CSV Dialect Specification or a compatible subset of it. So far, fq supports two CSV Dialect properties, with differing names and defaults. An alternative is CSVW dialect description (thanks xkcd #972!). Here is a comparison:

| fq | default | csvddf | default | csvw | default |
|---|---|---|---|---|---|
| comma | , | delimiter | , | delimiter | , |
| comment | # | commentChar | not set | commentPrefix | null or # (spec is ambiguous) |

Names can be adjusted by aliases, I prefer short names anyway. Remaining properties found in csvddf and csvw are:

| csvddf | default | csvw | default |
|---|---|---|---|
| quoteChar | " | quoteChar | " |
| skipInitialSpace | false | skipInitialSpace | false |
| header | true | header | true |
| (being discussed) | | headerRowCount | 1 if header set else 0 |
| lineTerminator | \r\n | lineTerminators | ["\r\n", "\n"] |
| doubleQuote | true | | |
| | | doubleQuote | true |
| escapeChar | not set | | |
| nullSequence | not set | | |
| | | skipBlankRows | false |
| | | skipColumns | 0 |
| | | skipRows | 0 |
| | | encoding | utf-8 |
| | | trim | true |

Property doubleQuote differs in meaning between the two (csvw uses it to also set escapeChar). csvddf further has property caseSensitiveHeader (default false) but there are discussions to remove it.

More specifically, I'd first:

  • rename comma to delimiter and optionally keep comma as alias
  • change default of comment to null (most CSV in practice does not allow comments by default)
  • add quoteChar and skipInitialSpace as csvd and csvw agree on those
  • add header to automatically convert rows to objects when set (default false although csvd and csvw both have true as default but this can be discussed).

More support of CSV dialect requires at least someone with experience in actually working with messy CSV data (e.g. users of mr) because authors of standards tend to add features without common use cases.

wader commented

Hey, that is great and very helpful research, didn't know about any of those CSV dialect standards. I think your suggestions make sense to do. Is it something you would like to help out with coding-wise? Might speed things up a bit.

> rename comma to delimiter and optionally keep comma as alias

Yep, good idea. The only (not great) reason it's called comma now is because that is what it's called in the csv parser used at the moment https://pkg.go.dev/encoding/csv#Reader

> change default of comment to null (most CSV in practice does not allow comments by default)

Ok, so all lines will be treated as data?

> • add quoteChar and skipInitialSpace as csvd and csvw agree on those
> • add header to automatically [convert rows to objects](https://github.com/wader/fq/blob/master/doc/formats.md#convert-rows-to-objects-based-on-header-row) when set (default false although csvd and csvw both have true as default but this can be discussed).

👍 Could possibly also move the convert-to-object code into Go if doing it in jq is slow.
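For a sense of what the conversion could look like on the Go side, here is a small sketch. `RowsToObjects` is a hypothetical helper written for this comment, not existing fq code, and it assumes the first row is the header.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// RowsToObjects treats the first row as the header and turns every
// following row into a map keyed by column name.
// Hypothetical helper for illustration only.
func RowsToObjects(rows [][]string) []map[string]string {
	if len(rows) == 0 {
		return nil
	}
	header := rows[0]
	objs := make([]map[string]string, 0, len(rows)-1)
	for _, row := range rows[1:] {
		obj := make(map[string]string, len(header))
		for i, name := range header {
			if i < len(row) { // short rows simply lack the trailing keys
				obj[name] = row[i]
			}
		}
		objs = append(objs, obj)
	}
	return objs
}

func main() {
	r := csv.NewReader(strings.NewReader("id,name\n1,foo\n2,bar\n"))
	rows, err := r.ReadAll()
	if err != nil {
		panic(err)
	}
	for _, obj := range RowsToObjects(rows) {
		fmt.Println(obj["id"], obj["name"])
	}
}
```

Values stay strings here; duplicate header names and extra fields beyond the header are silently dropped, which a real decoder would probably want to report.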

Maybe the csv decoder could have a "dialect" option that is either a string that is the name of a dialect or an object with settings?

One thing is to figure out if we could still use the csv parser in the Go standard library or need to find another existing one or write one ourselves.

nichtich commented

> Is it something you would like to help out with coding-wise? Might speed things up a bit.

I'm very motivated but have not coded in Go yet (should be doable and happy to learn) so the "might speed things up" does not apply. So it depends :-) But data formats are my research topic and I heavily use jq so sooner or later I need to dig deeper into fq anyway.

> Ok, so all lines will be treated as data?

Yes, most CSV parsers don't enable comments by default.
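Go's `encoding/csv` works the same way: its `Comment` field is zero by default, which means no comment handling. A short sketch to illustrate (the `readAll` helper is made up for this example):

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// readAll parses input, treating lines that start with the given rune as
// comments. Passing 0 leaves comment handling switched off.
func readAll(input string, comment rune) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(input))
	r.Comment = comment
	return r.ReadAll()
}

func main() {
	in := "a,b\n#c,d\n"

	asData, err := readAll(in, 0) // "#c,d" is an ordinary record
	if err != nil {
		panic(err)
	}
	skipped, err := readAll(in, '#') // "#c,d" is dropped as a comment
	if err != nil {
		panic(err)
	}
	fmt.Println(len(asData), len(skipped))
}
```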

> Maybe the csv decoder could have a "dialect" option that is either a string that is the name of a dialect or an object with settings?

Yes, but then you need to manage names of dialects. The only commonly agreed names I know are RFC 4180 and TSV (probably better as rfc4180 and tsv). Most people don't document their data formats on this level with names but just assume CSV as supported by the software library they happen to use (and end up with incompatible edge cases).

> One thing is to figure out if we could still use the csv parser in the Go standard library or need to find another existing one or write one ourselves.

The more dialect aspects are supported, the more the danger of having to write your own CSV library. That's why I'd first limit implementation to compatibility with a subset of CSVD and CSVW.

wader commented

> I'm very motivated but have not coded in Go yet (should be doable and happy to learn) so the "might speed things up" does not apply. So it depends :-) But data formats are my research topic and I heavily use jq so sooner or later I need to dig deeper into fq anyway.

Great, there is no hurry, it was more in case you wanted something fast :) I can help out with both Go and jq stuff. Maybe a possible route is that I start looking at it and see how much work it seems to be, possibly with some initial PR etc., and then we figure something out?

What kind of research are you doing? as a student, phd etc? curious. And i'm of course happy to help out other fq or format related things.

> Yes, but then you need to manage names of dialects. The only commonly agreed names I know are RFC 4180 and TSV (probably better as rfc4180 and tsv). Most people don't document their data formats on this level with names but just assume CSV as supported by the software library they happen to use (and end up with incompatible edge cases).

Aha, I see. But it's nice that both csvddf and csvw have default values, so an fq decoder could always use those as a fairly safe fallback for properties that are not set?
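A fallback like that is cheap to get in Go: decoding JSON into a struct that is already filled with defaults only overwrites the keys that are present. A minimal sketch, assuming a hypothetical `Dialect` struct with the defaults that csvddf and csvw agree on in the tables above:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Dialect is a hypothetical subset of properties shared by csvddf and csvw.
type Dialect struct {
	Delimiter        string `json:"delimiter"`
	QuoteChar        string `json:"quoteChar"`
	SkipInitialSpace bool   `json:"skipInitialSpace"`
	Header           bool   `json:"header"`
}

// DefaultDialect returns the defaults on which both specs agree.
func DefaultDialect() Dialect {
	return Dialect{Delimiter: ",", QuoteChar: `"`, SkipInitialSpace: false, Header: true}
}

// ParseDialect starts from the defaults and lets the JSON override
// only the properties it actually mentions.
func ParseDialect(b []byte) (Dialect, error) {
	d := DefaultDialect()
	err := json.Unmarshal(b, &d)
	return d, err
}

func main() {
	d, err := ParseDialect([]byte(`{"delimiter":";"}`))
	if err != nil {
		panic(err)
	}
	// Delimiter comes from the input, everything else from the defaults.
	fmt.Printf("%+v\n", d)
}
```

This only works cleanly for properties where both specs share a default; for ones like comment handling a choice still has to be made.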

Had no idea there were even efforts to standardize CSV like this. Seems like a good idea; it is quite confusing. I've had to explain at least a couple of times that "export it as CSV" is sadly not that straightforward :) Also ran into issues with numbers in CSV, such as which decimal symbol to use; that seems to be out of scope for csvddf and csvw?

> The more dialect aspects are supported, the more the danger of having to write your own CSV library. That's why I'd first limit implementation to compatibility with a subset of CSVD and CSVW.

Yes, true, good point. So maybe try to stick with the standard library csv reader/writer and see how far it can go?

nichtich commented

> What kind of research are you doing? as a student, phd etc? curious.

I did my PhD thesis on patterns in data formats some years ago and I manage a structured register of data formats (in German, with focus on bibliographic data).

wader commented

> I did my PhD thesis on patterns in data formats some years ago and I manage a structured register of data formats (in German, with focus on bibliographic data).

Interesting, and the thesis looks like something I would like to have a look at.

As you might have noticed, fq currently does not support much when it comes to schemas or generic format description languages like Kaitai Struct. But I think it should be possible to add in some form, at least for decoding; encoding is a different kind of beast, at least for complex formats like mp4.

wader commented

Did some research about good test suites, csvw seems to have one in a nice format: https://github.com/w3c/csvw/tree/gh-pages/tests

wader commented

Did an initial PR to try some things out: #546, see comments there.