harelba/q

fixed-width columns

IzzySoft opened this issue · 8 comments

First, thanks for this very useful script – works fine even in MobaXterm out-of-the-box (used the single-file-executable)!

Unfortunately, I couldn't figure how to use it properly on "CSV" files using fixed-width columns instead of "single character delimiters". Would be great if that could be supported as well. I'd be fine if column width is determined by the header line for now (with the limitation that column names then must not contain spaces) instead of somehow "guessing correctly" (or specifying the width for each column).

Giving an example input file:

Foo     Bar        Baz                                              FooBaz
123     what       Whatever comes here might be longer              And something else
456     moo moo    Another text line                                no more ideas

(first line being the header).

bitti commented

This came up before. The solution is to pipe through a script first which replaces multiple spaces with a proper delimiter. E.g.

sed $'s/   */\t/g'

That indeed might work for many cases including my above example, but not necessarily for all. Assuming your RegEx works (rather looks like it would replace all spaces by tabs, not just multiple), what if a "cell" contains multiple spaces? Or if two columns "touch" (e.g. "123456789" could represent 3 columns with a width of 3 chars each), which is a common case with fixed-width formats?

Extending my example to include those cases:

Foo     Bar        Baz                                      FooBaz
123     what       Whatever comes here might be longer      And something else
456     moo moo    Another text line                        no more ideas
12345678pony       text    with    morespaces               "s/ {2,}/\t/g" breaks this
bitti commented

Exactly, "what if...?". What if even the column headers contain spaces? q can't possible support all cases, that's why it's expecting CSV files. If you don't have CSV, you need to ensure to properly convert it beforehand by whatever means. The sed line was just an example for that.

But maybe you're hammering the wrong problem, many tools have an option to provide machine-readable output, so you should look into that first.

Many tools have, but not all. Hence I was just suggesting this. If Harel thinks it doesn't fit in, I won't complain as I know there are differences between CSV, TSV (easily solved) on one side – and fixed-column-width on the other (requiring a different approach).

q uses SQLite under-the-hoods, and there are approaches to this in Python (e.g. here). Tricky thing is only the question of the length for the last column. And I already stated that, to automatically detect fields and field lengths for this, column headers cannot contain spaces (quote: "with the limitation that column names then must not contain spaces").

Thanks for your sed suggestion of course! If my previous response sounded offending, please accept my honest apologies.

bitti commented

No sweat, I wasn't offended. I just don't think it would fit in well because it would violate the unix principle of doing one thing well. And as you mention there are already approaches for problems like this, and probably better tools and scripts available to deal with these kind of transformations.

E.g. the GNU flavour of AWK has a FIELDWIDTHS variable: https://www.gnu.org/software/gawk/manual/html_node/Fixed-width-data.html. One way to use it for your example could be:

$ gawk 'BEGIN{OFS="|"} NR==1{ l=split($0, a, / +/, s); for (i=1; i<l; i++) f=f length(a[i])+length(s[i]) " "; FIELDWIDTHS=f " 100" } {print $1, $2, $3, $4}' test
Foo|Bar|Baz|FooBaz
123     |what       |Whatever comes here might be longer      |And something else
456     |moo moo    |Another text line                        |no more ideas
12345678|pony       |text    with    morespaces               |"s/ {2,}/\t/g" breaks this

I used | instead of tab as an output separator here to make it more visible. Making it more generic (i.e. variable number of columns, trimming whitespace and avoiding the hardcoded width of 100 for the last field) is left as an exercise for the reader.

Thanks a lot! I wasn't aware of that gawk feature – which indeed seems to fit (I'm gawking now 🤣). Need to check that out when I'm back at work. I wonder if gawk also might have a trim() feature to cleanup resulting cells…

The question of whether q should support fixed-width files is dependent on the actual definition of what q is.

If q deals with TSVs/CSVs, then obviously fixed-width files are not part of it. But if q deals with "implicit tabular data" then it should be a part of it.

In general, I view q as the latter, so I do see adding it to q as something which would add value to the users.

However, I'm not sure how this stands in terms of priority with relation to other new capabilities, as this conversion can be done using preprocessing (and also in a streaming fashion - this kind of conversion can be done using a piped approach into q).

If there's enough demand for it, I can create a mini-tool that will do such a streaming conversion, providing the capability to stream a fixed width file into q's stdin.

Thanks a lot, Harel! And yes, of course timing/prio are entirely up to you as well. Thanks to @bitti there is something that can be used as work-around meanwhile. So I'll wait patiently 😄