Summary bucketing to columns

Question

Summary bucketing to columns

Llammissar opened this issue 8 years ago · 2 comments

Ahaha, it keeps happening! ;) This time, in a much longer form.

I'm not sure what to call this precisely, so I'll describe the problem and see what you think:

Say we have as input a list of actions and their outcome:

action                success
frob                  true
tweak                 true
frob                  false
frob                  true
twiddle               true
tweak                 false
frob                  true

The goal is something like this:

action  succ    fail
frob    3       1
tweak   1       1
twiddle 1       0

The above was done with the following incantation:

tsv-join -w 0 --filter-file \
    <(tsv-filter -H --str-eq 2:"false" input.tsv | tsv-summarize -H --group-by 1 --count-header fail) \
    -k 1 \
    -a 2 \
    <(tsv-filter -H --str-eq 2:"true" input.tsv | tsv-summarize -H --group-by 1 --count-header succ)

This is pretty nasty: two subshells (with bonus bashisms), two scans of the input file, annoying and error-prone to edit... you can probably see why I'd like to improve this one.

Two ideas come to mind for how this might work:

A bucketing action. Something like this, perhaps:
tsv-summarize --group-by 1 --bucket 2
For the above input, I'd expect this to output:

action  success_true    success_false
frob    3               1
tweak   1               1
twiddle 1               0

i.e. each unique value in the group is counted and given a bucket.

Advantages:

It's relatively simple. Low overhead on the user and the column names can be cleaned up with little effort.
It actually works for any number of unique values in the bucketed column (e.g. if I had a third "baz" status, it'd give me a column for that too).

Disadvantages:

You do have to clean up the names. Given what we discussed before, I can't really think of any decent syntaxen for naming the columns in situ.
It may be too simple? This is just a hunch, though.
More feature-creep in tsv-summarize doesn't necessarily sit well with me.

Some mechanism for pivoting (possibly as a new tool?). Using existing tools, we can get the following:

$ tsv-summarize -H --group-by 1,2 --count input.tsv | sort
action  success count
frob    false   1
frob    true    3
tweak   false   1
tweak   true    1
twiddle true    1

In this case, there needs to be something to bridge the gap. Maybe something like this?
tsv-pivot --column 2 --fact sum:3

Output... probably the same as before.

Advantages:

General tool, functional for this scenario and for any other pivot operation
Can probably reuse a lot code, including the aggregates from tsv-summarize
More flexibility WRT syntax because it's a whole new tool.

Disadvantages:

It's a new tool in the suite, so hooray for effort and bugs. Do I really need more disadvantages at this point?
...Hahah, whoops, probably still have to clean up column names!

Answer 1 · 2017-02-09T22:25:33.000Z

Nice example. I need to read through your proposals in more detail. However, my initial thought is that what you mention in point 2 is the key. Using tsv-summarize -H --group-by 1,2 --count produces "long, narrow" data. What you are asking for is a "wide" data format. Both forms are commonly used in tools for statistics, machine learning, etc. I need to put my own data in those forms quite commonly, so I definitely want to support them.

tsv-append was created for creating long, narrow data. tsv-join is good for creating wide data sets. However, it can be cumbersome to use in certain circumstances and I expect to create a version more tailored to the task. I haven't spent much time thinking about how tsv-summarize fits into this, but figuring this out might help keep the tools self-consistent.

A couple references on long/narrow and wide data formats:

Answer 2 · 2017-02-09T22:49:13.000Z

Real quick before bed: after posting this, I discovered datamash (exists. Also that it) has cross-tabulation as a separate tool, so there's at least one toolset in the same domain treating it as a separate tool. Idly, I wonder if there's not room for both concepts to coexist?