Summary bucketing to columns
Llammissar opened this issue · 2 comments
Ahaha, it keeps happening! ;) This time, in a much longer form.
I'm not sure what to call this precisely, so I'll describe the problem and see what you think:
Say we have as input a list of actions and their outcome:
action success
frob true
tweak true
frob false
frob true
twiddle true
tweak false
frob true
The goal is something like this:
action succ fail
frob 3 1
tweak 1 1
twiddle 1 0
The above was done with the following incantation:
tsv-join -w 0 --filter-file \
<(tsv-filter -H --str-eq 2:"false" input.tsv | tsv-summarize -H --group-by 1 --count-header fail) \
-k 1 \
-a 2 \
<(tsv-filter -H --str-eq 2:"true" input.tsv | tsv-summarize -H --group-by 1 --count-header succ)
This is pretty nasty: two subshells (with bonus bashisms), two scans of the input file, annoying and error-prone to edit... you can probably see why I'd like to improve this one.
Two ideas come to mind for how this might work:
- A bucketing action. Something like this, perhaps:
tsv-summarize --group-by 1 --bucket 2
For the above input, I'd expect this to output:
action success_true success_false
frob 3 1
tweak 1 1
twiddle 1 0
i.e. each unique value in the group is counted and given a bucket.
Advantages:
- It's relatively simple. Low overhead on the user and the column names can be cleaned up with little effort.
- It actually works for any number of unique values in the bucketed column (e.g. if I had a third "baz" status, it'd give me a column for that too).
Disadvantages:
- You do have to clean up the names. Given what we discussed before, I can't really think of any decent syntaxen for naming the columns in situ.
- It may be too simple? This is just a hunch, though.
- More feature-creep in tsv-summarize doesn't necessarily sit well with me.
- Some mechanism for pivoting (possibly as a new tool?). Using existing tools, we can get the following:
$ tsv-summarize -H --group-by 1,2 --count input.tsv | sort
action success count
frob false 1
frob true 3
tweak false 1
tweak true 1
twiddle true 1
In this case, there needs to be something to bridge the gap. Maybe something like this?
tsv-pivot --column 2 --fact sum:3
Output... probably the same as before.
Advantages:
- General tool, functional for this scenario and for any other pivot operation
- Can probably reuse a lot code, including the aggregates from tsv-summarize
- More flexibility WRT syntax because it's a whole new tool.
Disadvantages:
- It's a new tool in the suite, so hooray for effort and bugs. Do I really need more disadvantages at this point?
- ...Hahah, whoops, probably still have to clean up column names!
Nice example. I need to read through your proposals in more detail. However, my initial thought is that what you mention in point 2 is the key. Using tsv-summarize -H --group-by 1,2 --count
produces "long, narrow" data. What you are asking for is a "wide" data format. Both forms are commonly used in tools for statistics, machine learning, etc. I need to put my own data in those forms quite commonly, so I definitely want to support them.
tsv-append
was created for creating long, narrow data. tsv-join
is good for creating wide data sets. However, it can be cumbersome to use in certain circumstances and I expect to create a version more tailored to the task. I haven't spent much time thinking about how tsv-summarize
fits into this, but figuring this out might help keep the tools self-consistent.
A couple references on long/narrow and wide data formats: