eBay/tsv-utils

Summary bucketing to columns

Llammissar opened this issue · 2 comments

Ahaha, it keeps happening! ;) This time, in a much longer form.

I'm not sure what to call this precisely, so I'll describe the problem and see what you think:

Say we have as input a list of actions and their outcome:

action                success
frob                  true
tweak                 true
frob                  false
frob                  true
twiddle               true
tweak                 false
frob                  true

The goal is something like this:

action  succ    fail
frob    3       1
tweak   1       1
twiddle 1       0

The above was done with the following incantation:

tsv-join -w 0 --filter-file \
    <(tsv-filter -H --str-eq 2:"false" input.tsv | tsv-summarize -H --group-by 1 --count-header fail) \
    -k 1 \
    -a 2 \
    <(tsv-filter -H --str-eq 2:"true" input.tsv | tsv-summarize -H --group-by 1 --count-header succ)

This is pretty nasty: two subshells (with bonus bashisms), two scans of the input file, annoying and error-prone to edit... you can probably see why I'd like to improve this one.

Two ideas come to mind for how this might work:

  1. A bucketing action. Something like this, perhaps:
    tsv-summarize --group-by 1 --bucket 2
    For the above input, I'd expect this to output:
action  success_true    success_false
frob    3               1
tweak   1               1
twiddle 1               0

i.e. each unique value in the group is counted and given a bucket.

Advantages:

  • It's relatively simple. Low overhead on the user and the column names can be cleaned up with little effort.
  • It actually works for any number of unique values in the bucketed column (e.g. if I had a third "baz" status, it'd give me a column for that too).

Disadvantages:

  • You do have to clean up the names. Given what we discussed before, I can't really think of any decent syntaxen for naming the columns in situ.
  • It may be too simple? This is just a hunch, though.
  • More feature-creep in tsv-summarize doesn't necessarily sit well with me.
  1. Some mechanism for pivoting (possibly as a new tool?). Using existing tools, we can get the following:
$ tsv-summarize -H --group-by 1,2 --count input.tsv | sort
action  success count
frob    false   1
frob    true    3
tweak   false   1
tweak   true    1
twiddle true    1

In this case, there needs to be something to bridge the gap. Maybe something like this?
tsv-pivot --column 2 --fact sum:3

Output... probably the same as before.

Advantages:

  • General tool, functional for this scenario and for any other pivot operation
  • Can probably reuse a lot code, including the aggregates from tsv-summarize
  • More flexibility WRT syntax because it's a whole new tool.

Disadvantages:

  • It's a new tool in the suite, so hooray for effort and bugs. Do I really need more disadvantages at this point?
  • ...Hahah, whoops, probably still have to clean up column names!

Nice example. I need to read through your proposals in more detail. However, my initial thought is that what you mention in point 2 is the key. Using tsv-summarize -H --group-by 1,2 --count produces "long, narrow" data. What you are asking for is a "wide" data format. Both forms are commonly used in tools for statistics, machine learning, etc. I need to put my own data in those forms quite commonly, so I definitely want to support them.

tsv-append was created for creating long, narrow data. tsv-join is good for creating wide data sets. However, it can be cumbersome to use in certain circumstances and I expect to create a version more tailored to the task. I haven't spent much time thinking about how tsv-summarize fits into this, but figuring this out might help keep the tools self-consistent.

A couple references on long/narrow and wide data formats: