eBay/tsv-utils

Add some way to split a field

Llammissar opened this issue · 2 comments

Another feature request that came to mind as I was working. Consider the following single column of data:

file
5core_05thread
5core_06thread
5core_07thread
5core_08thread

I ended up doing it in post-process, but I think it'd be handy to have some way to split fields so that it comes out like this:

cores  threads
5      5
5      6
5      7
5      8

Nice use case. My first thought is to wonder if there is enough commonality in these patterns to develop a tool around. More examples would shed light on this. But if it turns out that the flexibility of awk or sed is needed, then it might be best to leave these tasks to those tools and custom scripts.

That's a good point, and I'm not unsympathetic to it at all. If I hit more examples, I'll try to remember to outline them here.

I'll note up front that I really don't like sed/awk for this sort of thing because they're general-purpose, line-oriented tools. It's fine if there's something like "cores" to anchor on for extracting numbers and splitting them (and I think you rightly surmise that I wasn't necessarily looking to extract the column name in the same operation), but for the more general case? They're clunky; awareness of columns is extremely powerful and useful.

Just doodling here, but something like:
tsv-filter --split 1:_:cores,threads
...could be helpful. Or maybe something like regex substitution via capture groups:
tsv-filter --split 1:'([0-9]+)core_([0-9]+)thread':cores,threads
...if we continue looking at my original example. (The column selector is necessary for the more general case that you have multiple columns with the delimiter of interest -- colon, for example -- but you only want to split one of them and the other is something like a timestamp.)
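To make the intended semantics concrete, here is a minimal Python sketch of the capture-group variant. It is only an illustration of the idea, not an existing tsv-utils feature: the function name `split_field` and its arguments are hypothetical, the pattern is written to match the sample data (`5core_05thread`), and converting each captured group to an integer (so `05` becomes `5`) is an assumption based on the desired output above.

```python
import re

def split_field(rows, col, pattern, names):
    """Replace 1-based column `col` with one column per regex capture group.

    `rows` is a list of rows (lists of strings) with the header row first.
    Hypothetical helper for illustration; not part of tsv-utils.
    """
    regex = re.compile(pattern)
    if regex.groups != len(names):
        raise ValueError("need exactly one new column name per capture group")
    idx = col - 1
    header, *data = rows
    # The split column's header is replaced by the new column names.
    out = [header[:idx] + list(names) + header[idx + 1:]]
    for row in data:
        m = regex.fullmatch(row[idx])
        if m is None:
            raise ValueError(f"field {row[idx]!r} does not match {pattern!r}")
        # Assumption: captured groups are numeric, so strip leading zeros.
        parts = [str(int(g)) for g in m.groups()]
        out.append(row[:idx] + parts + row[idx + 1:])
    return out

rows = [["file"], ["5core_05thread"], ["5core_06thread"]]
result = split_field(rows, 1, r"([0-9]+)core_([0-9]+)thread", ["cores", "threads"])
for r in result:
    print("\t".join(r))
```

Because the column index is explicit, other columns that happen to contain the same delimiter (a timestamp, say) are passed through untouched.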

Broadly, I think I'd characterise this class of problem as "normalisation", which also includes other transformations on columns. (For example, some existing tools produce measures in whole seconds, so I want to multiply those by 1000, or divide the millisecond metrics by the same, so they can be compared properly. ...This might be a separate ER?)
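The unit-conversion case could look something like the sketch below. Again, this is just a doodle in Python under stated assumptions: `scale_field` is a hypothetical name, the column is assumed to hold whole-number seconds, and the sample rows are made up purely for illustration.

```python
def scale_field(rows, col, factor):
    """Multiply the integer values in 1-based column `col` by `factor`.

    `rows` is a list of rows (lists of strings) with the header row first.
    Hypothetical helper for illustration; not part of tsv-utils.
    """
    idx = col - 1
    header, *data = rows
    out = [list(header)]  # header is passed through unchanged
    for row in data:
        new_row = list(row)
        # Assumption: the field holds whole seconds, so int() is sufficient.
        new_row[idx] = str(int(row[idx]) * factor)
        out.append(new_row)
    return out

# Made-up sample data: elapsed time in whole seconds, converted to milliseconds.
rows = [["tool", "elapsed"], ["a", "3"], ["b", "12"]]
scaled = scale_field(rows, 2, 1000)
for r in scaled:
    print("\t".join(r))
```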