Multi-file Join
Llammissar opened this issue · 3 comments
Hey there, been a while. :)
I just tired to use tsv-join on a bunch of files and was very surprised when it applied the filter file to each of them in turn. Is there an approach that lets me nicely pull them all together that I've missed?
Why?: One of the really nice things about tsv-join
is, since you're working with a key, ordering doesn't matter and holes in the data can be mitigated. But it's clumsy to have to repeatedly join a whole pile of intermediate things to get an end result.
To clarify: with the following data files...
left
label v1
foo 1
bar 11
baz 111
centre
label v2
foo 2
bar 22
baz 222
right
label v3
foo 3
bar 33
baz 333
...I'd like to see something like this:
$ tsv-join -f left -k 1 left centre right
label v1 v2 v3
foo 1 2 3
bar 11 22 33
baz 111 222 333
Instead, that just prints each file in turn. (N.B. I reused the filter file as a data file because I want to preserve the column ordering.)
Trying this instead...
$ tsv-join -f right -k 1 -a 2 left centre
label v1 v3
foo 1 3
bar 11 33
baz 111 333
label v2 v3
foo 2 3
bar 22 33
baz 222 333
...closer. Sort of. (I guess?)
This is the best I've got right now and I think it should be reliable but I haven't tested it on something substantial:
$ tsv-select -f1 left > labels
$ for I in left centre right; do tsv-join -f left -k1 $I | tsv-select -f2 > intermediate-$I; done
$ paste labels intermediate-{left,centre,right}
label v1 v2 v3
foo 1 2 3
bar 11 22 33
baz 111 222 333
Hey, good to see you back! So, there is a way to do this:
$ tsv-join -f right -k 1 -a 2 centre | tsv-join -f - -k 1 left -a 2-3
label v1 v2 v3
foo 1 2 3
bar 11 22 33
baz 111 222 333
I agree though, the current tsv-join
APIs are clunky for this case. And it's a common enough case. Generally I'm pretty happy with the APIs of the TSV Utilities tools, but this one could be better. It's on my list to create an alternate API or perhaps a another tool that would make this simpler. Perhaps a tsv-join-files
tool, or something like that. I'm not sure when I'll get to it though.
Here's another way in the current tool. This will probably feel more natural than the first example I gave:
$ tsv-join left.tsv -k 1 -f centre.tsv -a 2 | tsv-join -f right.tsv -k 1 -a 2
label v1 v2 v3
foo 1 2 3
bar 11 22 33
baz 111 222 333
Both forms are doing similar things: join two files in the first tsv-join
call, passing the output to another tsv-join
call. Every subsequent tsv-join
adds one more file. The second version should be better because the files are added in the order you want them listed. And, the same column numbers are appended each time. In the first form the column numbers have to be adjusted each call.
The results of the two styles are the same if there are no duplicate keys. This is normally the case when doing this type of multi-file join operation.
The time duplicate keys matters is when you are using one file as a filter on another the other. In that case, the "filter file" (-f|--filter-file
) is being used as a filter on all the other files. The other files may have duplicate keys, the join doesn't care. (It's streaming join, so you can have an infinite length input.)
Ah, thanks for the tip. So generalising it looks something like this:
function rabid-join {
for I in $*; do
if [[ -z $SEED ]]; then
SEED=$I;
continue;
elif [[ -z $FORE ]]; then
FORE="tsv-join $SEED -k 1 -f $I -a 2";
printf "$FORE"
continue
else
printf " | tsv-join -f $I -k 1 -a 2"
fi
done
}
eval $(rabid-join left centre right more-right further-right absurdly-right)
Only problem I have is I can't seem to convince this to deal with the situation where a later file has keys the first one does not.