Cannot stream standard input to apply or apply_paste
I am trying to use wiggletools to average a signal over a set of bins defined in a BED file by piping in output from a previous command into wiggletools' stdin. Based on the EBNF grammar, it looks like the following command should work, but instead it fails claiming that it cannot open the "-" file:
bash$ cat my_signal.bg | wiggletools apply_paste out.txt meanI bins.bed -
Cannot open input file -
The command works as expected if the signal bedGraph is not streamed:
bash$ wiggletools apply_paste out.txt meanI bins.bed my_signal.bg
bash$ head out.txt
22 0 100 0.000000
22 100 200 0.000000
22 200 300 0.000000
22 300 400 0.000000
22 400 500 0.000000
22 500 600 0.000000
22 600 700 0.000000
22 700 800 0.000000
22 800 900 0.000000
22 900 1000 0.000000
The situation is the same using apply directly (both the error above and the correct output when the file is directly specified). There are other tools for the use case of averaging over a BED, but I'd like to be able to build more complicated computations with wiggletools.
Why isn't "-" recognized as an iterator of type in_filename? Perhaps I'm missing something obvious - any feedback would be greatly appreciated.
Hello @jluquette,
Without going into implementation details, this is a curious side effect of the apply_paste function, which triggers an unexpected exception when the last parameter is "-", i.e. standard input.
Sorry for the inconvenience,
Daniel
Thanks @dzerbino for the very quick response.
Is there a workaround? I've tried replacing the final "-" with an iterator that returns the original stream unmodified (e.g., scale 1 -), but that didn't work either. The apply function also has the same behavior, perhaps due to the same implementation quirk, so that won't solve my issue either, unfortunately.
Hello again,
I've given it some thought, and although the code can always be improved, supporting this would ultimately run into a design contradiction.
Fundamentally, the apply and apply_paste operators apply a statistical function (in this case meanI) to an input dataset (e.g. standard input) along regions of interest (bins.bed). The regions of interest can overlap or be unsorted, meaning that WiggleTools needs a way to move arbitrarily backwards and forwards through the input dataset. This is quite easy when the input dataset is a file or a file-based iterator, but standard input, being a stream, creates a complication. The obvious workaround would be to buffer the entirety of standard input, but this would create an open-ended memory liability that would break WiggleTools' memory-minimal design pattern.
In conclusion, the best workaround (which you have already found) is to save the input dataset to a file, then process that with WiggleTools, essentially using the file system as a buffer. I appreciate that there are circumstances where you are disk-space limited and this can be tricky, but by the same token, placing the burden on memory (generally speaking, a more limited resource) does not seem to me a sustainable solution. If you are constrained by write permissions, one possibility (on Linux) would be to write into the /tmp directory, which is set aside specifically for that kind of purpose.
Hope this helps,
Daniel
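For reference, the file-as-buffer workaround described above could be wrapped in a small shell helper. This is only a sketch under stated assumptions: with_spooled_stdin is a hypothetical name (not part of WiggleTools), and the .bg file name assumes that WiggleTools recognizes bedGraph input by its extension, as with my_signal.bg in the original report.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of WiggleTools): spool standard input into a
# temporary file, run the given command with that file's path appended as the
# last argument, then clean up and return the command's exit status.
with_spooled_stdin() {
    local dir status
    # A private temporary directory lets the file keep a fixed name; the .bg
    # extension is an assumption about how the input format is recognized.
    dir="$(mktemp -d "${TMPDIR:-/tmp}/spool.XXXXXX")" || return 1
    # Drain stdin to disk; give up (and clean up) if the write fails.
    cat > "$dir/signal.bg" || { rm -rf "$dir"; return 1; }
    # Run the wrapped command with the spooled file as its final argument.
    "$@" "$dir/signal.bg"
    status=$?
    rm -rf "$dir"
    return "$status"
}

# Usage, mirroring the command from the original report:
#   cat my_signal.bg | with_spooled_stdin wiggletools apply_paste out.txt meanI bins.bed
```

The temporary file lives only for the duration of the wrapped command, so the disk is used as the buffer and nothing is held in memory beyond what the pipe itself needs.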
Thanks for the explanation - that makes plenty of sense.
Perhaps it'd be worth mentioning in the documentation and/or pointing out the difference in the EBNF grammar? By the way, is there a more up-to-date documentation source than the GitHub README? Some things (like the cat reducer) aren't mentioned in there.
Really enjoying the tool - thanks for the great work.
The GitHub README is the only documentation. Indeed, there has been some drift in the documentation; I should review it.