Ensembl/WiggleTools

Cannot stream standard input to apply or apply_paste

I am trying to use wiggletools to average a signal over a set of bins defined in a BED file by piping the output of a previous command into wiggletools' stdin. Based on the EBNF grammar, it looks like the following command should work, but instead it fails, claiming that it cannot open the "-" file:

bash$ cat my_signal.bg | wiggletools apply_paste out.txt meanI bins.bed -
Cannot open input file -

The command works as expected if the signal bedGraph is not streamed:

bash$ wiggletools apply_paste out.txt meanI bins.bed my_signal.bg
bash$ head out.txt
22	0	100	0.000000
22	100	200	0.000000
22	200	300	0.000000
22	300	400	0.000000
22	400	500	0.000000
22	500	600	0.000000
22	600	700	0.000000
22	700	800	0.000000
22	800	900	0.000000
22	900	1000	0.000000

The situation is the same using apply directly (both the error above and the correct output when the file is directly specified). There are other tools for the use case of averaging over a BED, but I'd like to be able to build more complicated computations with wiggletools.

Why isn't "-" recognized as an iterator of type in_filename? Perhaps I'm missing something obvious - any feedback would be greatly appreciated.

Hello @jluquette ,

Without going into implementation details, this is a curious side effect of the apply_paste function, which triggers an unexpected exception when the last parameter is "-", i.e. standard input.

Sorry for the inconvenience,

Daniel

Thanks @dzerbino for the very quick response.

Is there a workaround? I've tried replacing the final - with an iterator that returns the original stream unmodified (e.g., scale 1 -), but that didn't work either. The apply function also has the same behavior, perhaps due to the same implementation quirk, so that won't solve my issue either unfortunately.

Hello again,

I've given it some thought, and although the code can always be improved, this would ultimately hit on a design contradiction.

Fundamentally, the apply and apply_paste operators apply a statistical function (in this case meanI) to an input dataset (e.g. standard input) along regions of interest (bins.bed). The regions of interest can overlap or be unsorted, meaning that WiggleTools needs a way to move arbitrarily backwards and forwards through the input dataset. This is quite easy when the input dataset is a file or a file-based iterator, but standard input, being a stream, creates a complication. The obvious workaround would be to buffer the entirety of standard input, but this would create an open-ended memory liability that would break WiggleTools' memory-minimal design pattern.

In conclusion, the best workaround (which you found out already) is to save the input dataset to a file, then process that file with WiggleTools, essentially using the file system as a buffer. I appreciate that there are circumstances where you are limited in disk space and this can be tricky, but by the same token placing the burden on memory (generally speaking a more limited resource) does not seem to me a sustainable solution. If you are constrained by write permissions, a possibility (on Linux) would be to write into the /tmp directory, which is set aside specifically for that kind of purpose.
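For illustration, the file-as-buffer workaround could be wrapped in a small bash helper along these lines. This is only a sketch: spool_stdin_bg is a hypothetical name, not part of WiggleTools, and it assumes the streamed data is bedGraph and that WiggleTools infers the format from the .bg suffix, which is why the temporary file is created inside a mktemp directory with a fixed name. mktemp -d normally places the directory under $TMPDIR or /tmp.

```shell
# Hypothetical helper (not part of WiggleTools): spool standard input into a
# temporary .bg file, then run the given command with that file's path
# appended as its last argument.
spool_stdin_bg() {
    local tmp_dir status
    # Create a private scratch directory (under $TMPDIR or /tmp by default).
    tmp_dir="$(mktemp -d)" || return 1
    # Buffer the whole stream on disk; clean up and stop if the write fails.
    if ! cat > "$tmp_dir/signal.bg"; then
        rm -rf "$tmp_dir"
        return 1
    fi
    # Run the wrapped command on the now seekable file and keep its exit code.
    "$@" "$tmp_dir/signal.bg"
    status=$?
    rm -rf "$tmp_dir"
    return "$status"
}

# Example use, assuming wiggletools is on PATH:
#   cat my_signal.bg | spool_stdin_bg wiggletools apply_paste out.txt meanI bins.bed
```

The temporary file is removed as soon as the wrapped command returns, so the disk cost lasts only for the duration of the run.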

Hope this helps,

Daniel

Thanks for the explanation - that makes plenty of sense.

Perhaps it'd be worth mentioning in the documentation and/or pointing out the difference in the EBNF grammar? By the way, is there a more up-to-date documentation source than the GitHub README? Some things (like the cat reducer) aren't mentioned in there.

Really enjoying the tool - thanks for the great work.

The GitHub README is the only documentation. Indeed, there has been some drift in the documentation, and I should review it.