Ensembl/WiggleTools

gte not working and compression

gevro opened this issue · 8 comments

gevro commented

Hi,
I tested 'gte' and it isn't working. Example below. I'm guessing lte might have the same issue.

Note also that gt/gte/lt/lte are still causing compression with write_bg. I think that is useful in some cases, but not desirable in other cases. So I think it would be good to have it as an option: either with or without compression. Maybe:

  • write_bg is compressed, and
  • write_bgu is uncompressed
wiggletools write_bg - bin 20 scale 0.05 fillIn genome.bed blah.bw | head
chr1	0	20	0.005435
chr1	20	40	0.000000
chr1	40	60	0.000000
chr1	60	80	0.000000
chr1	80	100	0.000000
chr1	100	120	0.000000
chr1	120	140	0.000000
chr1	140	160	0.005435
chr1	160	180	0.000000
chr1	180	200	0.000000
$ wiggletools write_bg - gte 0.005435 bin 20 scale 0.05 fillIn genome.bed blah.bw | head
chr1	280	400	1.000000
chr1	420	1940	1.000000
chr1	1960	2020	1.000000
chr1	2040	2280	1.000000
chr1	2300	2400	1.000000
chr1	2420	2440	1.000000
chr1	2460	2480	1.000000
chr1	2500	2540	1.000000
chr1	2560	2600	1.000000
chr1	2640	2660	1.000000
$ wiggletools write_bg - gte 0.005434 bin 20 scale 0.05 fillIn genome.bed blah.bw | head
chr1	0	20	1.000000
chr1	140	160	1.000000
chr1	260	2280	1.000000
chr1	2300	2440	1.000000
chr1	2460	2540	1.000000
chr1	2560	2620	1.000000
chr1	2640	2880	1.000000
chr1	2900	2920	1.000000
chr1	2960	3000	1.000000
chr1	3040	3060	1.000000

Hello @gevro ,

Without the raw files, I was not able to reproduce your bug, however it is possible that the values at chr1:0-20 and chr1:140-160 were just below the cutoff (0.005435) and rounded up to 0.005435 when printing.

Also, I don't understand what you mean by compression. The gte operator is returning all the regions that have a score greater or equal to 0.005435 as advertised. If you also want the regions with 0, I would recommend re-ordering the command as such (note that the binning and filling in happens twice):

wiggletools write_bg - bin 20 fillIn genome.bed gte 0.005435 bin 20 fillIn genome.bed scale 0.05 blah.bw

HTH,

Daniel

gevro commented

Hi @dzerbino, Thanks. Here is what I mean by the gte bug and by compression:

  1. gte:

In this example, there is a bin "chr1 0 20 0.005435" that should match the "gte 0.005435" filter:

wiggletools write_bg - bin 20 scale 0.05 fillIn genome.bed blah.bw | head
chr1	0	20	0.005435
chr1	20	40	0.000000
chr1	40	60	0.000000
chr1	60	80	0.000000
chr1	80	100	0.000000
chr1	100	120	0.000000
chr1	120	140	0.000000
chr1	140	160	0.005435
chr1	160	180	0.000000
chr1	180	200	0.000000

However, when adding that filter, that bin is missing:

$ wiggletools write_bg - gte 0.005435 bin 20 scale 0.05 fillIn genome.bed blah.bw | head
chr1	280	400	1.000000
chr1	420	1940	1.000000
chr1	1960	2020	1.000000
chr1	2040	2280	1.000000

Because the gte filter is applied after "bin 20 scale 0.05", it should be a simple filter that keeps any bin with that value or larger. There technically shouldn't be any issues of rounding, because the filter is applied after the bin 20 scale 0.05 calculations. I think more intuitive behavior would be to apply the filters after rounding? Just an idea. But perhaps the current behavior is more desirable for accuracy, but if so, documentation should specify that the threshold specified should be made slightly different (-.0001 or so, for gte, and conversely for lte) in order for it to work properly.

  1. Compressed vs uncompressed: In the below example, there is a bin "chr1 260 2280 1.0000". However, it would be very useful to have uncompressed output after gt/gte/lt/lte, where every bin that passes the filter is listed. For example:
    chr1 260 280 1.000
    chr1 280 300 1.000
    chr1 300 320 1.000
    etc.
$ wiggletools write_bg - gte 0.005434 bin 20 scale 0.05 fillIn genome.bed blah.bw | head
chr1	0	20	1.000000
chr1	140	160	1.000000
chr1	260	2280	1.000000
chr1	2300	2440	1.000000
chr1	2460	2540	1.000000

Hello @gevro ,

  1. The values are rounded just before they are printed into output (with 6 digits past the decimal dot). Therefore it is possible that a block has a value of 0.0054348, which is below 0.005435, but is rounded up to 0.005435 when printed out.

  2. I'm not getting the same results as you...

noah-login-02.ebi.ac.uk>  wiggletools write_bg - gte 0.005435 bin 20 scale 0.05 fillIn genome.bed blah.bw | head
chr1	109340	110360	1.000000

I think this is related to Issue #62

Cheers,

Daniel

gevro commented

Hi Daniel,
Regarding point #2, that is interesting. There is definitely some issue here. Because the data in that region looks like this:

$wiggletools write_bg - bin 20 scale 0.05 fillIn genome.bed blah.bw
chr1    109320  109340  0.000000
chr1    109340  109360  0.569123
chr1    109360  109380  0.900428
chr1    109380  109400  0.914686
chr1    109400  109420  0.916929
chr1    109420  109440  0.908535
chr1    109440  109460  0.931414
chr1    109460  109480  0.938542
chr1    109480  109500  0.935714
chr1    109500  109520  0.915357
chr1    109520  109540  0.867500
chr1    109540  109560  0.869167
chr1    109560  109580  0.827153
chr1    109580  109600  0.781641
chr1    109600  109620  0.795388
chr1    109620  109640  0.826795

But gte 0.005435 only gives one bin as the result:

$ wiggletools write_bg - gte 0.005435 bin 20 scale 0.05 fillIn genome.bed blah.bw
chr1	109340	110360	1.000000

...which is wrong, because there are other bins that are > 0.005435

So perhaps it is related to issue #62.

Hello @gevro ,

Actually, you are right, there was an implicit compression system which I had forgotten about, and which I now cleared up.

Could you please try again on master?

Cheers,

Daniel

gevro commented

Thanks. Is it possible to add write_bg with an option for either compressed or uncompressed? Maybe with a flag? Or write_bgu versus write_bgc ?

There are use cases for both options, so it would be very useful to have both.

I have just now implemented a new "compress" keyword on master (cf README).

Let me know,

Daniel

gevro commented

Thanks, works.