gte not working and compression
gevro opened this issue · 8 comments
Hi,
I tested 'gte' and it isn't working. Example below. I'm guessing lte might have the same issue.
Note also that gt/gte/lt/lte are still causing compression with write_bg. I think that is useful in some cases, but not desirable in other cases. So I think it would be good to have it as an option: either with or without compression. Maybe:
- write_bg is compressed, and
- write_bgu is uncompressed
wiggletools write_bg - bin 20 scale 0.05 fillIn genome.bed blah.bw | head
chr1 0 20 0.005435
chr1 20 40 0.000000
chr1 40 60 0.000000
chr1 60 80 0.000000
chr1 80 100 0.000000
chr1 100 120 0.000000
chr1 120 140 0.000000
chr1 140 160 0.005435
chr1 160 180 0.000000
chr1 180 200 0.000000
$ wiggletools write_bg - gte 0.005435 bin 20 scale 0.05 fillIn genome.bed blah.bw | head
chr1 280 400 1.000000
chr1 420 1940 1.000000
chr1 1960 2020 1.000000
chr1 2040 2280 1.000000
chr1 2300 2400 1.000000
chr1 2420 2440 1.000000
chr1 2460 2480 1.000000
chr1 2500 2540 1.000000
chr1 2560 2600 1.000000
chr1 2640 2660 1.000000
$ wiggletools write_bg - gte 0.005434 bin 20 scale 0.05 fillIn genome.bed blah.bw | head
chr1 0 20 1.000000
chr1 140 160 1.000000
chr1 260 2280 1.000000
chr1 2300 2440 1.000000
chr1 2460 2540 1.000000
chr1 2560 2620 1.000000
chr1 2640 2880 1.000000
chr1 2900 2920 1.000000
chr1 2960 3000 1.000000
chr1 3040 3060 1.000000
Hello @gevro ,
Without the raw files, I was not able to reproduce your bug, however it is possible that the values at chr1:0-20 and chr1:140-160 were just below the cutoff (0.005435) and rounded up to 0.005435 when printing.
Also, I don't understand what you mean by compression. The gte
operator is returning all the regions that have a score greater or equal to 0.005435 as advertised. If you also want the regions with 0, I would recommend re-ordering the command as such (note that the binning and filling in happens twice):
wiggletools write_bg - bin 20 fillIn genome.bed gte 0.005435 bin 20 fillIn genome.bed scale 0.05 blah.bw
HTH,
Daniel
Hi @dzerbino, Thanks. Here is what I mean by the gte bug and by compression:
- gte:
In this example, there is a bin "chr1 0 20 0.005435" that should match the "gte 0.005435" filter:
wiggletools write_bg - bin 20 scale 0.05 fillIn genome.bed blah.bw | head
chr1 0 20 0.005435
chr1 20 40 0.000000
chr1 40 60 0.000000
chr1 60 80 0.000000
chr1 80 100 0.000000
chr1 100 120 0.000000
chr1 120 140 0.000000
chr1 140 160 0.005435
chr1 160 180 0.000000
chr1 180 200 0.000000
However, when adding that filter, that bin is missing:
$ wiggletools write_bg - gte 0.005435 bin 20 scale 0.05 fillIn genome.bed blah.bw | head
chr1 280 400 1.000000
chr1 420 1940 1.000000
chr1 1960 2020 1.000000
chr1 2040 2280 1.000000
Because the gte filter is applied after "bin 20 scale 0.05", it should be a simple filter that keeps any bin with that value or larger. There technically shouldn't be any issues of rounding, because the filter is applied after the bin 20 scale 0.05 calculations. I think more intuitive behavior would be to apply the filters after rounding? Just an idea. But perhaps the current behavior is more desirable for accuracy, but if so, documentation should specify that the threshold specified should be made slightly different (-.0001 or so, for gte, and conversely for lte) in order for it to work properly.
- Compressed vs uncompressed: In the below example, there is a bin "chr1 260 2280 1.0000". However, it would be very useful to have uncompressed output after gt/gte/lt/lte, where every bin that passes the filter is listed. For example:
chr1 260 280 1.000
chr1 280 300 1.000
chr1 300 320 1.000
etc.
$ wiggletools write_bg - gte 0.005434 bin 20 scale 0.05 fillIn genome.bed blah.bw | head
chr1 0 20 1.000000
chr1 140 160 1.000000
chr1 260 2280 1.000000
chr1 2300 2440 1.000000
chr1 2460 2540 1.000000
Hello @gevro ,
-
The values are rounded just before they are printed into output (with 6 digits past the decimal dot). Therefore it is possible that a block has a value of 0.0054348, which is below 0.005435, but is rounded up to 0.005435 when printed out.
-
I'm not getting the same results as you...
noah-login-02.ebi.ac.uk> wiggletools write_bg - gte 0.005435 bin 20 scale 0.05 fillIn genome.bed blah.bw | head
chr1 109340 110360 1.000000
I think this is related to Issue #62
Cheers,
Daniel
Hi Daniel,
Regarding point #2, that is interesting. There is definitely some issue here. Because the data in that region looks like this:
$wiggletools write_bg - bin 20 scale 0.05 fillIn genome.bed blah.bw
chr1 109320 109340 0.000000
chr1 109340 109360 0.569123
chr1 109360 109380 0.900428
chr1 109380 109400 0.914686
chr1 109400 109420 0.916929
chr1 109420 109440 0.908535
chr1 109440 109460 0.931414
chr1 109460 109480 0.938542
chr1 109480 109500 0.935714
chr1 109500 109520 0.915357
chr1 109520 109540 0.867500
chr1 109540 109560 0.869167
chr1 109560 109580 0.827153
chr1 109580 109600 0.781641
chr1 109600 109620 0.795388
chr1 109620 109640 0.826795
But gte 0.005435 only gives one bin as the result:
$ wiggletools write_bg - gte 0.005435 bin 20 scale 0.05 fillIn genome.bed blah.bw
chr1 109340 110360 1.000000
...which is wrong, because there are other bins that are > 0.005435
So perhaps it is related to issue #62.
Hello @gevro ,
Actually, you are right, there was an implicit compression system which I had forgotten about, and which I now cleared up.
Could you please try again on master?
Cheers,
Daniel
Thanks. Is it possible to add write_bg with an option for either compressed or uncompressed? Maybe with a flag? Or write_bgu versus write_bgc ?
There are use cases for both options, so it would be very useful to have both.
I have just now implemented a new "compress" keyword on master (cf README).
Let me know,
Daniel
Thanks, works.