deeptools/pyBigWig

wrong 'stats' output?

francois-a opened this issue · 11 comments

Thanks for developing pyBigWig, this is a very useful module. I've noticed some instances where 'stats' returns an incorrect value (compared with taking the mean of the values returned by 'values'). Here's an example, using an ENCODE mappability track:
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeCrgMapabilityAlign75mer.bigWig

bw = pyBigWig('wgEncodeMapability/wgEncodeCrgMapabilityAlign75mer.bigWig')
mean(bw.values('chr1', 89294, 91629))
>> 0.22213842662134081
bw.stats('chr1', 89294, 91629)
>> [0.1902993809998163]

Thanks in advance for looking into this.

This is one of the weird thing about bigWig files. Statistics often don't function on the actual values, but on values in a zoom level. If you use bigWigSummary from the command line, what sort of value do you get?

I guess I can just check myself:

$ bigWigSummary http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeCrgMapabilityAlign75mer.bigWig chr1 89294 91629 1
0.201209

And in python:

>>> import pyBigWig
>>> bw = pyBigWig.open("http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeCrgMapabilityAlign75mer.bigWig")
>>> bw.stats('chr1', 89294, 91629)
[0.20120902053804418]

So I get the same values for both.

Stumbled upon the same - I think the expected behavior from a naive user is that if I ask for an array of 100 basepairs from a bigwig, the stats command should give me the mean of those 100 basepairs and not anything else dependant on zoom levels. It should at least be made very clear in the documentation.

For now I just ask for the array and wrap it in numpy.mean, no real loss of performance.

Real nice library aside from that, 10 times the performance I was getting with bx-python :)

I'll definitely update the documentation for that, since I too found this weird when I wrote libBigWig. I've actually been meaning to create an exact_stats() function, which would do what you want. Given how libBigWig is implemented that should be fairly straight forward. I'll leave this open as a reminder to do that!

BTW, if you have a better name in mind then just let me know. I could alternatively just add an option to the current stats function to do this (whatever others find the most useful is fine by me).

I have an exact_stats branch that adds the exact option to the stats command. That needs some more testing (and documentation), but should hopefully handle this. The invocation would be:

bw.stats("chr1", 1, 1000000, exact=True)

Thanks for the quick response and clarifications (I indeed expected the naive behavior) and the patch. I ran a few tests and it works fine for me. This runs significantly faster (3-4x) than wrapping with np.mean, a nice improvement when running over many regions.

New branch working for me too - 2-3 times faster than wrapping with np.mean, thanks a lot!

I've pushed a small change that should increase the precision of this a bit (I'd be surprised if there's any speed change). I've also added a section to the readme. I'll continue mulling this over for a day or two and then merge into the master branch and make a new release if there's nothing new that needs to be added.

I'm hoping to release version 0.2.6 today (assuming Travis CI starts working again), which will include this.

This is now release 0.2.6. I'll try to get this release on pypi and in conda tonight. Thanks for the suggestion.