brentp/combined-pvalues

sidak correction in region simulations.

brentp opened this issue · 3 comments

There is no multiple-testing correction on the p-values for the regions.
rpsim can do the sidak correction correctly because it knows
the entire region length (sum of coverage in -p argument)

so sidak is

sidak_p = 1-(1 - p)^k

where k is the number of possible regions of that length:

k = total_coverage_bp / region_length_bp

total_coverage_bp is the total number of bases covered by region to -p.
region_length_bp is simply the end - start for the region.
Probably do this as:

coverage = collections.defaultdict(set)

for chrom, start, end in regions:
    coverage[chrom].update(range(start, end))

total_coverage_bp = sum(len(rset) for rset in coverage.itervalues())

Since they are sorted, can also use less memory using groupby on chrom
and calculating the coverage per-chrom...

In checking real data, it's clear that if --threshold is much higher than --seed, then the ends of the region will have high p-values. This lowers the overall significance. It might be good to trim the regions in peaks.py to the edge of the last value > --seed.

Much of this is implemented as of: 32b4692

Done.