sidak correction in region simulations.
brentp opened this issue · 3 comments
There is no multiple-testing correction on the p-values for the regions.
rpsim can do the sidak correction correctly because it knows
the entire region length (sum of coverage in -p argument)
so sidak is
sidak_p = 1-(1 - p)^k
where k is the number of possible regions of that length:
k = total_coverage_bp / region_length_bp
total_coverage_bp is the total number of bases covered by region to -p.
region_length_bp is simply the end - start
for the region.
Probably do this as:
coverage = collections.defaultdict(set)
for chrom, start, end in regions:
coverage[chrom].update(range(start, end))
total_coverage_bp = sum(len(rset) for rset in coverage.itervalues())
Since they are sorted, can also use less memory using groupby on chrom
and calculating the coverage per-chrom...
In checking real data, it's clear that if --threshold is much higher than --seed, then the ends of the region will have high p-values. This lowers the overall significance. It might be good to trim the regions in peaks.py to the edge of the last value > --seed.
Done.