
Bayesian Allocation of Differential Resamples (BayDAR)

Primary Language: C++

Bayesian Allocation for Differential Resampling


Genomic studies often involve multiple-hypothesis testing at large scale (on the order of 1E+06 tests). When analytical formulas are available, the main problem is choosing an appropriate threshold for rejecting the null hypotheses while controlling the family-wise error rate (FWER) or the false discovery rate (FDR). For some test statistics, however, no analytical formula is available, or the underlying model assumptions are unrealistic or unmet (e.g. small sample sizes or non-Gaussian data). In these cases a resampling-based test is desirable, but it can be computationally infeasible: rejecting H0 at a significance threshold of 1E-06, which is common in GWAS, takes on the order of 1 million resamples per test, so 1 million SNPs would require at least 1 trillion resamples.
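The arithmetic behind the 1 trillion figure can be written out as a small sketch. It assumes the p-value is estimated as r / B (exceedances over resamples), so the smallest nonzero estimate is 1 / B. The function names are illustrative and are not taken from this repository.

```cpp
#include <cstdint>

// Assumption: the p-value estimate has the form r / B, so its smallest nonzero
// value is 1 / B. Resolving a threshold alpha = 1 / inv_alpha therefore needs
// at least inv_alpha resamples. The threshold is passed as its integer
// reciprocal (1000000 for alpha = 1E-06) to avoid floating-point rounding.
constexpr std::uint64_t min_resamples_per_test(std::uint64_t inv_alpha) {
    return inv_alpha;
}

// Uniform allocation: every test receives the same number of resamples.
constexpr std::uint64_t uniform_total_resamples(std::uint64_t inv_alpha,
                                                std::uint64_t n_tests) {
    return min_resamples_per_test(inv_alpha) * n_tests;
}
```

With `inv_alpha = 1000000` and `n_tests = 1000000`, the uniform total is 1E+12 resamples. That budget is what a differential allocation tries to avoid spending evenly.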


To avoid this pitfall, Wang et al. proposed a Bayesian scheme for the differential allocation of resamples. The algorithm is described in detail in the paper, but it can be succinctly described as follows:

"...we use a Bayesian-inspired approach that assigns resamples to each unit based on its individual risk, the chance that the current p-value estimate leads to a misclassification of the unit. The goal is to lower the numbers of classification errors, since we are giving a higher resolution to the null distribution of genes that are more likely to be misclassified in a uniform allocation setting. This higher resolution comes at the sacrifice of resamples to non-borderline genes that should not need a very resolute null distribution for correct inference."

The intuition for the risk of misclassification can be visually summarized:

Here, two densities are shown. The red density corresponds to a scenario where the true p-value, p1, is close to p0, so more of the area under the curve is shaded; the blue density corresponds to a scenario where the true p-value, p2, is farther from p0, so less of the area is shaded. The shaded areas visualize the risk of misclassification, and the number of resamples allocated to p1 and to p2 is proportional to the respective shaded area.
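One way to make the shaded-area intuition concrete is the sketch below. It is not the repository's implementation, and it rests on assumptions that the text above does not state. It assumes a uniform prior, so that after k exceedances in n resamples the true p-value has a Beta(k + 1, n - k + 1) posterior. It also takes the risk to be the posterior mass on the opposite side of p0 from the current call. The names `misclassification_risk` and `allocate_by_risk` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Monte Carlo estimate of the misclassification risk for one unit.
// Assumption: uniform prior, so the posterior of the true p-value after
// k exceedances in n resamples is Beta(k + 1, n - k + 1). draws must be > 0.
double misclassification_risk(std::uint64_t k, std::uint64_t n, double p0,
                              std::mt19937_64& rng, std::size_t draws = 20000) {
    std::gamma_distribution<double> ga(static_cast<double>(k) + 1.0, 1.0);
    std::gamma_distribution<double> gb(static_cast<double>(n - k) + 1.0, 1.0);
    // Current call: the unit is significant if its estimate k / n is <= p0.
    const bool called_significant =
        static_cast<double>(k) <= p0 * static_cast<double>(n);
    std::size_t opposite = 0;
    for (std::size_t i = 0; i < draws; ++i) {
        const double x = ga(rng);
        const double y = gb(rng);
        const double p = x / (x + y);  // one Beta(k + 1, n - k + 1) draw
        if ((p <= p0) != called_significant) ++opposite;
    }
    return static_cast<double>(opposite) / static_cast<double>(draws);
}

// Split a resample budget across units in proportion to their risks.
// Shares are rounded down, so the allocated total never exceeds the budget.
std::vector<std::uint64_t> allocate_by_risk(const std::vector<double>& risks,
                                            std::uint64_t budget) {
    std::vector<std::uint64_t> out(risks.size(), 0);
    if (risks.empty()) return out;
    double total = 0.0;
    for (double r : risks) total += r;
    for (std::size_t i = 0; i < risks.size(); ++i) {
        // Fall back to a uniform split when no unit carries any risk.
        const double share = total > 0.0
            ? risks[i] / total
            : 1.0 / static_cast<double>(risks.size());
        out[i] = static_cast<std::uint64_t>(share * static_cast<double>(budget));
    }
    return out;
}
```

Under these assumptions, a unit with 4 exceedances in 100 resamples and p0 = 0.05 carries substantial risk. A unit with 50 exceedances in 100 carries essentially none, so nearly the whole next batch of resamples goes to the borderline unit.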



The code available in this repository can be used to apply the BayDAR algorithm to a matrix in which each row is an observation, together with a user-provided function that calculates the test statistic.
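As an illustration of how a matrix and a user-provided statistic can fit together, here is a minimal sketch of a permutation resampler. It is not this repository's actual interface. It assumes that rows are observations, that each column is one tested unit, and that every observation carries a 0/1 group label. The names `Matrix`, `Statistic`, `abs_mean_difference` and `count_exceedances` are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // rows are observations
using Statistic =
    std::function<double(const std::vector<double>&, const std::vector<int>&)>;

// Example user-provided statistic: absolute difference in group means.
double abs_mean_difference(const std::vector<double>& values,
                           const std::vector<int>& labels) {
    double sum[2] = {0.0, 0.0};
    std::size_t count[2] = {0, 0};
    for (std::size_t i = 0; i < values.size(); ++i) {
        const int g = labels[i] != 0 ? 1 : 0;
        sum[g] += values[i];
        ++count[g];
    }
    if (count[0] == 0 || count[1] == 0) return 0.0;
    return std::fabs(sum[1] / count[1] - sum[0] / count[0]);
}

// Draw n_resamples label permutations for one column and count how often the
// permuted statistic is at least as large as the observed one.
std::size_t count_exceedances(const Matrix& data, std::size_t column,
                              std::vector<int> labels, const Statistic& stat,
                              std::size_t n_resamples, std::mt19937_64& rng) {
    std::vector<double> values;
    values.reserve(data.size());
    for (const auto& row : data) values.push_back(row.at(column));
    const double observed = stat(values, labels);
    std::size_t exceed = 0;
    for (std::size_t b = 0; b < n_resamples; ++b) {
        std::shuffle(labels.begin(), labels.end(), rng);  // labels is a local copy
        if (stat(values, labels) >= observed) ++exceed;
    }
    return exceed;
}
```

The returned count divided by `n_resamples` is the empirical p-value estimate for that column. A differential scheme would call such a routine repeatedly, with a different `n_resamples` for each column.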
