OSCA-source/OSCA.multisample

Various Approaches to Independent Filtering

Issue closed · 1 comment

Another typical step in bulk RNA-seq analyses is to remove genes that are lowly expressed ... decreases the severity of the multiple testing correction.

What about keeping only the top-p highly variable genes instead? That would also remove housekeeping genes and reduce the multiple testing penalty even further.

LTLA commented

I don't think filtering on the variance is going to be independent of the p-value. I can't remember whether I ever ran simulations on this, but I would guess that this strategy pushes the null distribution of p-values towards the edges, i.e., towards 0 or 1. This is because you enrich for genes that are either highly variable between replicates (p-values towards 1) or the occasional gene with strong spurious DE (p-values towards 0), while getting rid of everything in between.

A more sophisticated approach would be to filter on the residual variance, but I don't think this would be a good idea either. You would be enriching for genes that have low variances by chance, causing a systematic underestimation of the variance and thus an enrichment of low p-values under the null. It would also interfere with all the distributional assumptions in empirical Bayes. (These arguments should also apply to the filter on the sample variance.)
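The residual-variance argument can be checked with a small simulation. The following is only a minimal sketch under assumptions that are not from the discussion above: normally distributed expression values with the same true variance for every gene, two replicates per group, an ordinary two-sample t-test (no empirical Bayes shrinkage), and a filter that keeps the 10% of genes with the lowest residual variance. All function names are hypothetical. Two replicates per group are used because the t-distribution on 2 degrees of freedom has a closed-form CDF, so no external libraries are needed.

```python
import math
import random


def simulate_null_genes(n_genes=20000, seed=0):
    """Simulate genes with no true DE, 2 replicates in each of 2 groups.

    Returns a list of (p_value, residual_variance) tuples.
    """
    rng = random.Random(seed)
    genes = []
    for _ in range(n_genes):
        a = [rng.gauss(0.0, 1.0) for _ in range(2)]  # group A replicates
        b = [rng.gauss(0.0, 1.0) for _ in range(2)]  # group B replicates
        mean_a, mean_b = sum(a) / 2, sum(b) / 2
        # Pooled residual variance on 2 degrees of freedom.
        resid_ss = sum((x - mean_a) ** 2 for x in a)
        resid_ss += sum((x - mean_b) ** 2 for x in b)
        resid_var = resid_ss / 2
        # Two-sample t-statistic; the standard error of the difference
        # in means is sqrt(resid_var * (1/2 + 1/2)) = sqrt(resid_var).
        t = (mean_a - mean_b) / math.sqrt(resid_var)
        # Closed-form two-sided p-value for a t-distribution with 2 df.
        p = 1.0 - abs(t) / math.sqrt(2.0 + t * t)
        genes.append((p, resid_var))
    return genes


def keep_lowest_residual_variance(genes, fraction=0.1):
    """Keep the given fraction of genes with the smallest residual variance."""
    n_keep = int(len(genes) * fraction)
    return sorted(genes, key=lambda g: g[1])[:n_keep]


def rejection_rate(genes, alpha=0.05):
    """Fraction of genes with a p-value below alpha."""
    return sum(p < alpha for p, _ in genes) / len(genes)


genes = simulate_null_genes()
kept = keep_lowest_residual_variance(genes)
print("unfiltered rejection rate:", rejection_rate(genes))
print("filtered rejection rate:  ", rejection_rate(kept))
```

Since every gene is null, the unfiltered rejection rate should sit close to the nominal 5%, while the filtered set should reject far more often because the retained genes are exactly those whose variance was underestimated by chance. This toy setup says nothing about count data or about how the filter interacts with empirical Bayes shrinkage.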