Feature selection on count data

Question


joseah opened this issue 4 years ago · 1 comment

Thanks for the corral package!

This is an analysis question: both the Seurat and OSCA pipelines primarily apply feature selection to log-transformed data before performing PCA (e.g. the top 2000 HVGs). Since corral works on count data, it seems counterintuitive to select highly variable genes on log-transformed data before running corral. Is there a recommended method for feature selection before running corral?

Answer 1 · 2020-09-15T23:31:17.000Z

Hi Jose, thanks for your interest in the package!

That is a great question, and I agree that feature selection approaches that require log-transformation are not a natural fit for a count-based pipeline. Nonetheless, in the datasets on which we’ve tested corral, we’ve found that the method is fairly robust to differences in feature selection, and that these approaches work fine because they are really just serving as a rough filter. Since corral is fast, the main purpose of feature selection is to remove noisy, very-low-count genes. Moreover, in our experience the number of cells, rather than the number of genes, is often the limiting factor for runtime. If more precise feature selection is desired, one option is to run a “preliminary” corral with all (or a large set of) genes, and then to select the genes with high embedding weights (PCu).
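To make the “preliminary run, then select by embedding weight” idea concrete, here is a minimal NumPy sketch of a plain correspondence analysis followed by a per-gene score. It is not the corral API: the function name `ca_gene_scores`, the `n_components` parameter, and the choice of scoring each gene by the Euclidean norm of its principal coordinates are all assumptions made for illustration, and `counts` is assumed to be a dense genes x cells matrix.

```python
import numpy as np


def ca_gene_scores(counts, n_components=10):
    """Score genes by the size of their correspondence-analysis embedding.

    counts: dense, non-negative genes x cells matrix (hypothetical input).
    Returns one score per gene; larger means the gene departs more from
    the independence (row mass x column mass) expectation.
    """
    counts = np.asarray(counts, dtype=float)
    P = counts / counts.sum()      # joint proportions
    r = P.sum(axis=1)              # gene (row) masses
    c = P.sum(axis=0)              # cell (column) masses
    expected = np.outer(r, c)      # expectation under independence

    # Standardized (Pearson) residuals; empty genes/cells contribute 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        S = np.where(expected > 0, (P - expected) / np.sqrt(expected), 0.0)

    U, s, _ = np.linalg.svd(S, full_matrices=False)
    k = min(n_components, len(s))

    # Row principal coordinates: D_r^{-1/2} U Sigma, truncated to k components.
    with np.errstate(divide="ignore", invalid="ignore"):
        row_pc = np.where(r[:, None] > 0,
                          U[:, :k] * s[:k] / np.sqrt(r)[:, None],
                          0.0)

    # Assumed scoring rule: norm of each gene's coordinates.
    return np.linalg.norm(row_pc, axis=1)
```

The top-scoring genes (for example `np.argsort(scores)[::-1][:2000]`) could then be passed to the final run. Because principal coordinates divide by the square root of the gene mass, very rare genes can receive large scores, so a rough low-count filter beforehand is probably still worthwhile.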

Townes et al. (2019) (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1861-6) discuss the issues with log-transformation in detail and suggest using deviance residuals (implemented in scry), which is another option that avoids the log-transformation step.
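As a rough illustration of count-based feature ranking in that spirit, the sketch below computes a per-gene binomial deviance against a null model in which each gene makes up a constant fraction of every cell's total counts. It is a NumPy approximation written for this thread, not scry's implementation; the function name `binomial_deviance` and the dense genes x cells input are assumptions.

```python
import numpy as np


def binomial_deviance(counts):
    """Per-gene binomial deviance under a constant-proportion null model.

    counts: dense, non-negative genes x cells matrix (hypothetical input).
    Genes with high deviance vary across cells more than the null predicts.
    """
    y = np.asarray(counts, dtype=float)
    n = y.sum(axis=0)                   # total counts per cell
    pi = y.sum(axis=1) / n.sum()        # null proportion per gene
    mu = np.outer(pi, n)                # expected counts under the null
    n_mat = np.broadcast_to(n, y.shape)

    def a_log_a_over_b(a, b):
        # a * log(a / b), using the convention 0 * log(0) = 0.
        out = np.zeros_like(a)
        mask = a > 0
        out[mask] = a[mask] * np.log(a[mask] / b[mask])
        return out

    successes = a_log_a_over_b(y, mu)
    failures = a_log_a_over_b(n_mat - y, n_mat - mu)
    return 2.0 * (successes + failures).sum(axis=1)
```

Ranking genes by this value in decreasing order and keeping the top few thousand gives a feature set chosen directly on counts, with no log-transformation involved.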

Did you have any particular approaches in mind? I’d be curious to hear if you have other ideas.