natsuhiko/rasqual

Handling peaks with influential points

Closed this issue · 2 comments

Hi Natsuhiko

I am looking to remove ATAC-seq peaks from my QTL analysis that have highly influential points. I'm thinking of using Cook's distance similar to what DESeq2 does:
http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html

under the "Approach to count outliers" heading. Basically, if a peak has a sample with a Cook's distance that is too large, the peak is removed from the analysis.

Do you think it would be sufficient to fit negative binomial models to the peaks in R (for example with glm.nb from the MASS package, since the base glm function has no negative binomial family) and then calculate Cook's distance from these fits? This doesn't quite capture the overdispersion adjustment that RASQUAL uses in model fitting, but it would be much faster. Is this OK given that this is just a data filtering step?
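
A minimal sketch of this filter for a single peak, under several assumptions: the model is fitted with glm.nb from MASS, genotype is coded as an alternative-allele dosage (0, 1, 2), no covariates or library-size offsets are included, and the cutoff is the 0.99 quantile of the F(p, m - p) distribution that the DESeq2 vignette describes as its default. The function name peak_has_influential_sample and its arguments are hypothetical and are not part of RASQUAL or DESeq2.

```r
library(MASS)  # provides glm.nb(); MASS ships with standard R installations

# Hypothetical helper: returns TRUE if any sample's Cook's distance exceeds
# the cutoff, FALSE if none does, and NA if the model could not be fitted.
#   counts:   read counts for one peak, one value per sample
#   genotype: alternative-allele dosage per sample (0, 1 or 2)
#   q:        quantile of F(p, m - p) used as the cutoff (0.99 mirrors the
#             default described in the DESeq2 vignette)
peak_has_influential_sample <- function(counts, genotype, q = 0.99) {
  dat <- data.frame(counts = counts, genotype = genotype)
  fit <- tryCatch(
    suppressWarnings(glm.nb(counts ~ genotype, data = dat)),
    error = function(e) NULL
  )
  if (is.null(fit)) return(NA)

  m <- length(counts)      # number of samples
  p <- length(coef(fit))   # number of fitted coefficients
  cutoff <- qf(q, p, m - p)

  any(cooks.distance(fit) > cutoff, na.rm = TRUE)
}

# Simulated example: 20 samples, with one artificially extreme count
set.seed(1)
genotype <- rep(0:2, length.out = 20)
counts <- rnbinom(20, mu = 50 * exp(0.2 * genotype), size = 10)
counts[1] <- 2000L
flag <- peak_has_influential_sample(counts, genotype)
```

Applied across peaks (for example with apply over the rows of a count matrix), the resulting flags could either be used to drop peaks before running RASQUAL or be kept as an annotation for later inspection. Note that a single extreme count also inflates the fitted dispersion, so whether a given sample crosses the cutoff depends on the data.
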
Thanks!

Kevin

Hi Kevin,

I'm not familiar with influential-point methods, so I can't advise on whether you should incorporate them into the analysis. I guess DESeq2 estimates overdispersion across all features (genes, ATAC peaks, etc.), which is why it requires feature selection a priori. RASQUAL instead estimates overdispersion for each feature independently, so the effect of a bad feature is minimal.

One thing I would say is that whenever you filter features by some criterion, you also drop some interesting features (how many depends on how sensitive the filter is to the peaks you are interested in). Personally, I don't filter anything out up front: I keep all the data at the beginning and usually filter only at the very end of the analysis.

Best regards,
Natsuhiko

DESeq2 also estimates dispersion on a per-peak (or per-gene) basis. I think the Cook's distance filtering is just meant to remove from downstream analyses any genes whose results are strongly influenced by a single sample. I will close this issue because this could be a pre- or post-processing step performed by the user.