Normalization/Log2 transformation requirements
mallorymaynes opened this issue · 4 comments
Hello and thanks for developing this model. I read in the supplemental materials that the G x S matrix for RNAseq data should be filtered for low counts, normalized, and also log2 transformed before running the model. It also gives RPKM and TPM as suggestions for the normalization. However, I would like to use upper-quantile normalized counts generated by RUVg so I can use my spike-ins easily. Will this be a problem? So far I have filtered low count genes, extracted the normalized counts from RUVg, log2 transformed them, and rounded them so they are integers. I want to be sure I am understanding correctly and that my normalization procedure checks out (and also that I'm not over-normalizing).
Thanks!
Hi @mallorymaynes, ImpulseDE2 uses a negative binomial noise model, which comes with assumptions about the data distribution and is built for count data (i.e. non-normalised, non-logged, integer). This type of statistical modelling can still work if your data transform does not violate the count data structure too much, but log-transforming, for example, will most likely cause major issues.
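To illustrate why log-transforming is a problem, here is a minimal Python sketch (not ImpulseDE2 code; the function name `log2_round` and the pseudocount of 1 are assumptions made for the example). It applies the log2-then-round procedure described in the question to a few made-up counts:

```python
import math

def log2_round(counts, pseudocount=1):
    """Log2-transform counts (adding a pseudocount so zeros are defined) and round to integers."""
    return [round(math.log2(c + pseudocount)) for c in counts]

# Hypothetical raw counts for one gene across four samples
raw = [10, 100, 1000, 10000]
transformed = log2_round(raw)

print(transformed)  # [3, 7, 10, 13]

# Fold range across samples before and after the transform
raw_fold_range = max(raw) / min(raw)
transformed_fold_range = max(transformed) / min(transformed)
print(raw_fold_range, transformed_fold_range)
```

The rounded values are integers, but they are no longer counts: a 1000-fold range in the raw data is compressed to roughly a 4-fold range, so the relationship between mean and variance that a negative binomial model assumes no longer describes the data.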
If your transforms don't change the statistics too much, it may work, but it would be better to use raw count data and supply size factors to scale the model. Filtering genes does not affect the model fits of the other genes if you define size factors.
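As a rough illustration of what per-sample size factors are, the following Python sketch computes them from raw counts with the median-of-ratios approach popularised by DESeq. This is a sketch under assumptions: the function name and the toy matrix are hypothetical, and whether this matches the estimator ImpulseDE2 uses internally is not confirmed here.

```python
import math
from statistics import median

def median_of_ratios_size_factors(counts):
    """Estimate one size factor per sample.

    counts: list of genes, each a list of raw integer counts (one per sample).
    """
    n_samples = len(counts[0])
    # Keep only genes with nonzero counts in every sample so the log is defined
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene has nonzero counts in every sample")
    # Log of the geometric mean of each usable gene across samples
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        # Log ratio of each gene's count in sample j to that gene's geometric mean
        log_ratios = [math.log(row[j]) - lgm for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Hypothetical genes x samples matrix; sample 2 is sequenced twice as deeply as sample 1
counts = [
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 3],  # excluded from the estimate because of the zero
]
size_factors = median_of_ratios_size_factors(counts)
print(size_factors)
```

In this toy matrix the second sample's factor comes out twice the first's, reflecting the difference in depth. The counts themselves stay untouched; the factors only scale the model's expected values per sample.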
Thank you, this is very helpful. It sounds like I should instead use my raw counts and include the estimated factors of unwanted variation generated by RUVg - is that what you mean by supplying factors to scale the model?
Hi David, I am still a little confused about how to input my RUVseq factors of unwanted variation into ImpulseDE2. Specifically, the output for RUVseq (called "W_1") is used as a covariate in DESeq2 or edgeR models, such that the full model for a time course in DESeq2 would be "~ W_1 + time + treatment + treatment:time," and the reduced would be: "~ W_1 + treatment + time." Given this, how do I correctly integrate W_1 into ImpulseDE2? Would this be considered vecConfounders, size factors, or something I can integrate in the dfAnnotation? Thanks for your help, it is much appreciated!
This would be an element of vecConfounders, which essentially builds a model that works like the "+" nomenclature in DESeq!