Question on phitest
Opened this issue · 7 comments
Hi, thanks for the pacakge. I am trying to incorporate phitestR in my workflow. May I know whats the rationale behind t-testing the zero counts instead of the dispersion variable itself?
Addon: Why would the common dispersion tends to under-estimate the zero frequencies of biologically variable genes compared with the gene-specific dispersion if its heterogenous. Cant it over-estimate the zero frequencies too?
Hello,
A main advantage of testing the zero count frequency instead of dispersion itself is computational efficiency.
Regarding the second question, the gene-specific dispersions overall correspond to a more flexible model and will have greater capability in capturing variation presented in the data. I have not seen cases where the common dispersion over-estimates the zero frequencies.
Best,
Vivian
Unfortunately, I didnt get very good results with phitest. My p-value is higher for completely random clusters than carefully manually annotated clusters. I think that the assumption only applies to some cell types. Some of clusters such as HSCs have super negative p-values
Thanks for reporting the issue. Can you clarify what "negative p values" represent?
Phitest relies on the assumptions of Negative Binomial distributions, so in real data applications, we suggest users to also investigate plots like Figure 1 and Supplementary Figure S2 in addition to the P values before making a decision.
Sorry, I meant super small p-values. Like 1e-20, while a completely randomized clustering(I randomly permutated the clustered labels) yielded me values like 1e-6.
If you can share some reproducible data, I'm interested in looking into the issue.
Hi sorry I couldnt share my data. However, I see a similar problem with randomized clustering and combining 2 clusters with a public dataset(although I didnt get small p values here). Heres the code that I used. It seems that the common dispersion is larger than the gene fitted ones.
``
rna <- readRDS("raw_data/ADT_RNA_10x/AML_Greenleaf/scRNA-Healthy-Hematopoiesis-191120.rds")
rna <- CreateSeuratObject(assay(rna, "counts"), meta.data = as.data.frame(colData(rna)))
rna <- NormalizeData(rna)
rna <- FindVariableFeatures(rna)
rna <- ScaleData(rna)
rna <- RunPCA(rna)
rna <- RunUMAP(rna, dims = 1:30)
Idents(rna) <- "BioClassification"
DimPlot(rna, label = T)
rna <- subset(rna, downsample = 30)
df1 <- phitest(as.matrix(rna@assays$RNA@counts), label = rna$BioClassification, ncores = 40)''
a <- rna$BioClassification
names(a) <- NULL
rna$random <- a[sample(c(1:dim(rna)[2]))]
df2 <- phitest(as.matrix(rna@assays$RNA@counts), label = rna$random, ncores = 40)
rna_nk <- subset(rna, ident = "25_NK")
rna_mix <- subset(rna, ident = c("25_NK", "18_Plasma"))
df1 <- phitest(as.matrix(rna_nk@assays$RNA@counts), label = rep("a", 30), ncores = 40)
df2 <- phitest(as.matrix(rna_mix@assays$RNA@counts), label = rep("a", 60), ncores = 40)
After looking through the plots, I think that the problem is that in a heterogenous population, the number of variable genes dominate the non variable genes, compared to a homogenous population. Therefore the common dispersion estimated is very high for a heterogenous population, and not fit the plot well. I think a quantile regression might work better in this case.