campbio/celda

Unexpected behavior when subsetting on decontaminated counts

jmodlis opened this issue · 1 comment

Hello,

Thank you for this great tool! I have noticed some unexpected behavior when running Seurat's subset function in downstream analyses. It appears that I have to call subset twice when using the decontaminated counts from DecontX. I understand that the decontaminated counts will be lower than the original counts, so it makes sense that the nCount_RNA metadata column, for example, would not reflect the decontaminated counts. But I don't understand why calling subset twice makes it work, or where summary(s2l@meta.data$nCount_RNA) is really pulling its information from. The cleanest solution I can come up with is to set sce$nCount_RNA <- NULL and sce$nFeature_RNA <- NULL prior to calling CreateSeuratObject; this seems to make Seurat recalculate the metrics, and subset then behaves as expected downstream. See below for the unexpected behavior. This is more than likely a Seurat issue, but it will affect users of your tool.

> sl <- CreateSeuratObject(counts=counts(sce),
+                                  meta.data=as.data.frame(colData(sce)))
> sl <- subset(sl, subset=(nFeature_RNA > nFeature_RNA.ll & nFeature_RNA < nFeature_RNA.ul) & (nCount_RNA > nCount_RNA.ll &nCount_RNA < nCount_RNA.ul) & percent.mt < percent.mt.ul)
> summary(sl@meta.data$nCount_RNA)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2001    6203   10536   11039   14841   38710 
> dim(sl@meta.data)
[1] 13385    14
> 
> r <- round(decontXcounts(sce))
> s2l <- CreateSeuratObject(counts=r,
+                                  meta.data=as.data.frame(colData(sce)))
> s2l <- subset(s2l, subset=(nFeature_RNA > nFeature_RNA.ll & nFeature_RNA < nFeature_RNA.ul) & (nCount_RNA > nCount_RNA.ll &nCount_RNA < nCount_RNA.ul) & percent.mt < percent.mt.ul)
> summary(s2l@meta.data$nCount_RNA)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      4    5544    9783   10253   13948   38688 
> dim(s2l@meta.data)
[1] 13385    14
> s2l <- subset(s2l, subset=(nFeature_RNA > nFeature_RNA.ll & nFeature_RNA < nFeature_RNA.ul) & (nCount_RNA > nCount_RNA.ll &nCount_RNA < nCount_RNA.ul) & percent.mt < percent.mt.ul)
> summary(s2l@meta.data$nCount_RNA)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2001    5970   10021   10524   14080   38688 
> dim(s2l@meta.data)
[1] 12987    14
> 

Hi @jmodlis, thanks for trying out our tool! I'm not totally sure. What is stored in colData(sce)? If variables such as nFeature_RNA and nCount_RNA are in the colData, then you may want to exclude them from the metadata when creating a new Seurat object so that Seurat can recalculate them.
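
As a rough illustration of that suggestion, here is a minimal base-R sketch. It assumes the stale columns are named nCount_RNA and nFeature_RNA, as in the post. The helper `drop_stale_qc` and the toy objects `toy_counts` and `toy_meta` are hypothetical stand-ins for round(decontXcounts(sce)) and as.data.frame(colData(sce)); none of them come from Seurat or celda. The block first checks whether the stored per-cell totals match the matrix that would be passed as counts, then removes the columns so they are not carried into the new object.

```r
# Hypothetical helper (not part of Seurat or celda): drop the per-cell QC
# columns for a given assay so stale values are not passed along as metadata.
drop_stale_qc <- function(meta, assay = "RNA") {
  stale <- paste0(c("nCount_", "nFeature_"), assay)
  meta[, !(colnames(meta) %in% stale), drop = FALSE]
}

# Toy stand-in for round(decontXcounts(sce)): genes in rows, cells in columns.
toy_counts <- matrix(
  c(5, 0, 3,
    0, 2, 1,
    4, 4, 0),
  nrow = 3,
  dimnames = list(paste0("gene", 1:3), paste0("cell", 1:3))
)

# Toy stand-in for as.data.frame(colData(sce)), with QC columns that were
# computed from a different matrix and are therefore stale.
toy_meta <- data.frame(
  nCount_RNA   = c(100, 200, 300),
  nFeature_RNA = c(50, 60, 70),
  percent.mt   = c(1.0, 2.5, 0.3),
  row.names    = colnames(toy_counts)
)

# Diagnostic: do the stored totals agree with the matrix passed as counts?
stored_matches <- toy_meta$nCount_RNA == colSums(toy_counts)
print(stored_matches)

# Remove the stale columns; other metadata (here percent.mt) is kept.
meta_clean <- drop_stale_qc(toy_meta)
print(colnames(meta_clean))
```

With the objects from the post, the same idea would look like CreateSeuratObject(counts=r, meta.data=drop_stale_qc(as.data.frame(colData(sce)))), which should have the same effect as setting sce$nCount_RNA <- NULL and sce$nFeature_RNA <- NULL beforehand, without modifying sce itself.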