chris-mcginnis-ucsf/MULTI-seq

including "negative" population helpful for algorithm?

Closed this issue · 2 comments

Hi,

I am using your great package through the MULTIseqDemux function in Seurat. Depending on the settings, a varying fraction of each cluster is assigned to "negative" (see the HTO t-SNE below):
[Image: HTO t-SNE plot]

Judging visually, the negatives could also be assigned to their neighboring singlet cluster.
What are your thoughts on including real negative cells (i.e., empty droplets) when running MULTIseqDemux()? Would it make the classification of negatives more precise? Empty droplets could be retrieved from the cellranger raw matrix (raw_feature_bc_matrix) rather than the filtered matrix.

Thanks!

Best wishes
Tilo

Hi @tilofreiwald ,

Good question! Short answer: No, including empty droplets will not make the classifications more precise. A couple of points:

1 - Assuming that MULTIseqDemux is the Seurat version of our classification workflow, you may consider reclassifying your data using the deMULTIplex R package. IIRC, the Seurat version is a preliminary version of our pipeline and doesn't work quite as well as the more recent deMULTIplex version.

2 - I would avoid including too many negative cells in the classification workflow, as it may mess with the algorithm's maxima detection step in two ways:

(A) Having too many negatives in your data may result in trimodal barcode distributions, with peaks corresponding to empty droplets, cell-containing droplets from the 'incorrect' sample, and cell-containing droplets from the 'correct' sample. This can skew doublet classifications towards certain barcodes (BCs).

(B) Having too many negatives in your data may 'dilute' the on-target peak to the point that the maxima detection step can no longer find it. Without that peak, the algorithm has nothing to separate positive cells from background for that barcode, so classification fails.
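
To make (A) and (B) concrete, here is a toy sketch in plain Python. It is not the deMULTIplex or Seurat code: it assumes the maxima detection works on a smoothed density of per-barcode counts, and every number in it (peak positions, cell counts, bandwidth) is invented for illustration.

```python
import math

def spread(center, n, width=0.4):
    """Return n evenly spaced values centred on `center` (deterministic stand-in for noisy counts)."""
    return [center + width * (k / (n - 1) - 0.5) for k in range(n)]

def kde(values, grid, bandwidth=0.15):
    """Gaussian kernel density estimate of `values`, evaluated at each grid point."""
    norm = len(values) * bandwidth * math.sqrt(2 * math.pi)
    return [
        sum(math.exp(-0.5 * ((g - v) / bandwidth) ** 2) for v in values) / norm
        for g in grid
    ]

def local_maxima(grid, density):
    """Return (position, height) for every interior local maximum of the density."""
    return [
        (grid[i], density[i])
        for i in range(1, len(grid) - 1)
        if density[i - 1] < density[i] >= density[i + 1]
    ]

def on_target_share(peaks):
    """Height of the right-most peak relative to the tallest peak."""
    return peaks[-1][1] / max(height for _, height in peaks)

# Evaluation grid from -1.0 to 5.0 on a log10-like scale.
grid = [i * 0.05 for i in range(-20, 101)]

# Hypothetical log-scale counts for ONE barcode (all values are made up).
off_target = spread(1.5, 300)   # cells from the 'incorrect' samples
on_target = spread(3.0, 100)    # cells from the 'correct' sample
empty = spread(0.0, 2000)       # empty droplets

peaks_cells = local_maxima(grid, kde(off_target + on_target, grid))
peaks_all = local_maxima(grid, kde(empty + off_target + on_target, grid))

print("cells only  :", len(peaks_cells), "peaks, on-target share",
      round(on_target_share(peaks_cells), 2))
print("with empties:", len(peaks_all), "peaks, on-target share",
      round(on_target_share(peaks_all), 2))
```

With these made-up inputs the density should go from two peaks to three once the empty droplets are added, as in (A), and the on-target peak becomes much smaller relative to the tallest peak, as in (B). Real data are noisier, so a diluted peak can disappear entirely rather than just shrink.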

With all that said, I would suggest not including empty droplets in your classification workflow. However, I usually begin with the raw_feature_bc_matrix to avoid CellRanger biases against cells with low RNA content (which can be detected using MULTI-seq or Cell Hashing). I pre-process a Seurat object containing all 'cells' with at least X RNA UMIs (e.g., >100 UMIs), perform QC on this data (e.g., % mito, marker analysis, etc.), and then pass the list of putative cell IDs to the deMULTIplex classification workflow.
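
The pre-filtering logic described above can be sketched independently of any package. This is a minimal illustration in plain Python, not the Seurat or deMULTIplex API: the function name `putative_cell_ids`, its parameters, and the 20% mitochondrial cutoff are all hypothetical; only the >100 UMI example comes from the workflow above.

```python
from typing import Dict, List

def putative_cell_ids(
    rna_umis: Dict[str, int],
    pct_mito: Dict[str, float],
    min_umis: int = 100,          # example floor from the text (>100 UMIs)
    max_pct_mito: float = 20.0,   # assumed QC cutoff, purely illustrative
) -> List[str]:
    """Return barcodes that pass a UMI floor and a mitochondrial-percentage ceiling."""
    keep = []
    for barcode, n_umis in rna_umis.items():
        if n_umis <= min_umis:
            continue  # too few RNA UMIs: treated as an empty droplet
        # Barcodes with no QC value default to 100% and are therefore dropped.
        if pct_mito.get(barcode, 100.0) > max_pct_mito:
            continue  # fails the (assumed) % mito check
        keep.append(barcode)
    return sorted(keep)

# Made-up per-barcode totals, as if read from the raw (unfiltered) matrix.
rna = {"AAAC": 5, "AAAG": 150, "AAAT": 2500, "AACA": 800}
mito = {"AAAC": 1.0, "AAAG": 3.0, "AAAT": 45.0, "AACA": 8.0}

cell_ids = putative_cell_ids(rna, mito)
print(cell_ids)
```

Only the resulting barcode list would then be handed to the classification step, so low-RNA cells that the filtered matrix would have dropped are still considered while empty droplets are kept out.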

Hope this helps,
Chris

Hi, thanks, that is very helpful! Especially the skewing you mention in (A) was present in my unfiltered data when I gave it a try, but I wasn't sure if I should judge visually. I used the Seurat implementation but will check out the original version. Best, Tilo