chris-mcginnis-ucsf/MULTI-seq

No threshold found for barcode

Closed this issue · 10 comments

I am using deMULTIplex on a dataset with six barcodes, of which one (no. 3) happens to be notably more prevalent than the other barcodes. deMULTIplex gives the following note when running classifyCells: "No threshold found for Bar3..." - and as a result, all of those cells are classified as negative. When looking at barcode counts, however, it is immediately obvious that they belong to barcode 3, and the expression of this barcode is 3-4 magnitudes higher than any other barcode. From looking at cells classified in the remaining barcodes (1,2, 4-6), there doesn't appear to be much background noise from barcode 3 either. Is there something I can do to improve classification?

Hello!

Could you upload the barcode TSNE plots?

Also what kind of pre-processing are you doing before defining the list of cell IDs to classify?

Chris

hey @skannan4 -- let me know if you ever figured this out!

Hi, @skannan4
I have similar issues, how did you work it out? Thansk!

Hey Chris, apologies for raising an issue and then falling off the face of the planet. We've had deMULTIplex work well in our most recent Multi-seq run but I was revisiting this older dataset and I think I realised what happened. At least in that particular experiment, we had one barcode dominate in particular. deMULTIplex sets the lower of the local maxima based on the mode, but in this particular case that didn't work because the mode was actually equivalent to the highest maxima for that dominating barcode. We initially proceeded using a basic cutoff (e.g. if proportion of barcode > 0.9, we assigned it that barcode), but I could envision possible solutions within deMULTIplex. At least in the quick case studies I tested, the extrema finding only ever found two extreme, so perhaps it could be possible to just set the lower extrema to the min rather than mode? (I'm assuming there are cases where that breaks down). Alternatively, we could mix together two datasets, in particular with a dataset lacking that dominating barcode and that could resolve the issue.

Hey @skannan4 -- all good! Thanks for touching base again.

Funny enough, my first deMULTIplex implementation actually defined the 'negative' peak as the lowest local maxima instead of the mode. The reason why I switched to the mode is because I found that in some of my initial datasets, local maxima were being identified that were considerably lower than the usually-obvious 'negative' peak. These maxima just reflect background MULTI-seq barcodes -- so kudos for making a super clean dataset where you don't see this -- and end up making classification results that are biased against certain barcodes (as you imagined).

So considering this, I opted for the mode-based negative peak identification because I assumed the vast majority of sample multiplexing experiments (i) would not include a single sample that had a frequency greater than the sum of all other barcodes and/or (ii) would include more than 2 samples. My sense is that this is generally true, but I will definitely include a caveat in the readme about this.

For cases where you only have 2-3 barcodes, it might honestly just be easier (and more accurate, tbh) to just compute the kernel density estimations and then manually set the thresholds. This workflow is not so bad if you only have a few BCs, but becomes untenable when you start scaling.

Finally, demuxEM, GMM-Demux, and Solo could handle this use-case without issue.

Chris

Hi @skannan4 and @chris-mcginnis-ucsf

I have a very similar scenario in which, I just have 3-4 sample barcodes and one the sample barcode has relatively very high proportion of # reads when compared to other sample barcodes. I saw @skannan4 post of using basic cut-off of 0.9 and @chris-mcginnis-ucsf suggestion to compute KDE and manually set the threshold. I have used @skannan4 approach, but how to define for doublets ? Example if a cell barcode has equal frequency (not sure if that would happen), should we label that as multiplet ?

@chris-mcginnis-ucsf Could you please detail a little bit on the later approach.

Thanks.

Hey @rjg2186 - I can chime in briefly with our logic, tho obvious Chris is gonna be the guy here. Assuming that you don't actually need to know the difference between "negative" and "doublet," you can just assume everything under 0.9 is a doublet. For example, if your cell has two barcodes, then the percentage will be 40-60%.

I'll tell you that what I ended up doing for that dataset was just setting the lower extrema to the min instead of the mode, and that worked for that dataset. As Chris pointed out, there are a bunch of reasons to not do that in general with more appropriately distributed datasets (I also tested using the min on other datasets and the results were poor). In this case, it worked fine for that one dataset, and since then we've been able to generate better samples by refining our barcoding protocol (namely with early pooling).

Hi @skannan4

Thanks for the reply. In my case, I have barocode and cDNA libraries from Drop-seq experiment. Also, I don't have the reference cell barcodes list to start with. I am trying to figure out how to proceed for that part. Any thoughts or suggestion is highly appreciated,

Thanks.

Hi @rjg2186 ,

To set the thresholds manually, I would first visualize the distributions like so:

barTable.n <- as.data.frame(log2(barTable))
    for (i in 1:ncol(barTable)) {
        ind <- which(is.finite(barTable.n[, i]) == FALSE)
        barTable.n[ind, i] <- 0
        barTable.n[, i] <- barTable.n[, i] - mean(barTable.n[, 
            i])
    }

Where barTable is the output of MULTIseq.preProcess, excluding the nUMI and nUMI_total columns. You can then just visualize the distributions of each barcode using hist(), manually identify a suitable threshold, and then assign singlets, doublets, negatives using these threshold sets.

And as for your second question -- you'll need to analyze your gene expression data first to get a list of cellIDs that you predict are not empty droplets. You can then feed that list of cellIDs into the deMULTIplex pipeline.

Chris

Thanks for starting & following up on this thread.

I have the same "No threshold found" problem in datasets of 4 barcodes.
What would be a programmatic the best way to redefine the threshold in the MULTI-seq pipeline?

I have the following scaled distributions (as per @chris-mcginnis-ucsf )

BarcodeDistributions

image

I guess the problem is the lack of the higher local maxima – so I guess the 90% quantile would be a decent guess (comments appreciated).

My question is then, how this should be provided to the MULTI-seq pipeline? My guess would be to

Decompose as a Gaussian Mixture:

image

Where the threshold is defined as mean of lower peak + 2 * STD, that contains 97.9% of background cells.

Gaussian Mixture Models are estimated as:

require(mixtools)
require(MarkdownReports)

i=3
{
  # log2.bar.table <- log2(bar.table_TSC)
  log2.bar.table <- log2(bar.table_Ctrl[, 1:4])
  BT.logscaled <- log2.bar.table-rowMeans(log2.bar.table)
  set.seed(1989)

  pdfA4plot_on(pname = p0("Gaussian Decomposition of ",'CTRL'," Barcodes"), rows = 3)
  for (i in 1:ncol(BT.logscaled)) {
    Bar <- colnames(BT.logscaled)[i]

    my_mix <- normalmixEM(BT.logscaled[,i], k = 2) #telling it to find two gaussians in the observations
    mean_plus_2_std <- min(my_mix$mu) + 2*my_mix$sigma[which.min(my_mix$mu)]
    pTitle <- kollapse("Gaussian Decomposition of ", Bar,
                       " | above BG: ",
                       pc_TRUE(BT.logscaled[,i] > mean_plus_2_std)
    )

    plot(my_mix, which = 2, sub=pTitle) #looking at the model produced with base graphics
    abline(v=mean_plus_2_std, col=4, lwd=2)
    # str(my_mix)

  }
  pdfA4plot_off()
};
# From https://dozenoaks.twelvetreeslab.co.uk/2019/06/mixture-models/