plger/scDblFinder

Request for clarification of dbr and dbr.sd in scDblFinder 1.15.1

Closed this issue · 4 comments

aghr commented

Dear scDblFinder Team,

Could you please help me clarify the usage of the parameters dbr and dbr.sd of the function scDblFinder()?

  1. dbr: Setting dbr=0.01 reflects the assumption that 1% of cells are doublets per 1k cells. The user should set dbr according to this per-1k-cells scheme, and scDblFinder would then automatically increase dbr internally if the data set at hand contains many more cells. Is that right?
  2. dbr.sd: In the help message of scDblFinder() I find: "Set to dbr.sd=0 to disable." The GitHub README.md reads: "If you are unsure about the doublet rate, set dbr.sd=1 and the thresholding will be entirely based on the misclassification rates." Both seem intended to disable dbr.sd. Can the user disable it by setting dbr.sd=0, by setting dbr.sd=1, or either way?

Many thanks.
Andre

plger commented

Hi,

  1. If you don't set dbr, then internally it will be set based on the number of cells and the 1%/1k cells rule. However, if you set dbr manually, then this rate will be used as is, i.e. it won't be scaled with the number of cells.
  2. You're right, that was ambiguous; I've now updated the help to clarify this. Setting dbr.sd=0 will disable the uncertainty around the doublet rate, while setting dbr.sd=1 will increase the uncertainty to the point of disabling the doublet rate altogether (thus letting the thresholding be entirely driven by the misclassification of artificial doublets).
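To make point 1 concrete, here is a minimal arithmetic sketch in plain Python. It is not scDblFinder's code: the function name `effective_dbr` and the parameter `rate_per_1k` are hypothetical, and the linear 1%/1k scaling is simply the rule as described in this thread.

```python
from typing import Optional


def effective_dbr(n_cells: int,
                  dbr: Optional[float] = None,
                  rate_per_1k: float = 0.01) -> float:
    """Illustrative only: the doublet rate as described in this thread.

    Assumption: the default follows a linear "1% per 1,000 cells" rule.
    This is not taken from the package source.
    """
    if dbr is not None:
        # A manually supplied rate is used as is, with no scaling.
        return dbr
    # No rate supplied: scale the per-1k rate with the number of cells.
    return rate_per_1k * n_cells / 1000


for n in (1_000, 5_000, 12_000):
    print(f"{n:>6} cells: default {effective_dbr(n):.1%}, "
          f"manual dbr=0.01 gives {effective_dbr(n, dbr=0.01):.1%}")
```

Under these assumptions, leaving dbr unset gives a rate that grows with the capture size, whereas dbr=0.01 stays at 1% whatever the number of cells.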

Hope this helps,
plger

aghr commented

Thank you very much. I have another related question regarding your point 1. Wouldn't that algorithm run into problems with very large data sets, say of more than 100k cells, leading to dbr values greater than 1 (100%)? I expect such data sets to become common at some point. 10X announced a 1.3-million-cell data set in 2017.

Thanks a lot again.
Andre

plger commented

Such large datasets are produced in multiple captures, so that each capture has only 12k cells or so. As indicated in the documentation, different captures should be processed separately in scDblFinder, for example using the samples argument, because the number of cells loaded into the machine in a given capture is the actual determinant of the expected number of doublets.
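As a rough illustration of why the per-capture view matters, the sketch below applies the 1%/1k rule from earlier in this thread to the whole 1.3-million-cell data set and to a single capture of about 12k cells. It is plain Python arithmetic, not scDblFinder code; `rule_of_thumb_dbr` is a hypothetical helper, and the linear scaling is an assumption taken from the discussion above.

```python
def rule_of_thumb_dbr(n_cells: int, rate_per_1k: float = 0.01) -> float:
    """Assumed linear rule from this thread: 1% doublets per 1,000 cells."""
    return rate_per_1k * n_cells / 1000


total_cells = 1_300_000     # the large data set mentioned above
cells_per_capture = 12_000  # "12k cells or so" per capture

# Treating the whole data set as one capture gives a meaningless rate (> 100%).
pooled = rule_of_thumb_dbr(total_cells)

# Treating each capture on its own keeps the rate in a plausible range.
per_capture = rule_of_thumb_dbr(cells_per_capture)

print(f"pooled as one capture: {pooled:.0%}")
print(f"single capture:        {per_capture:.0%}")
```

The pooled figure exceeds 100% only because the rule is applied to cells that never shared a capture; applied per capture, the rate depends on the capture size rather than on the size of the whole data set.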

plger commented

If this answered your question, please close the issue.
Best,