Use of --explicit-pl vs --unfiltered-pl to get equivalent of raw_feature_bc_matrix.h5
Hi
I'm looking for a bit of guidance on proper use of --explicit-pl or --unfiltered-pl. I would like to run CellBender on my simpleaf-mapped counts data, and to do that I need a matrix with lots of empty barcodes.
I first tried with --unfiltered-pl and the 10x whitelist here. However, this produced a matrix with ~6k barcodes. In the raw cellranger outputs (raw_feature_bc_matrix.h5), we typically get more like 6M barcodes.
I then tried with --explicit-pl and the same whitelist, and this time alevin returned a bit over 100k barcodes. However, I am now worried that these are noisy...
This leads to a couple of questions:
- Am I right that --unfiltered-pl does some checking of whether a barcode might be a sequencing error, while --explicit-pl does not?
- So I guess that also means that CellRanger doesn't attempt to correct this? And as a result has a much longer tail of empties?
Or do I have this bit wrong? 😅
There are multiple alevin-fry GitHub issues with users asking questions in this direction, I think most commonly with a view to getting inputs for CellBender / EmptyDrops (#47, #71, #74, #113). So (if and when you have time!) it could be helpful to have a definitive answer to this in the main documentation. Possibly even a --raw flag?
I guess a tricky aspect is that tools like CellBender are typically written with CellRanger inputs in mind, and you are trying to improve on CellRanger... Perhaps a project for a master's student to see if careful selection of parameters can get simpleaf and CellBender to play nicely together?
Thanks for all your work, and in particular the recent streamlining into simpleaf!
Best
Will
Hi @wmacnair,
I'll draft a more in-depth explanation shortly. In the meantime, --unfiltered-pl is what you are looking for. One distinction from CellRanger is that there is a --min-reads param in alevin-fry that always filters out cells with fewer than that number of reads. By default it is 10. You can change it if you want. In the past, we have interacted with the CellBender folks and were told that this would almost certainly not be a problem, as our default cutoff of 10 is very permissive.
Best,
Rob
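To make the effect of the --min-reads cutoff concrete, here is a minimal sketch in Python. It is not alevin-fry's implementation: the function name `filter_by_min_reads` and the toy barcode counts are made up for illustration, and the only thing taken from the thread is the rule itself (barcodes with fewer than the threshold number of reads are dropped, default 10).

```python
def filter_by_min_reads(reads_per_barcode, min_reads=10):
    """Keep only barcodes with at least `min_reads` reads.

    Hypothetical helper mirroring the rule described in the thread:
    barcodes with fewer than `min_reads` reads are filtered out.
    """
    return {bc: n for bc, n in reads_per_barcode.items() if n >= min_reads}


# Toy, made-up read counts per barcode (not real data).
toy_counts = {
    "AAACCCAAGAAACACT": 5200,  # likely a real cell
    "AAACCCAAGAAACCAT": 3100,  # likely a real cell
    "AAACCCAAGAAACCCA": 42,    # low-count barcode, still retained
    "AAACCCAAGAAACCCG": 10,    # exactly at the threshold, retained
    "AAACCCAAGAAACCTG": 9,     # just below the threshold, dropped
    "AAACCCAAGAAACGAA": 1,     # dropped
}

kept = filter_by_min_reads(toy_counts)
print(f"{len(kept)} of {len(toy_counts)} barcodes kept")
```

With a threshold this low, the low-count barcodes that tools like CellBender rely on are still present; only the very bottom of the distribution is removed.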
Ok, thanks.
I think sometimes it can be helpful to have the full set of barcodes to see the full knee plot curve. I don't think the barcodes at the bottom are ever used in CellBender, but their library sizes are used by CellBender to determine where the prior on the library size for empties should go.
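As a rough sketch of what I mean by the knee plot curve: sort barcodes by library size, largest first, and look at size against rank (usually on log-log axes). The function name `knee_curve` and the numbers below are invented for illustration only; the point is that a read-count cutoff leaves the top of the curve unchanged and only removes the tail.

```python
def knee_curve(library_sizes):
    """Return (rank, library_size) pairs, largest library first.

    Hypothetical helper: rank 1 is the barcode with the most counts.
    """
    ordered = sorted(library_sizes, reverse=True)
    return list(enumerate(ordered, start=1))


# Toy, made-up library sizes (not real data).
toy_sizes = [1, 5200, 9, 42, 3100, 10, 2, 1]

# Full curve, as you would see with every barcode retained.
full_curve = knee_curve(toy_sizes)

# Curve after dropping barcodes with fewer than 10 counts.
truncated_curve = knee_curve([s for s in toy_sizes if s >= 10])

for rank, size in full_curve:
    print(rank, size)
```

Here the truncated curve is just the first four points of the full curve; the remaining four points are the tail that a cutoff hides.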
(I also get the feeling that as CellBender is used more widely, they are coming across applications to datasets which don't fit the assumptions of CellBender so well, e.g. clinical single nuclei samples with substantial contamination.)
I'm now trying simpleaf on some different data that we know more about. Hopefully there the --unfiltered-pl approach will work nicely.
Cheers
Will
It turned out that this dataset included some files with very little RNA indeed. This made it look like simpleaf was not finding any "empty" droplets using --unfiltered-pl, but actually the problem was that there were barely any droplets to find, full stop.
I've since run it on all the samples, using the 10x barcode whitelist (here) and it works fine. Thanks!
Will
That’s great to hear. Thanks for reporting back and closing the issue. Let us know if you have any questions or suggestions in the future.