COMBINE-lab/alevin-fry

How to keep all the cellbarcodes?

Closed this issue · 4 comments

Hi Alevin-fry developers,

I am currently working on a combinatory indexing scRNA-seq dataset. And given that my dataset would have many ambient RNA, I would like to use the empty droplets for background estimation. So I am wondering if there is a way to keep all the cellbarcodes, no filtering, sth like the raw_feature_bc_matrix generated from 10x cellranger?

I initially thought with setting naiveEqclass and keepCBFraction 1.0 parameters would do the work but somehow the outputs are the same. I am wondering if there is sth I missed when running alevin-fry?

The codes I used:

salmon alevin -i /alevin_index_human_mouse/salmon_idx_humanv31_mouseM25_wo_decoy -l A -1 R1_fastq.gz -2 R2_fastq.gz \
        --naiveEqclass --keepCBFraction 1.0 \
        --read-geometry 2[1-end] \
        --umi-geometry 1[xx-xx] \
        --bc-geometry 1[xx-xx,xx-xx, xx-xx] \
        -p 20 -o ./alevin_RNA_fry --rad

alevin-fry generate-permit-list -d both -i ./alevin_RNA_fry --output-dir ./alevin_out_permit_knee -k
alevin-fry collate -r ./alevin_RNA_fry -t 16 -i ./alevin_out_permit_knee
alevin-fry quant -m /alevin_index_human_mouse/GRCh38-and-mm10_tsvtogene_all.tsv -i ./alevin_out_permit_knee -o ./alevin_counts -t 16 --resolution trivial --use-mtx

Thanks a lot for your help in advance.

Best,
Bingjie

rob-p commented

Hi @BingjieZhang,

Thanks for your description. The issue here has nothing to do with the invocation of the alevin command, but the fact that you are using the knee filtering in the alevin-fry generate-permit-list command. The knee filtering will only let in high read abundance (proxy for high confidence) cells.

You have a few options to get a less filtered list. One option is to look at the knee plot and then run again using —force-cells to determine how many cells you want to let through. The other is, given your combinatorial setup, to create an unfiltered permit-list, and then use the —unfiltered-pl flag to match against that list of possible barcodes. You can find the documentation on generate-permit-list here.

Let us know if you have any follow up questions.

Best,
Rob

Hi Rob,

Thanks a lot for your response. It worked well with -force-cells! I just have too many cells with too shallow sequencing depth. Now the cell number looks reasonable to me. Thanks again for your help!

I had exactly the same question, so thanks very much for this explanation.

Two small follow-ups:

  • What's the best way to do a knee plot showing all (not just the knee-filtered) barcodes?
  • The knee filtering approach works just on the basis of the distribution of droplet library sizes, while the EmptyDrops approach uses a statistical test against an empty population. I wondered if you had done any analysis of the effects they might have. Obviously the EmptyDrops paper would claim that you need the test, but maybe it rarely makes much difference in practice? Any pointers or thoughts would be great :)

Thanks!
Will

[Edit:

I've now discovered the discussion of knee vs EmptyDrops in the SI of the alevin-fry paper, which is helpful 😄 .

The conclusion there (although only based on one dataset) is that EmptyDrops is more permissive than knee, but that the quality of the clustering is worse when you do this. (You also conclude that choice of permit list method is more important than choice of UMI deduplication method.)

I'm thinking about circumstances where this might not be the best approach. I've recently been working with single nuclei RNAseq, where contamination by ambient RNA soup is a common problem (especially in human disease samples). Also the barcode knee plots are often not very clear. Here, you could imagine the knee approach identifying the larger and cleaner nuclei and excluding ambient-contaminated nuclei, while a more permissive permit list combined with an ambient RNA removal approach (such as scAR) could "rescue" those nuclei.

But I don't know - would be interesting to see this analysis!]

Thinking a bit more, it might be worth adding this as a feature request. What generate-permit-list (typically) does is to keep all barcodes above a certain abundance threshold, while what is needed here is an option to keep all barcodes below a certain abundance threshold.

This is a pretty common use case, especially in single nuclei RNAseq, and I hope would be pretty straightforward to add. Would be great if you'd consider adding it to a release at some point!

Cheers
Will