lanagarmire/SSrGE

10x genomics data demultiplexing

coh-racng opened this issue · 4 comments

Under the Methods section of your paper, the section '10x genomic data preprocessing' says that you de-multiplexed the reads using the BC tags extracted with the pysam library (http://pysam.readthedocs.io/en/latest/api.html).

Can you provide code for how you demultiplexed the I1, R1, and R2 fasta files that have all the cells already combined? How did you separate the fasta files into single cell fasta files before feeding them into your SNV calling pipeline? Or did you just process possorted_genome_bam.bam from cellranger count output and split it into single-cell bam files?

How did you separate the fasta files into single cell fasta files before feeding them into your SNV calling pipeline? Or did you just process possorted_genome_bam.bam from cellranger count output and split it into single-cell bam files?

Since I face the same issue, I would be super glad to hear how the authors dealt with this issue and whether they could provide the code?

Thank you so much!

Using samtools view to scan through the BAM file once for each cell barcode can take quite a while with time complexity of O(n) for n cells. It took around 1-2 minute for each cell and so it would have taken a few days to demultiplex a BAM file with thousands of cells. A faster solution that takes ~O(n/limit) time is to use pysam to read each line of the BAM file and write it to a new single-cell BAM file (inspired by https://github.com/yanailab/celseq2/blob/master/celseq2/demultiplex_sam.py), where limit is the maximum number of open files allowed on your system. You would demultiplex in batches of barcodes.

To use the 10x dataset through my SNV calling pipeline, I used the BAM file from the 10x datasets downloading page. The only issue is that the 10x BAM file pull all the file together. So I used a script to "demultiplex" the bam file and create fastq files for each cell in different folders. Then, in a second time I re-align with STAR and used my pipeline to call SNVs. I have added the script to the git rep here: https://github.com/lanagarmire/SSrGE/blob/master/garmire_SNV_calling/parse_10x_bam_file_to_fastq_files.py

You have to change the global variables in upper letters and use your own files and directory. I'm sorry that the pipeline is not more convenient for 10x format but I originally designed it for processing fluidigm datasets that contain a different file for each cell. Another thing is the study we did was more focused on the statistical processing of the SNVs. Also, maybe other SNVs callers such as vartrix might be more relevant to call SNVs from 10x data?