10XGenomics/subset-bam

Per-cell bams or fastqs from the mega-bam?

olgabot opened this issue · 1 comments

Hello,
Would it be possible to create 1000s of per-cell fast{q,a}s (preferred) or per-cell bams from the mega-bam of the 10x channel run? We wrote a tool to do this in Python, but it is slow and takes a lot of memory: http://github.com/czbiohub/bam2fasta In the end, performing 1000s of grep jobs was faster than doing it in one go with Python.
Thank you!
Warmest,
Olga

Hello! Any updates on this? This is my current workaround in the kmermaid pipeline:

Extract aligned reads to fastq:

    samtools view -ub -F 4 ${bam} \\
        | samtools fastq --threads ${task.cpus} -T ${tenx_tags} - \\
        | gzip -c - \\
          > ${reads}

Extract unaligned reads to fastq:

    samtools view -f4 ${bam} \\
      | grep -E '${tenx_cell_barcode_pattern}' \\
      | samtools fastq --threads ${task.cpus} -T ${tenx_tags} - \\
      | gzip -c - \\
        > ${reads} \\
      || touch ${reads}

Then running bam2fasta count_umis_per_cell:

        bam2fasta count_umis_percell \\
            --filename ${reads} \\
            --min-umi-per-barcode ${tenx_min_umi_per_cell} \\
            --cell-barcode-pattern '${tenx_cell_barcode_pattern}' \\
            --molecular-barcode-pattern '${tenx_molecular_barcode_pattern}' \\
            --write-barcode-meta-csv ${umis_per_cell} \\
            --barcodes-significant-umis-file ${good_barcodes}

Then using ripgrep to search for each cell barcode in the aligned and unaligned fastqs which has UMIs above a certain threshold:

    rg \\
      --search-zip \\
      --after-context 3 \\
      --threads ${task.cpus} \\
      '${this_cell_barcode}' \\
      ${reads} \\
      | gzip -c - \\
      > ${this_cell_fastq_gz} || touch ${this_cell_fastq_gz}

A one-stop-shop solution that would use Rust and be 10-100x faster would be much preferred :)

Thank you!