Per-cell bams or fastqs from the mega-bam?

Question

Per-cell bams or fastqs from the mega-bam?

olgabot opened this issue 4 years ago · 1 comments

Hello,
Would it be possible to create 1000s of per-cell fast{q,a}s (preferred) or per-cell bams from the mega-bam of the 10x channel run? We wrote a tool to do this in Python, but it is slow and takes a lot of memory: http://github.com/czbiohub/bam2fasta In the end, performing 1000s of grep jobs was faster than doing it in one go with Python.
Thank you!
Warmest,
Olga

Answer 1 · 2021-03-08T21:31:32.000Z

Hello! Any updates on this? This is my current workaround in the kmermaid pipeline:

Extract aligned reads to fastq:

    samtools view -ub -F 4 ${bam} \\
        | samtools fastq --threads ${task.cpus} -T ${tenx_tags} - \\
        | gzip -c - \\
          > ${reads}

Extract unaligned reads to fastq:

    samtools view -f4 ${bam} \\
      | grep -E '${tenx_cell_barcode_pattern}' \\
      | samtools fastq --threads ${task.cpus} -T ${tenx_tags} - \\
      | gzip -c - \\
        > ${reads} \\
      || touch ${reads}

Then running bam2fasta count_umis_per_cell:

        bam2fasta count_umis_percell \\
            --filename ${reads} \\
            --min-umi-per-barcode ${tenx_min_umi_per_cell} \\
            --cell-barcode-pattern '${tenx_cell_barcode_pattern}' \\
            --molecular-barcode-pattern '${tenx_molecular_barcode_pattern}' \\
            --write-barcode-meta-csv ${umis_per_cell} \\
            --barcodes-significant-umis-file ${good_barcodes}

Then using ripgrep to search for each cell barcode in the aligned and unaligned fastqs which has UMIs above a certain threshold:

    rg \\
      --search-zip \\
      --after-context 3 \\
      --threads ${task.cpus} \\
      '${this_cell_barcode}' \\
      ${reads} \\
      | gzip -c - \\
      > ${this_cell_fastq_gz} || touch ${this_cell_fastq_gz}

A one-stop-shop solution that would use Rust and be 10-100x faster would be much preferred :)

Thank you!