/mouse_filter

Filter mouse reads from a human PDX-derived .bam


.bam filtering tool for PDX samples

The goal is to remove all host DNA reads from a .bam and regenerate the FASTQ files for the standard alignment and variant-calling pipeline. The process assumed to lead up to this point is:

  1. Build a custom model organism genome:
    • Sequence normal tissue from the xenograft model organism (e.g. mouse).
    • Generate a list of germline variants comparing the mouse strain to mm10.
    • Generate a custom reference by altering mm10 with the germline variants.
  2. Align the PDX sample reads to the custom reference.

At this point, we need to filter out all the reads that appear to have come from the host:

  1. Isolate reads from all unaligned and imperfect alignments
  • Output:
    • Human_1.fq.gz
    • Human_2.fq.gz
    • ambiguous.bam
  2. Investigate the ambiguous reads with strain-specific annotation; these reads will likely be added to the FASTQ files.
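
The filtering step above can be sketched as a three-way split of SAM records. This is only an illustration and not the logic of read_bam.py: the function names, the labels, and the criteria are assumptions. Here a read counts as a perfect host match when it is mapped, its CIGAR is a single full-length match, and its NM tag is 0. Unmapped reads are treated as human candidates, and everything else is left as ambiguous.

```python
import re

UNMAPPED_FLAG = 0x4  # SAM flag bit: the read itself is unmapped


def edit_distance(fields):
    """Return the value of the optional NM:i tag, or None if it is absent."""
    for tag in fields[11:]:
        if tag.startswith("NM:i:"):
            return int(tag[5:])
    return None


def classify(sam_line):
    """Label one SAM alignment line as 'human', 'mouse' or 'ambiguous'.

    Assumed criteria (hypothetical, for illustration only):
      - unmapped read                      -> 'human'
      - single full-length match, NM == 0  -> 'mouse' (perfect host match)
      - any other alignment                -> 'ambiguous'
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    cigar = fields[5]

    if flag & UNMAPPED_FLAG:
        return "human"
    if re.fullmatch(r"\d+M", cigar) and edit_distance(fields) == 0:
        return "mouse"
    return "ambiguous"


example = "r2\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII\tNM:i:0"
label = classify(example)
```

Header lines starting with "@" are not alignment records and would need to be skipped before calling `classify`.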

The script can be run in two modes:

Python

usage: read_bam.py [-h] [-b BAM] -o OUTPUT [-c COMPRESSION]

Detect and isolate human reads from a bam file generated from human(SEQ)
aligned to mouse(REF). Accepts either: a file, or sam data piped from stdin.
NOTE: when reading from stdin, you must provide the SAM headers "@" via
samtools' -h flag.

optional arguments:
  -h, --help      show this help message and exit
  -b BAM          Input .bam (unsorted) [stdin]
  -o OUTPUT       Output stub e.g. Human.fastq
  -c COMPRESSION  Optional fq.gz compression rate [default: 4]
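
To illustrate what the -c option might control, here is a minimal sketch that writes FASTQ records through Python's gzip module at a chosen compression level. It assumes that COMPRESSION maps onto the gzip level (1 fastest, 9 smallest); the helper names are hypothetical and are not taken from read_bam.py.

```python
import gzip
import io


def format_fastq(name, seq, qual):
    """Render one read as a four-line FASTQ record."""
    return f"@{name}\n{seq}\n+\n{qual}\n"


def write_fastq_gz(records, fileobj, compresslevel=4):
    """Gzip-compress (name, seq, qual) records into an open binary file object.

    compresslevel=4 mirrors the documented default of -c; treating -c as the
    gzip level is an assumption made for this sketch.
    """
    with gzip.GzipFile(fileobj=fileobj, mode="wb", compresslevel=compresslevel) as gz:
        for name, seq, qual in records:
            gz.write(format_fastq(name, seq, qual).encode("ascii"))


# Round trip through an in-memory buffer instead of a real Human_1.fq.gz.
buffer = io.BytesIO()
write_fastq_gz([("read1/1", "ACGT", "IIII")], buffer)
round_trip = gzip.decompress(buffer.getvalue()).decode("ascii")
```

Lower levels trade file size for speed, which matters when the FASTQ output is large.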

Bash

$ ./stream_run.sh 
-b    Bam file to be processed
-o    output file stub
-h    this message

This mode gains some performance by piping the FASTQ data through gzip over the standard streams:

  • stdout -> _1.fq.gz
  • stderr -> _2.fq.gz

Logging information is captured in runlog.txt. However, because stderr carries the second FASTQ stream, error messages will be written to _2.fq.gz.
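
The stream layout described above can be sketched in Python as follows. This is a hedged illustration of the idea rather than the code behind stream_run.sh: `emit_pair` and `log` are hypothetical helpers. Mate 1 goes to stdout, mate 2 goes to stderr, and log messages go to a file, because both standard output streams are occupied by FASTQ data.

```python
import io
import sys


def emit_pair(mate1, mate2, out1=None, out2=None):
    """Write mate 1 to out1 (default stdout) and mate 2 to out2 (default stderr).

    Each mate is a (name, seq, qual) tuple; the caller is expected to pipe
    each stream through gzip into the _1.fq.gz and _2.fq.gz files.
    """
    out1 = sys.stdout if out1 is None else out1
    out2 = sys.stderr if out2 is None else out2
    for (name, seq, qual), stream in ((mate1, out1), (mate2, out2)):
        stream.write(f"@{name}\n{seq}\n+\n{qual}\n")


def log(message, path="runlog.txt"):
    """Append a log line to a file, since stderr is carrying FASTQ data."""
    with open(path, "a") as handle:
        handle.write(message + "\n")


# Demonstration with in-memory streams standing in for stdout and stderr.
first, second = io.StringIO(), io.StringIO()
emit_pair(("r/1", "AC", "II"), ("r/2", "GT", "II"), first, second)
```

Any uncaught exception or warning still prints to stderr, which is why stray error text can end up inside _2.fq.gz.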