Primary language: Python. License: MIT.

UMI-dedup

Dependencies

UMI-dedup is tested mainly on Python 3 and is not guaranteed to work on Python 2. It requires the Python modules numpy, numba, and pysam.

Programs

Run any program with the -h argument for detailed help on command-line parameters.

extract_umi.py

This program extracts UMIs from Illumina sequence reads and adds them to the read metadata. It reads and writes FASTQ format and can be used with streams. This step is necessary before the FASTQ files are aligned to a reference genome.
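The idea can be illustrated with a minimal sketch. It assumes the UMI is the first 8 bases of each read and is appended to the read name after a ":" delimiter; the UMI position, the length, the delimiter, and all function names here are hypothetical and may not match what extract_umi.py actually does (run it with -h for its real parameters).

```python
def extract_umi(record, umi_length=8, delimiter=":"):
    """Move the first umi_length bases of a read into its name.

    record is a (name, sequence, quality) tuple. The UMI location, length,
    and delimiter are illustrative assumptions, not extract_umi.py's format.
    """
    name, seq, qual = record
    umi = seq[:umi_length]
    # Keep any comment after the first space in the header untouched.
    read_id, sep, comment = name.partition(" ")
    new_name = read_id + delimiter + umi + sep + comment
    # Trim the UMI bases and their quality scores from the read itself.
    return (new_name, seq[umi_length:], qual[umi_length:])


def parse_fastq(lines):
    """Yield (name, sequence, quality) from FASTQ lines, 4 lines per record."""
    it = iter(lines)
    for header in it:
        seq = next(it).rstrip("\n")
        next(it)  # skip the '+' separator line
        qual = next(it).rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual


def format_fastq(record):
    """Render a (name, sequence, quality) tuple as a 4-line FASTQ record."""
    name, seq, qual = record
    return "@{}\n{}\n+\n{}\n".format(name, seq, qual)


# In-memory example; a stream such as sys.stdin could be passed the same way.
example = ["@read1 1:N:0\n", "ACGTACGTTTGGCCAA\n", "+\n", "IIIIIIIIFFFFFFFF\n"]
converted = [extract_umi(rec) for rec in parse_fastq(example)]
```

Because parse_fastq consumes its input lazily, the same functions work on a pipe without loading the whole file into memory.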

dedup.py

This program marks or removes duplicate Illumina sequence reads in an aligned BAM file. It first detects optical duplicates by position on the flow cell, then detects PCR duplicates by UMI sequences. For each UMI among the reads that start at a given genome position, at least one read is assumed to be "distinct". Depending on the algorithm, the other reads are either treated as duplicates of the same template molecule or also counted as arising from distinct molecules. The available algorithms are:

  • naive: strictly one distinct read per UMI; this underestimates the number of distinct molecules when the number of reads at a given position is high and the number of possible UMI sequences is low.
  • weighted_average: average of the naive non-duplicate count and the total read count, weighted by the number of UMIs observed once vs. those not observed.
  • weighted_average2 (default): like weighted_average, but each observed count votes for a limit 1 higher than itself, weighted by the number of UMIs observed with that count.
  • cluster: fit a Poisson mixture model to each genomic position, with the number of clusters determined using the Bayesian Information Criterion (BIC). Each cluster corresponds to a different number of original reads.
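As a rough illustration of the naive algorithm and of why it underestimates, the sketch below counts distinct UMIs per position and computes how many distinct UMIs would be expected from a given number of molecules. It is not dedup.py's code: the function names are hypothetical, reads are reduced to (position, UMI) pairs, and the expectation assumes every possible UMI is equally likely.

```python
from collections import Counter, defaultdict


def naive_distinct_counts(reads):
    """Map each start position to (total reads, distinct UMIs).

    reads is an iterable of (position, umi) pairs. Under the naive rule,
    the number of distinct UMIs is the estimated number of molecules.
    """
    umis_by_position = defaultdict(Counter)
    for position, umi in reads:
        umis_by_position[position][umi] += 1
    return {
        position: (sum(counts.values()), len(counts))
        for position, counts in umis_by_position.items()
    }


def expected_distinct_umis(n_molecules, n_possible_umis):
    """Expected number of different UMIs seen among n_molecules molecules.

    Assumes (for illustration only) that UMIs are drawn uniformly at random.
    """
    return n_possible_umis * (1.0 - (1.0 - 1.0 / n_possible_umis) ** n_molecules)


reads = [(100, "ACGT"), (100, "ACGT"), (100, "TTTT"), (250, "ACGT")]
counts = naive_distinct_counts(reads)

# With only 16 possible UMIs, 32 true molecules can never yield more than
# 16 distinct UMIs, so the naive count falls well short of 32.
saturated = expected_distinct_umis(32, 16)
```

The naive estimate can never exceed the number of possible UMI sequences, which is the limitation the other algorithms are meant to address.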

dedup.py requires the input BAM file to have UMI annotations in the read names, as generated by extract_umi.py, and it requires the BAM file to be sorted by coordinate, though it does not need to be indexed. It outputs a BAM file that is still sorted by coordinate, with duplicates marked in the 0x400 flag bit. It can be used with streams.
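The 0x400 bit is the SAM flag for "PCR or optical duplicate". The short sketch below shows how that bit can be tested, set, and cleared on a plain integer flag; the helper names are made up for illustration. In pysam, the same bit is exposed as the is_duplicate attribute of an aligned read.

```python
DUPLICATE_FLAG = 0x400  # SAM flag bit for a PCR or optical duplicate (decimal 1024)


def is_duplicate(flag):
    """Return True if the duplicate bit is set in a SAM flag value."""
    return bool(flag & DUPLICATE_FLAG)


def mark_duplicate(flag):
    """Return the flag with the duplicate bit set; other bits are unchanged."""
    return flag | DUPLICATE_FLAG


def clear_duplicate(flag):
    """Return the flag with the duplicate bit cleared; other bits are unchanged."""
    return flag & ~DUPLICATE_FLAG


original = 99  # paired, proper pair, mate reverse strand, first in pair
marked = mark_duplicate(original)
```

Downstream tools that honor this bit can then skip marked reads without the reads having to be removed from the file.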

make_frequency_table.py

This program makes a table of UMI frequencies in a given data set. It reads either FASTQ or BAM data and outputs a simple plaintext table. Aligned BAM input is preferable, because unaligned FASTQ data may include reads that are misleading technical artifacts.
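A frequency table of this kind can be sketched in a few lines. The tab-separated layout, the sort order, and the function name below are assumptions for illustration; the actual output format of make_frequency_table.py may differ.

```python
from collections import Counter


def umi_frequency_table(umis):
    """Return a tab-separated table of UMI counts, most frequent first.

    umis is any iterable of UMI strings, e.g. parsed from read names.
    Ties are broken alphabetically so the output is deterministic.
    """
    counts = Counter(umis)
    rows = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return "".join("{}\t{}\n".format(umi, count) for umi, count in rows)


table = umi_frequency_table(["AC", "GT", "AC", "TT", "AC", "GT"])
```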

API

markdup_sam.DuplicateMarker

This class is a generator that takes an iterable of pysam.AlignedSegment instances (such as a pysam.Samfile) and yields pysam.AlignedSegment instances in the same order but with duplicates marked. As such it is essentially a stream editor.
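The stream-editor pattern can be illustrated with a toy stand-in. The class below is not markdup_sam.DuplicateMarker and makes no claim about its constructor or options: it uses a hypothetical Read namedtuple instead of pysam objects, ignores chromosome and strand, and applies only the naive one-read-per-UMI rule to position-sorted input.

```python
from collections import namedtuple

# Toy stand-in for an alignment record; real input would be pysam objects.
Read = namedtuple("Read", "name pos umi flag")


class ToyDuplicateMarker:
    """Yield position-sorted reads in their original order, flagging duplicates."""

    def __init__(self, reads):
        self.reads = reads

    def __iter__(self):
        current_pos, seen = None, set()
        for read in self.reads:
            if read.pos != current_pos:
                # New start position: forget the UMIs seen at the previous one.
                current_pos, seen = read.pos, set()
            if read.umi in seen:
                # Same UMI at the same position: set the 0x400 duplicate bit.
                read = read._replace(flag=read.flag | 0x400)
            else:
                seen.add(read.umi)
            yield read


reads = [
    Read("a", 100, "ACGT", 0),
    Read("b", 100, "ACGT", 0),
    Read("c", 100, "TTTT", 0),
    Read("d", 250, "ACGT", 16),
]
marked = list(ToyDuplicateMarker(reads))
```

Because the wrapper is itself iterable, it can sit between a reader and a writer in a loop, which is what makes the stream-editor description apt.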