Fastqdedup deduplicates FASTQ files with UMI data while taking sequencing errors into account. Fastqdedup is based on HUMID and was created to explore an alternative implementation. HUMID is currently more actively developed.
Fastqdedup can operate on one or multiple FASTQ files. So single reads with prepended UMIs, paired reads with prepended UMIs or paired reads with separate UMI file are all possible.
It executes the following steps:
- Read all sequences
- Filter low quality sequences (can be turned off).
- Determine clusters based on Hamming distance (or Levenshtein when set).
- Determine which reads are distinct in each cluster. Put these in a set. This uses the 'directional' method as described in the UMI-tools paper. Several methods are available.
- Read all sequence records again from the FASTQ files. Check against the set. If a sequence is present the sequence record is output to the output FASTQ files and the sequence is removed from the set so that it is not included again.
To install the latest development version:
pip install git+https://github.com/rhpvorderman/fastqdedup.git
Since all the sequences are stored in memory the memory usage is between
approximately 45-110% of the size of the original uncompressed FASTQ files
depending on read length and similarities between reads.
The memory usage can be substantially reduced by setting --check-lengths
.
usage: fastqdedup [-h] [-l CHECK_LENGTHS] [-o OUTPUT] [-p PREFIX] [-d MAX_DISTANCE] [-e MAX_AVERAGE_ERROR_RATE] [-E] [--edit] [-c {highest_count,adjacency,directional}] [-v] [-q] FASTQ [FASTQ ...] positional arguments: FASTQ Forward FASTQ and optional reverse and UMI FASTQ files. optional arguments: -h, --help show this help message and exit -l CHECK_LENGTHS, --check-lengths CHECK_LENGTHS Comma-separated string with the maximum string check length of each file. For example 'fastqdedup --check- lengths 16,8 R1.fastq R2.fastq' only checks the first 16 bases of R1 and the first 8 bases of R2 for duplication. Supports slice notation such as '4:8' or '::8'. -o OUTPUT, --output OUTPUT Output file (optional), must be specified multiple times for multiple input files. For example ``fastqdedup -o dedupR1.fastq -o dedupR2.fastq R1.fastq R2.fastq``. -p PREFIX, --prefix PREFIX Prefix for the output files. Default: 'fastqdedup_R' -d MAX_DISTANCE, --max-distance MAX_DISTANCE The Hamming distance at which inputs are considered different. Default: 1. -e MAX_AVERAGE_ERROR_RATE, --max-average-error-rate MAX_AVERAGE_ERROR_RATE The maximum average per base error rate for each FASTQ record. Average is evaluated over bases taken into account by --check-lengths.Default: 0.001 -E, --no-average-error-rate-filter Do not filter on average per base error rate. --edit Use edit (Levenshtein) distance instead of Hamming distance. -c {highest_count,adjacency,directional}, --cluster-dissection-method {highest_count,adjacency,directional} How to approach clusters with multiple reads. 'highest_count' selects only one read, the one with the highest count. 'adjacency' starts from the read with the highest count and selects all reads that are within the specified distance. The process is repeated for the remaining reads. 'directional' is similar to adjacency but uses counts to determine if an error is a PCR/sequencing artifact or derived from a difference in the molecule (default). -v, --verbose Increase log verbosity. -q, --quiet Reduce log verbosity.
Fastqdedup was heavily inspired by @jfjlaros's trie implementation. A trie is a way to store sequences in a tree format. A disadvantage of this format is that memory usage is quite high since every character is stored as a node in the tree, and a node uses a lot more space than a single character.
Fastqdedup utilizes the following optimizations.
- Radix tree suffixes. While a lot of similarities between FASTQ files are basically guaranteed across the first 8 characters (65536 possible sequences) the amount of possible sequences explodes after that. Therefore a trie of FASTQ sequences contains a lot of branches that consists of nodes that only link one parent to one child for quite a long length. Fastqdedup stores terminating branches as strings rather than nodes, saving a lot of memory.
- Variable size nodes. The trie is initialized with an
ACGTN
alphabet, whereA
has index0
,C
has index1
etc. Nodes are sized such that they can contain the character with the highest index that is in the node. - Fast fail Hamming (or Levenshtein) distances. Since the most likely result is that the sequence is not seen before, the algorithm gives up as quickly as it finds the Hamming distance will be exceeded.
Illumina reads have an error rate of approximately 1 in 1000. This means that
the chance of a sequencing error in a 150bp illumina read is
1 - ((1 - (1/1000)) ^ 150) ~= 14%
. This means that most (86%) illumina
reads are correct and that it is hard to distinguish between reads with one
SNP and reads with one sequencing error. It is also hard to distinguish between
sequencing replicates and actual biological replicates.
Unique Molecular Identifiers (UMIs) solve this problem by prepending each read
with a short sequence of bases. With a UMI length of 8, there are 65536
(4^8
) possible UMIs. The chance of two biological replicates having the
same UMI is 1 in 65536. Therefore two sequences with the same UMI and only one
base pair different are probably derived from the same biological replicate
since there is a 14% chance of a sequencing error.