Mark duplicate reads from PacBio sequencing of an amplified library
pbmarkdup takes one or multiple sequencing chips of an amplified library as HiFi reads and marks or removes duplicates.
The latest pbmarkdup can be installed via the bioconda package pbmarkdup.
Please refer to our official pbbioconda page for information on Installation, Support, License, Copyright, and Disclaimer.
Version 1.0.3: Full changelog here
Input: HiFi reads from one or multiple movies in PacBio BAM (.ccs.bam),
PacBio dataset (.consensusreadset.xml), file of file names (.fofn),
FASTQ (optionally gzipped), or FASTA (optionally gzipped) format.
Output: HiFi reads with duplicates marked in a format inferred from the
file extension: HiFi BAM (.bam); FASTQ (.fastq); FASTA (.fasta);
bgzipped FASTQ (.fastq.gz); bgzipped FASTA (.fasta.gz); or SMRT Link XML
(.consensusreadset.xml), which also generates a corresponding BAM file.
Run on a single movie:
pbmarkdup movie.ccs.bam output.bam
Run on multiple movies:
pbmarkdup movie1.fasta movie2.fasta output.fasta
Run on multiple movies and output duplicates in a separate file:
pbmarkdup movie1.ccs.bam movie2.fastq uniq.fastq --dup-file dups.fasta
pbmarkdup employs a reference-free HiFi comparison scheme to avoid biases from alignment-based methods. A naïve implementation would perform all-vs-all alignments to determine whether two reads originate from the same molecule and exist only as PCR duplicates. This would be abysmally slow. Instead, pbmarkdup relies on a hierarchical comparison workflow to rapidly detect reads that are likely from the same molecule.
- Compare each read to all reads that are shorter by at most 10 bp
- Check if their minimizers are similar
- Perform alignment of the first and last 250 bp in both directions
- Cluster reads and deterministically determine cluster representative
Additional heuristics are employed to enable ultra-fast comparison.
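The four steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not pbmarkdup's actual implementation: the minimizer parameters `K` and `W`, the thresholds `MIN_JACCARD` and `MIN_IDENTITY`, the use of `difflib.SequenceMatcher` as a stand-in for a real aligner, and the rule of picking the lexicographically smallest read name as representative are all assumptions made for this example. Only the 10 bp length window and the 250 bp end length come from the description above.

```python
import random
from difflib import SequenceMatcher

LEN_WINDOW = 10      # from the text: compare to reads at most 10 bp shorter
END_LEN = 250        # from the text: first and last 250 bp
K, W = 11, 5         # hypothetical minimizer k-mer and window size
MIN_JACCARD = 0.5    # hypothetical minimizer similarity threshold
MIN_IDENTITY = 0.9   # hypothetical end similarity threshold

_COMP = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    return seq.translate(_COMP)[::-1]


def canonical_minimizers(seq, k=K, w=W):
    """Strand-independent set of window minimizers."""
    kmers = [min(seq[i:i + k], revcomp(seq[i:i + k]))
             for i in range(len(seq) - k + 1)]
    if not kmers:
        return set()
    return {min(kmers[i:i + w]) for i in range(max(1, len(kmers) - w + 1))}


def ends_match(a, b):
    """Compare first and last END_LEN bases, trying both orientations of b."""
    def similar(x, y):
        # SequenceMatcher is a stand-in for a proper pairwise alignment.
        return SequenceMatcher(None, x, y, autojunk=False).ratio() >= MIN_IDENTITY
    return any(similar(a[:END_LEN], c[:END_LEN]) and similar(a[-END_LEN:], c[-END_LEN:])
               for c in (b, revcomp(b)))


def mark_duplicates(reads):
    """reads: dict name -> sequence. Returns dict name -> representative name."""
    order = sorted(reads, key=lambda n: (len(reads[n]), n))
    mins = {n: canonical_minimizers(reads[n]) for n in reads}
    parent = {n: n for n in reads}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(order):
        for b in reversed(order[:i]):            # shorter or equal-length reads
            if len(reads[a]) - len(reads[b]) > LEN_WINDOW:
                break                            # step 1: length window
            union = mins[a] | mins[b]
            if not union or len(mins[a] & mins[b]) / len(union) < MIN_JACCARD:
                continue                         # step 2: minimizer check
            if ends_match(reads[a], reads[b]):   # step 3: end comparison
                lo, hi = sorted((find(a), find(b)))
                parent[hi] = lo                  # step 4: smallest name wins
    return {n: find(n) for n in reads}


# Toy data: reads 2 and 3 are copies of read 1 (one mismatch, reverse strand).
rng = random.Random(0)
r1 = "".join(rng.choice("ACGT") for _ in range(600))
r2 = r1[:300] + ("A" if r1[300] != "A" else "C") + r1[301:]
reads = {
    "m/1/ccs": r1,
    "m/2/ccs": r2,
    "m/3/ccs": revcomp(r1),
    "m/4/ccs": "".join(rng.choice("ACGT") for _ in range(600)),
    "m/5/ccs": r1 + "".join(rng.choice("ACGT") for _ in range(50)),
}
result = mark_duplicates(reads)
```

In this toy run, reads 1 to 3 end up in one cluster represented by `m/1/ccs`, while the unrelated read 4 and the 50 bp longer read 5 stay on their own because they fail the minimizer check and the length window, respectively.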
A low memory footprint has been a primary focus of pbmarkdup. A naïve approach
to rapid comparison of HiFi reads would store all reads in memory.
As an example, a 20x human dataset would consume 60 GB of memory just to store
the sequences; minimizers can take a multiple of that, resulting in >200 GB
memory consumption.
Instead, pbmarkdup only stores the first and last 250 bp of each read
with 2 byte/base and aggressively thins minimizers. This results in 8 GB memory
for a 60 GB dataset (unreleased version 0.3.0).
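A back-of-the-envelope calculation illustrates the scale of the saving. The genome size of 3 Gbp, the one byte per base for full-sequence storage, and the mean HiFi read length of 15 kb are assumptions chosen for this example; the 20x coverage, the 250 bp ends, and the 2 byte/base figure are taken from the text above.

```python
GENOME_BP = 3_000_000_000   # assumed human genome size
COVERAGE = 20               # from the text
MEAN_READ_BP = 15_000       # assumed mean HiFi read length
END_BP = 250                # from the text: first and last 250 bp are kept
BYTES_PER_BASE = 2          # from the text

total_bases = GENOME_BP * COVERAGE

# Storing every base at one byte each (assumption) gives the full-sequence cost.
full_gb = total_bases / 1e9

# Keeping only both ends of every read depends on read count, not read length.
n_reads = total_bases // MEAN_READ_BP
ends_gb = n_reads * 2 * END_BP * BYTES_PER_BASE / 1e9

print(f"full sequences: {full_gb:.0f} GB, read ends only: {ends_gb:.0f} GB")
```

Under these assumptions the read ends account for a few GB, so the total quoted above would also have to cover the thinned minimizers and other bookkeeping.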
To keep the memory footprint to a minimum, pbmarkdup reads the input files twice instead of storing everything in memory. The goal was to support processing multiple movies on a standard server.
No. The purpose of this tool is to mark PCR duplicates for amplicon HiFi libraries after CCS calling.
Duplicates have annotated read names in FASTA and FASTQ. The FASTA example
>m64013_191117_050515/67104/ccs DUPLICATE=m64013_191117_050515/3802014/ccs DS=2
shows a marked duplicate read m64013_191117_050515/67104/ccs that is a duplicate
of m64013_191117_050515/3802014/ccs; this molecule has been sequenced 2 times.
Accordingly, the read that has been selected as the representative of the
molecule is named:
>m64013_191117_050515/3802014/ccs DS=2
In BAM format, the tag ds:i: provides the number of reads of this molecule and
the tag du:Z: the name of the representative read.
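For downstream scripts, the FASTA/FASTQ annotation can be read back with a small parser. The following is a minimal sketch, assuming the header layout is exactly as in the example above (read name, then whitespace-separated KEY=VALUE fields); the function name `parse_header` and the returned dictionary keys are hypothetical and not part of pbmarkdup.

```python
def parse_header(line):
    """Split a pbmarkdup-style FASTA/FASTQ header into name and annotations."""
    fields = line.strip().lstrip(">@").split()
    name = fields[0]
    # Remaining fields are assumed to be KEY=VALUE pairs such as DS=2.
    tags = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
    return {
        "name": name,
        "is_duplicate": "DUPLICATE" in tags,
        # For the representative read there is no DUPLICATE field.
        "representative": tags.get("DUPLICATE", name),
        "num_reads": int(tags["DS"]) if "DS" in tags else None,
    }


dup = parse_header(
    ">m64013_191117_050515/67104/ccs "
    "DUPLICATE=m64013_191117_050515/3802014/ccs DS=2"
)
rep = parse_header(">m64013_191117_050515/3802014/ccs DS=2")
```

Both records resolve to the same representative name, which makes it easy to group reads by molecule.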
Yes, use --log-level INFO to get the following summary:
LIBRARY READS UNIQUE MOLECULES DUPLICATE READS
---------- ---------- ------------------ -----------------
<Unnamed> 25000 24948 (99.8%) 52 (0.2%)
SS-lib 496 493 (99.4%) 3 (0.6%)
---------- ---------- ------------------ -----------------
TOTAL 25496 25441 (99.8%) 55 (0.2%)
Each row lists one library, identified by the LB tag of the BAM @RG header or
as <Unnamed> for FASTA/Q input, together with the number and percentage of
unique molecules and duplicate reads.
Use --cross-library to mark duplicates regardless of library names.
Use --dup-file FILE to store duplicates in an extra file. The format of this
file can differ from that of the output file.
Use --rmdup.
Inputs as rows, outputs as columns:
IN/OUT | BAM | DATASET | FASTQ | FASTA |
---|---|---|---|---|
BAM | x | x | x | x |
DATASET | x | x | x | x |
FASTQ |  |  | x | x |
FASTA |  |  |  | x |
Allowed combination example:
pbmarkdup movie1.ccs.bam movie2.fastq movie3.fasta out.fasta
Forbidden combination example:
pbmarkdup movie2.fastq movie3.fasta out.fastq
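The rule behind these two examples can be expressed as a small pre-flight check. This is a sketch under assumptions, not pbmarkdup's own validation: the matrix below is read off the examples in this README (BAM and dataset inputs can produce any output, FASTQ inputs only FASTQ or FASTA, FASTA inputs only FASTA), and the helper names `detect_format` and `combination_allowed` are hypothetical. File-of-file-names inputs are left out because their contents would have to be inspected.

```python
ALL_FORMATS = {"BAM", "DATASET", "FASTQ", "FASTA"}

# Input format -> output formats it can be written to (assumed from the examples).
ALLOWED = {
    "BAM": ALL_FORMATS,
    "DATASET": ALL_FORMATS,
    "FASTQ": {"FASTQ", "FASTA"},
    "FASTA": {"FASTA"},
}

# Longest suffixes first so that ".consensusreadset.xml" wins over shorter ones.
SUFFIXES = [
    (".consensusreadset.xml", "DATASET"),
    (".fastq.gz", "FASTQ"),
    (".fasta.gz", "FASTA"),
    (".fastq", "FASTQ"),
    (".fasta", "FASTA"),
    (".bam", "BAM"),
]


def detect_format(path):
    """Infer the format from the file extension, as the README describes."""
    for suffix, fmt in SUFFIXES:
        if path.endswith(suffix):
            return fmt
    raise ValueError(f"unrecognized file extension: {path}")


def combination_allowed(inputs, output):
    """True if every input format can be converted to the output format."""
    out_fmt = detect_format(output)
    return all(out_fmt in ALLOWED[detect_format(p)] for p in inputs)


ok = combination_allowed(
    ["movie1.ccs.bam", "movie2.fastq", "movie3.fasta"], "out.fasta")
bad = combination_allowed(["movie2.fastq", "movie3.fasta"], "out.fastq")
```

The second call fails because the FASTA input carries no base qualities, so the combined output cannot be FASTQ.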
Yes. With --log-level INFO, pbmarkdup provides status to stderr.
PacBio® tool pbmarkdup, distributed via Bioconda, is licensed under BSD-3-Clause-Clear.
- 1.0.3:
  - Revio release for SMRT Link 12.0
- 1.0.2:
  - Official SMRT Link 10.0 release
  - Minor internal fix in memory compression
- 1.0.0:
  - Official SMRT Link 9.0 release
  - Improve memory consumption
- 0.2.0:
  - Initial release
THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.