nubeam-dedup
is a fast and easy-to-use bioinformatics tool removing exact PCR duplicates for sequencing reads, single-end or paired-end. We appreciate your interest in nubeam-dedup
. If you use nubeam-dedup
, please kindly cite:
Hang Dai and Yongtao Guan, Nubeam-dedup: a fast and RAM-efficient tool to de-duplicate sequencing reads without mapping. Bioinformatics 36(10), P.3254-3256, (2020) DOI: 10.1093/bioinformatics/btaa112
Run the following commands:
On Linux
:
foo@bar:~$ wget --no-check-certificate --content-disposition https://github.com/daihang16/nubeamdedup/archive/master.zip
foo@bar:~$ unzip nubeamdedup-master.zip
foo@bar:~$ cd nubeamdedup-master/Linux/
On macOS
:
foo@bar:~$ curl -LJO https://github.com/daihang16/nubeamdedup/archive/master.zip
foo@bar:~$ unzip nubeamdedup-master.zip
foo@bar:~$ cd nubeamdedup-master/macOS/
Then:
foo@bar:Linux$ make && make clean
foo@bar:Linux$ ./nubeam-dedup -i1 ../toydata/1.fq.gz -i2 ../toydata/2.fq.gz 1> out.txt 2> log.txt
foo@bar:Linux$ cat log.txt
Output unique read pairs read 1 to nubeamdedup/Linux/1.uniq.fastq
Output unique read pairs read 2 to nubeamdedup/Linux/2.uniq.fastq
foo@bar:Linux$ cat out.txt
69221/142250 read pairs are unique.
foo@bar:Linux$ wc -l *.fastq
276884 1.uniq.fastq
276884 2.uniq.fastq
553768 total
You should see the expected output as above.
We also offer pre-compiled executable file for Linux. The executable file was compiled on Ubuntu 18.04.2 LTS by compiler gcc with the version of 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04). C++11 was used.
./nubeam-dedup -h
gives you the following messages:
./nubeam-dedup [-i -o -d -i1 -i2 -o1 -o2 -d1 -d2 -s -r -z -h]
Remove exact PCR duplicates for sequencing reads in (gzipped) fastq format.
Produces de-duplicated reads in fastq files.
Parameters for single-end (SE) reads:
--in or -i: Input file name. The parameter is mandatory for SE reads.
--out or -o: Output file name for unique reads. The default is input file name prefix appended with '.uniq.fastq(.gz)', under the current directory.
--duplicate or -d: File name for removed duplicated reads. The parameter is only valid when --remove or -r is set as 1 (see below). The default is input file name prefix appended with '.removed.fastq(.gz)', under the current directory.
Parameters for paired-end (PE) reads:
--in1 or -i1: Input file name for read 1 file. The parameter is mandatory for PE reads.
--in2 or -i2: Input file name for read 2 file. The parameter is mandatory for PE reads.
--out1 or -o1: Output file name for unique read pairs read 1 file. The default is read 1 file name prefix appended with '.uniq.fastq(.gz)', under the current working directory.
--out2 or -o2: Output file name for unique read pairs read 2 file. The default is read 2 file name prefix appended with '.uniq.fastq(.gz)', under the current working directory.
--duplicate1 or -d1: File name for removed duplicated read pairs read 1 file. The parameter is only valid when --remove or -r is set as 1 (see below). The default is input file name prefix appended with '.removed.fastq(.gz)', under the current directory.
--duplicate2 or -d2: File name for removed duplicated read pairs read 2 file. The parameter is only valid when --remove or -r is set as 1 (see below). The default is input file name prefix appended with '.removed.fastq(.gz)', under the current directory.
Miscellaneous parameters:
--strand or -s: Whether take reads from complementary strand into account. Accept boolean 1 (default) or 0.
--remove or -r: Whether output removed duplicated reads. Accept boolean 0 (default) or 1.
--gz or -z: Compression level of output file. Accept integer 0 (default) to 9. If 0 (default), the output data will not be compressed and will be written to plain text file; otherwise, the output data will be written to gzip format file, with the compression level suggested by user. If compression is needed, a compression level less than 3 is recommended as a compromise between speed and compression.
-h: print this help
-
For single-end reads
-
Consider reads from complementary strand (default)
./nubeam-dedup -i read.fq
The command gives the following output on screen:
Output unique reads to /current/working/directory/read.uniq.fastq x/y reads are unique.
-
Do not consider reads from complementary strand
./nubeam-dedup -i read.fq -s 0
-
Consider reads from complementary strand (default), output gzipped file with a compression level of 6, output removed duplicated reads
./nubeam-dedup -i read.fq -z 6 -r 1
The command gives the following output on screen:
Output removed duplicated reads to /current/working/directory/read.removed.fastq.gz Output unique reads to /current/working/directory/read.uniq.fastq.gz x/y reads are unique.
-
-
For paired-end reads
-
Consider reads from complementary strand (default)
./nubeam-dedup -i1 read1.fastq.gz -i2 read2.fastq.gz
The command gives the following output on screen:
Output unique read pairs read 1 to /current/working/directory/reads1.uniq.fastq Output unique read pairs read 2 to /current/working/directory/reads2.uniq.fastq x/y read pairs are unique.
-
Do not consider reads from complementary strand
./nubeam-dedup -i1 read1.fastq.gz -i2 read2.fastq.gz -s 0
-
Consider reads from complementary strand (default), output gzipped file with a compression level of 2, output removed duplicated reads
./nubeam-dedup -i1 read1.fastq.gz -i2 read2.fastq.gz -z 2 -r 1
The command gives the following output on screen:
Output removed duplicated read pairs read 1 to /current/working/directory/read1.removed.fastq.gz Output removed duplicated read pairs read 2 to /current/working/directory/read2.removed.fastq.gz Output unique read pairs read 1 to /current/working/directory/reads1.uniq.fastq.gz Output unique read pairs read 2 to /current/working/directory/reads2.uniq.fastq.gz x/y read pairs are unique.
-
- A large value (like 6) for
-z
tag might significantly increase the running time. From Figures 1, 2 and 7 in this post,-z 6
would increase the amount of time by a factor of 2.5-3 compared with-z 1
(with a limited gain regarding to compression ratio); and-z 1
would increase the amount of time by a factor of 2.5 compared with-z 0
, which is the default setting ofnubeam-dedup
. The recommended practice is: either use a smaller compression level (1-3) or do not use the-z
tag at all. For the latter choice, if compression was required, pigz could be used afternubeam-dedup
finishes---this can significantly accelerate the compression. - For the convenience of users,
nubeam-dedup
outputs the number of unique reads and total reads tostdout
and the output file information tostderr
. Use1>
and2>
to redirect the two streams respectively.