OpenGene/fastp

Deduplication issues: one round of deduplication is ineffective + accuracy levels give inconsistent results [fastp v0.23.4]

Hi there,

We tried using fastp for deduplication and found two issues. We look forward to your reply.

  1. One round of deduplication is ineffective.
    Running level 1 deduplication on the raw input gave "Duplication rate: 0.498141%", and running level 6 deduplication on the same input gave "Duplication rate: 0.312492%". However, when we ran a second round of deduplication on the output of the first run, fastp still reported duplicates, with a duplication rate of just under 0.1% at most (see the logs below), so the first round did not remove all of them.

  2. Accuracy level issue.
    We ran level 1 deduplication first and then ran deduplication on its output at different accuracy levels. As the logs below show, level 1 + level 1 -> 0.00744113%, level 1 + level 3 -> 0.088817%, and level 1 + level 6 -> 0.0237203%. The rate does not change monotonically with the accuracy level, which doesn't make sense.

Read1 before filtering:
total reads: 15180846
total bases: 2277126900
Q20 bases: 2199749620(96.602%)
Q30 bases: 2075324182(91.1378%)

Read2 before filtering:
total reads: 15180846
total bases: 2277126900
Q20 bases: 2209710343(97.0394%)
Q30 bases: 2098006573(92.1339%)

Read1 after filtering:
total reads: 15105224
total bases: 2264474181
Q20 bases: 2187424528(96.5975%)
Q30 bases: 2063578514(91.1284%)

Read2 after filtering:
total reads: 15105224
total bases: 2264474181
Q20 bases: 2197319205(97.0344%)
Q30 bases: 2086050677(92.1208%)

Filtering result:
reads passed filter: 30361692
reads failed due to low quality: 0
reads failed due to too many N: 0
reads failed due to too short: 0
reads with adapter trimmed: 623982
bases trimmed due to adapters: 2636132

Duplication rate: 0.498141%

Insert size peak (evaluated by paired-end reads): 226

JSON report: fastp.json
HTML report: fastp.html

/projects/f_lz332_1/software/fastp -i /projects/f_lz332_1/DataBase/MetaGenomeData/Li_FrontMicro_2021_COVID/0.rawdata/ERR5445742_1.fastq.gz -I /projects/f_lz332_1/DataBase/MetaGenomeData/Li_FrontMicro_2021_COVID/0.rawdata/ERR5445742_2.fastq.gz -o ERR5445742_l1R1.fastq.gz -O ERR5445742_l1R2.fastq.gz --dedup --dup_calc_accuracy 1 --thread 16
fastp v0.23.4, time used: 80 seconds
Read1 before filtering:
total reads: 15105224
total bases: 2264474181
Q20 bases: 2187424528(96.5975%)
Q30 bases: 2063578514(91.1284%)

Read2 before filtering:
total reads: 15105224
total bases: 2264474181
Q20 bases: 2197319205(97.0344%)
Q30 bases: 2086050677(92.1208%)

Read1 after filtering:
total reads: 15104100
total bases: 2264308814
Q20 bases: 2187263387(96.5974%)
Q30 bases: 2063424187(91.1282%)

Read2 after filtering:
total reads: 15104100
total bases: 2264308814
Q20 bases: 2197157837(97.0344%)
Q30 bases: 2085895844(92.1206%)

Filtering result:
reads passed filter: 30210448
reads failed due to low quality: 0
reads failed due to too many N: 0
reads failed due to too short: 0
reads with adapter trimmed: 0
bases trimmed due to adapters: 0

Duplication rate: 0.00744113%

Insert size peak (evaluated by paired-end reads): 214

JSON report: fastp.json
HTML report: fastp.html

/projects/f_lz332_1/software/fastp -i ERR5445742_l1R1.fastq.gz -I ERR5445742_l1R2.fastq.gz -o ERR5445742_l1l1R1.fastq.gz -O ERR5445742_l1l1R2.fastq.gz --dedup --dup_calc_accuracy 1 --thread 16
fastp v0.23.4, time used: 79 seconds
Read1 before filtering:
total reads: 15105224
total bases: 2264474181
Q20 bases: 2187424528(96.5975%)
Q30 bases: 2063578514(91.1284%)

Read2 before filtering:
total reads: 15105224
total bases: 2264474181
Q20 bases: 2197319205(97.0344%)
Q30 bases: 2086050677(92.1208%)

Read1 after filtering:
total reads: 15091808
total bases: 2262463985
Q20 bases: 2185485043(96.5976%)
Q30 bases: 2061749882(91.1285%)

Read2 after filtering:
total reads: 15091808
total bases: 2262463985
Q20 bases: 2195369494(97.0345%)
Q30 bases: 2084200083(92.1208%)

Filtering result:
reads passed filter: 30210448
reads failed due to low quality: 0
reads failed due to too many N: 0
reads failed due to too short: 0
reads with adapter trimmed: 0
bases trimmed due to adapters: 0

Duplication rate: 0.088817%

Insert size peak (evaluated by paired-end reads): 214

JSON report: fastp.json
HTML report: fastp.html

/projects/f_lz332_1/software/fastp -i ERR5445742_l1R1.fastq.gz -I ERR5445742_l1R2.fastq.gz -o ERR5445742_l1l3R1.fastq.gz -O ERR5445742_l1l3R2.fastq.gz --dedup --dup_calc_accuracy 3 --thread 16
fastp v0.23.4, time used: 80 seconds
Read1 before filtering:
total reads: 15105224
total bases: 2264474181
Q20 bases: 2187424528(96.5975%)
Q30 bases: 2063578514(91.1284%)

Read2 before filtering:
total reads: 15105224
total bases: 2264474181
Q20 bases: 2197319205(97.0344%)
Q30 bases: 2086050677(92.1208%)

Read1 after filtering:
total reads: 15101641
total bases: 2263938008
Q20 bases: 2186907014(96.5975%)
Q30 bases: 2063090824(91.1284%)

Read2 after filtering:
total reads: 15101641
total bases: 2263938008
Q20 bases: 2196799311(97.0344%)
Q30 bases: 2085557436(92.1208%)

Filtering result:
reads passed filter: 30210448
reads failed due to low quality: 0
reads failed due to too many N: 0
reads failed due to too short: 0
reads with adapter trimmed: 0
bases trimmed due to adapters: 0

Duplication rate: 0.0237203%

Insert size peak (evaluated by paired-end reads): 214

JSON report: fastp.json
HTML report: fastp.html

/projects/f_lz332_1/software/fastp -i ERR5445742_l1R1.fastq.gz -I ERR5445742_l1R2.fastq.gz -o ERR5445742_l1l6R1.fastq.gz -O ERR5445742_l1l6R2.fastq.gz --dedup --dup_calc_accuracy 6 --thread 16
fastp v0.23.4, time used: 85 seconds
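
As a sanity check on the numbers above, here is a small Python sketch that recomputes the fraction of read pairs removed in each run from the "total reads" lines in the logs and compares it with the reported duplication rate. The counts are copied by hand from the four logs; the run labels and the helper name `removed_fraction_percent` are made up for this illustration and are not part of fastp.

```python
# Read-pair counts copied from the four fastp logs above.
# Each entry: (label, Read1 total reads before filtering,
#              Read1 total reads after filtering, reported duplication rate in %)
runs = [
    ("raw input, level 1", 15180846, 15105224, 0.498141),
    ("level 1 output, level 1", 15105224, 15104100, 0.00744113),
    ("level 1 output, level 3", 15105224, 15091808, 0.088817),
    ("level 1 output, level 6", 15105224, 15101641, 0.0237203),
]


def removed_fraction_percent(before: int, after: int) -> float:
    """Percentage of read pairs dropped between input and output."""
    return 100.0 * (before - after) / before


for label, before, after, reported in runs:
    computed = removed_fraction_percent(before, after)
    print(f"{label}: removed {before - after} pairs, "
          f"computed {computed:.6f}%, reported {reported}%")
```

In all four logs no reads failed the quality, N, or length filters, so the pairs missing from the output should all be ones dropped by `--dedup`. Under that assumption the computed percentage agrees with the reported duplication rate in every run, which suggests the second-round rates reflect pairs that were actually removed from the level 1 output (1124, 13416, and 3583 pairs at levels 1, 3, and 6).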