Discrepancy between filtering results and reads after filtering
xapple opened this issue · 3 comments
Here is the head of the file stats_fastp.json for a random single-end Illumina sequencing sample:
{
    "summary": {
        "fastp_version": "0.23.4",
        "sequencing": "single end (75 cycles)",
        "before_filtering": {
            "total_reads": 19014947,
            "total_bases": 1426121025,
            "q20_bases": 1368126463,
            "q30_bases": 1340057991,
            "q20_rate": 0.959334,
            "q30_rate": 0.939652,
            "read1_mean_length": 75,
            "gc_content": 0.501123
        },
        "after_filtering": {
            "total_reads": 10933431,
            "total_bases": 780019338,
            "q20_bases": 758743644,
            "q30_bases": 744983654,
            "q20_rate": 0.972724,
            "q30_rate": 0.955084,
            "read1_mean_length": 71,
            "gc_content": 0.498169
        }
    },
    "filtering_result": {
        "passed_filter_reads": 18724357,
        "low_quality_reads": 1329,
        "too_many_N_reads": 7,
        "too_short_reads": 289254,
        "too_long_reads": 0
    },
The file was generated by running the sample through fastp with the following command:
$ fastp --detect_adapter_for_pe --overrepresentation_analysis --dedup --correction --cut_right --thread 10 --in1 fwd.fastq.gz --out1 clean/fwd.fastq.gz --unpaired1 clean/fwd.fastq.singletons.fastq --html stats_fastp.html --json stats_fastp.json
We can see that under after_filtering there are 10'933'431 reads left in the cleaned FASTQ. However, the filtering_result category tells us that as many as 18'724'357 reads passed the filter. This is a huge mismatch. What happened to the roughly 8 million missing reads? Why were they removed?
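For reference, the mismatch is easy to reproduce from the report itself. Below is a minimal sketch in Python, assuming the field layout shown in the head of the file above (the file name is whatever was passed to --json):

import json

# Load the report written by fastp's --json flag.
with open("stats_fastp.json") as handle:
    stats = json.load(handle)

passed = stats["filtering_result"]["passed_filter_reads"]
remaining = stats["summary"]["after_filtering"]["total_reads"]

print(f"passed_filter_reads:         {passed:>12,}")            # 18,724,357
print(f"after_filtering total_reads: {remaining:>12,}")         # 10,933,431
print(f"unaccounted for:             {passed - remaining:>12,}") # 7,790,926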
I have the same issue here. It happened after I included the flags to filter out duplicated reads and low-complexity reads. Without those two flags, the numbers seemed to match each other:
Read1 after filtering:
total reads: 9899483
total bases: 969140514
Q20 bases: 930274034(95.9896%)
Q30 bases: 849902013(87.6965%)
Read2 after filtering:
total reads: 9899483
total bases: 968730404
Q20 bases: 922809589(95.2597%)
Q30 bases: 846947592(87.4286%)
Filtering result:
reads passed filter: 19798966
reads failed due to low quality: 3232674
reads failed due to too many N: 206
reads failed due to too short: 111888936
reads with adapter trimmed: 58014749
bases trimmed due to adapters: 1885062968
Duplication rate: 79.9009%
Maybe it is due to deduplication?
I am seeing the same thing. I noticed that about 10 to 20% of my metagenome reads are removed after deduplication. From the report it is unclear that those reads were removed, although a percentage is given in fastp's screen output.
When I ran fastp with --dedup on the already-cleaned dataset, the duplication level dropped to almost zero.
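For now, an approximate count can be recovered from fields that are already in the JSON report. The sketch below (Python) rests on two assumptions: that deduplication accounts for the whole gap between passed_filter_reads and the after_filtering totals, and that the report contains the top-level duplication block fastp normally writes:

import json

with open("stats_fastp.json") as handle:
    stats = json.load(handle)

passed = stats["filtering_result"]["passed_filter_reads"]
remaining = stats["summary"]["after_filtering"]["total_reads"]

# Assumption: every read that passed the listed filters but is missing from
# the output was dropped by --dedup; fastp does not report this directly.
inferred_duplicates = passed - remaining
print(f"reads removed because duplicated (inferred): {inferred_duplicates:,}")

# Rough cross-check against the estimated duplication rate; this will not
# match exactly, since the rate is estimated on the input reads.
dup_rate = stats["duplication"]["rate"]
before = stats["summary"]["before_filtering"]["total_reads"]
print(f"duplication rate x input reads (approx):     {before * dup_rate:,.0f}")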
Still, it would be really great if one line were added to the report:
"reads removed because duplicated"