OpenGene/fastp

Discrepancy in filtering results and reads after filtering

xapple opened this issue · 3 comments

xapple commented

Here is the head of the file stats_fastp.json for a random single-end Illumina sequencing sample:

{
        "summary": {
                "fastp_version": "0.23.4",
                "sequencing": "single end (75 cycles)",
                "before_filtering": {
                        "total_reads":19014947,
                        "total_bases":1426121025,
                        "q20_bases":1368126463,
                        "q30_bases":1340057991,
                        "q20_rate":0.959334,
                        "q30_rate":0.939652,
                        "read1_mean_length":75,
                        "gc_content":0.501123
                },
                "after_filtering": {
                        "total_reads":10933431,
                        "total_bases":780019338,
                        "q20_bases":758743644,
                        "q30_bases":744983654,
                        "q20_rate":0.972724,
                        "q30_rate":0.955084,
                        "read1_mean_length":71,
                        "gc_content":0.498169
                }
        },
        "filtering_result": {
                "passed_filter_reads": 18724357,
                "low_quality_reads": 1329,
                "too_many_N_reads": 7,
                "too_short_reads": 289254,
                "too_long_reads": 0
        },

after running it through fastp with the following command:

$ fastp --detect_adapter_for_pe --overrepresentation_analysis --dedup --correction --cut_right --thread 10 --in1 fwd.fastq.gz --out1 clean/fwd.fastq.gz --unpaired1 clean/fwd.fastq.singletons.fastq --html stats_fastp.html --json stats_fastp.json

We can see that after_filtering there are 10'933'431 reads left in the cleaned FASTQ. However, the filtering_result section tells us that as many as 18'724'357 reads passed the filter. This is a huge mismatch. What happened to the roughly 8 million missing reads? Why were they removed?
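For what it's worth, the failure reasons listed in filtering_result (1'329 + 7 + 289'254 = 290'590) account exactly for the drop from 19'014'947 to 18'724'357, so the quality, N and length filters cannot explain the remaining gap. The gap itself can be computed directly from the JSON report; here is a minimal sketch in Python, assuming the report is named stats_fastp.json as in the command above:

import json

# Load the fastp JSON report (file name taken from the command above).
with open("stats_fastp.json") as handle:
    report = json.load(handle)

passed = report["filtering_result"]["passed_filter_reads"]
kept = report["summary"]["after_filtering"]["total_reads"]

# Reads that passed the listed filters but are absent from the cleaned
# FASTQ -- presumably the ones dropped by --dedup, since the report has
# no explicit counter for reads removed as duplicates.
print("passed filter:    ", passed)
print("in cleaned FASTQ: ", kept)
print("unaccounted for:  ", passed - kept)

With the values quoted above, the last line comes out to 7'790'926, which is the roughly 8 million reads in question.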

I have the same issue here. It happened after I included the flags to filter out duplicated reads and low-complexity reads. Without those two flags, the numbers seemed to match each other.

Read1 after filtering:
total reads: 9899483
total bases: 969140514
Q20 bases: 930274034(95.9896%)
Q30 bases: 849902013(87.6965%)

Read2 after filtering:
total reads: 9899483
total bases: 968730404
Q20 bases: 922809589(95.2597%)
Q30 bases: 846947592(87.4286%)

Filtering result:
reads passed filter: 19798966
reads failed due to low quality: 3232674
reads failed due to too many N: 206
reads failed due to too short: 111888936
reads with adapter trimmed: 58014749
bases trimmed due to adapters: 1885062968

Duplication rate: 79.9009%

Maybe it is due to the deduplication?
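For comparison, a quick arithmetic check on the excerpt above (purely a sanity check on the quoted numbers, not fastp output):

# Read1 + Read2 after filtering versus "reads passed filter" from the
# stdout excerpt above.
read1_after = 9_899_483
read2_after = 9_899_483
passed_filter = 19_798_966
print(passed_filter - (read1_after + read2_after))  # prints 0

So within this excerpt the per-read counts and the pass count agree with each other; as in the JSON report above, reads removed as duplicates do not appear as a separate reason in the Filtering result list.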

I am also seeing a discrepancy in those results. I do have the --dedup parameter when I run fastp, but if duplicates are being removed, the final results should reflect that.

I have the same thing. I noticed that about 10 to 20 % of my metagenome reads are removed after deduplication. From the report it is unclear that those reads were removed, although a percentage is given in the screen output of fastp.

When I ran fastp with --dedup again on the cleaned dataset, the duplication rate dropped to almost zero.

It would be really great if one line were added to the report:
"reads removed because duplicated"