HuffordLab/NAM-genomes

About pacbio's raw reads

YilinZhang449 opened this issue · 8 comments

Hello,
I'm a little confused: did the PacBio sequencing here generate CLR data or CCS data? How many gigabases of sequencing data were generated per sample on average?

Thank you!

Hello @YilinZhang449,

We used PacBio reads. Specifically, CLR data (subreads) were used. Please see the PacBio_stats_v2.1_NAM_genomes.xlsx spreadsheet for details regarding coverage.

Thanks,

Hello,

What does "processed" mean for the subreads? It seems that all the PacBio CLRs passed the SequelTools QC step.

Thank you

Tian

Also, if you're referring to the spreadsheet headers, they are simplified terms to help users understand what they are looking at: Polymerase Reads (raw) vs. Subreads (processed). As mentioned in the methods (supplementary), no processing was performed on the subreads other than the error correction itself.

Thanks,

I downloaded the B73 CLRs (ERR3288278, ERR3288281, ERR3288284, ERR3288287, ERR3288290, ERR3288293, ERR3288279, ERR3288282, ERR3288285, ERR3288288, ERR3288291, ERR3288294, ERR3288280, ERR3288283, ERR3288286, ERR3288289, ERR3288292, ERR3288295), but the total came to 195,282,635,506 bases.
Is that right? Have these reads been through quality control, and if so, how?

Hi @ttian627:

Sorry, I mislabelled them. The "Raw" column is the subreads, and the "Processed" column is the Falcon error-corrected reads. If you look at the coverage stats, they use the subreads (labelled as "Raw") for the calculations. This is true for all the tabs.
Falcon error correction was performed per the instructions provided here.

Thanks,
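
For anyone who wants to sanity-check a coverage figure from a subread base total, here is a minimal sketch of the arithmetic. The function name `fold_coverage` and the genome size used below are assumptions for illustration only; the spreadsheet may use a different genome size, so the result should not be expected to match it exactly.

```python
def fold_coverage(total_bases, genome_size_bp):
    """Fold coverage = sequenced bases divided by genome size in base pairs."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return total_bases / genome_size_bp


# Total bases reported for the downloaded B73 runs in this thread.
B73_DOWNLOADED_BASES = 195_282_635_506

# ASSUMPTION: a rough, illustrative maize genome size; not taken from the spreadsheet.
ASSUMED_GENOME_SIZE_BP = 2_300_000_000

print(f"{fold_coverage(B73_DOWNLOADED_BASES, ASSUMED_GENOME_SIZE_BP):.1f}x")
```

Swapping in the subread total and genome size from the relevant spreadsheet tab should give a number comparable to the coverage column there.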

When I sum the "Raw" reads (mislabelled; they are actually subreads), I get numbers pretty close to yours.
It could be that some filtering was enabled during the download, which may have trimmed some bases. You can calculate the number of bases in each file and compare it to the table in the Excel sheet, but I wouldn't worry about a small difference in the total count.
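
One way to do the per-file comparison is sketched below. It assumes the downloads are FASTQ files (plain or gzipped) with exactly four lines per record; if the files are BAM or wrapped FASTQ, a different reader would be needed. The function names `count_bases_fastq` and `summarize` are made up for this example.

```python
import gzip
from pathlib import Path


def count_bases_fastq(path):
    """Return (n_reads, n_bases) for a 4-line-per-record FASTQ file, plain or gzipped."""
    opener = gzip.open if str(path).endswith(".gz") else open
    n_reads = 0
    n_bases = 0
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # the second line of each record holds the sequence
                n_reads += 1
                n_bases += len(line.strip())
    return n_reads, n_bases


def summarize(paths):
    """Map each file name to its base count and also return the grand total."""
    per_file = {Path(p).name: count_bases_fastq(p)[1] for p in paths}
    return per_file, sum(per_file.values())
```

Calling `summarize` on the list of downloaded files (for example, hypothetical names such as `ERR3288278.fastq.gz`) gives one base count per run to check against the corresponding row in the sheet, plus the overall total.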