Comment on % PacBio sequences corrected?

Question

Closed this issue 8 years ago · 3 comments

Hi,

I know this will depend on dataset quality, but I will ask anyway. You compared quite extensively with the tool LoRDEC. In our (brief) experience LoRDEC is very quick and useful, but it only keeps around 25-30% of the PacBio reads. In addition, many of the corrected PacBio reads are heavily truncated, and short PacBio reads are not useful for assembling complex plant genomes.

Maybe I missed it, but are there any stats on how your tool compares to LoRDEC for some of your test datasets? E.g. % of PacBio reads corrected, % excluded, % of initial read length corrected, etc.

Thanks,
Colin

Answer 1 · 2016-08-05T13:39:55.000Z

Hey Colin,

We've actually had a different experience with LoRDEC thus far: it seems to output almost all of the reads, as you'll see in the statistics below. Here is a table of the number of PacBio reads in the output for each of our datasets:

Dataset	Raw	LoRDEC	FMLRC
E. coli k12	82783	82663	82783
S. cerevisiae W303	216806	216490	216806
P. falciparum 3d7	242715	242711	242715
A. thaliana Ler-0	3758273	3758272	3758273

FMLRC will always output every input read, and LoRDEC outputs almost all of the reads for these datasets.
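To put the table in the "% of reads kept" terms the question asked about, here is a minimal Python sketch that just divides each tool's output count by the raw count. The numbers are copied from the table above; the names `read_counts` and `percent_retained` are made up for illustration and are not part of either tool. Note this only measures reads present in the output, not reads that were actually corrected.

```python
# Read counts copied from the table above: (raw, LoRDEC, FMLRC).
read_counts = {
    "E. coli k12": (82783, 82663, 82783),
    "S. cerevisiae W303": (216806, 216490, 216806),
    "P. falciparum 3d7": (242715, 242711, 242715),
    "A. thaliana Ler-0": (3758273, 3758272, 3758273),
}


def percent_retained(raw, output):
    """Percentage of raw reads that appear in a tool's output."""
    return 100.0 * output / raw


# Per-dataset retention for each tool (illustrative helper structure).
retained = {
    name: {
        "LoRDEC": percent_retained(raw, lordec),
        "FMLRC": percent_retained(raw, fmlrc),
    }
    for name, (raw, lordec, fmlrc) in read_counts.items()
}

for name, pct in retained.items():
    print(f"{name}: LoRDEC {pct['LoRDEC']:.3f}%, FMLRC {pct['FMLRC']:.3f}%")
```

On these four datasets both tools keep well over 99% of the reads, which is a very different picture from the 25-30% reported in the question.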

We don't have any direct statistics on the % of reads corrected or the % of initial read length corrected, but we do have N50s of the PacBio reads for three of the datasets:

Dataset	Raw	LoRDEC	FMLRC
E. coli k12	7566	7285	7064
S. cerevisiae W303	8975	8792	8689
A. thaliana Ler-0	7205	7050	6867
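For anyone who wants the same kind of number for their own reads, here is a rough Python sketch of a read-length N50 using the usual definition (the length L such that reads of length at least L contain at least half of all bases). This is an assumption about how such a statistic is typically computed, not the script used for the table above, and the function names and the file name in the comment are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of read lengths (0 if empty)."""
    lengths = sorted(lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Stop at the first read where the cumulative sum reaches half the bases.
        if 2 * running >= total:
            return length
    return 0


def fasta_read_lengths(lines):
    """Yield the sequence length of each FASTA record from an iterable of lines."""
    length = None  # None until the first header line is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif length is not None:
            length += len(line)
    if length is not None:
        yield length


# Example usage (hypothetical file name):
# with open("corrected_reads.fa") as handle:
#     print(n50(fasta_read_lengths(handle)))
```

Running this on the raw and corrected FASTA files and comparing the two values gives a quick check on how much read length a correction tool is trimming away.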

The P. falciparum 3d7 dataset is excluded because it is incredibly noisy, and we weren't able to assemble anything from its PacBio reads (raw or corrected with any method). While I don't have exact statistics for you, I recall from preliminary tests that LoRDEC and FMLRC were each able to make corrections to fewer than 40% of the PacBio reads for this dataset (I believe it was actually much lower than that, but I don't want to misreport those stats). Despite this, LoRDEC outputs almost all of the reads and FMLRC always outputs all of them, so a read appearing in the output is no guarantee that either tool was able to correct it. This is especially true when there is a large amount of noise, as in our P. falciparum dataset.

Does this help answer your questions?
Matt

Answer 2 · 2016-08-05T13:52:09.000Z

Hi Matt,

Thanks, that's a great answer.

I find this quite surprising, but I guess it's down to the massive difference in repeat content between bacteria, small repeat-poor plants like Arabidopsis, and the large, highly repeat-rich plant genomes we are working on.

I am not surprised P. falciparum was noisy, as the GC content is incredibly low.

I will see if I can find time and server resources to try FMLRC on a subset of PacBio reads and report back.

Thanks,
Colin

Answer 3 · 2016-09-13T03:37:16.000Z

Closing due to inactivity.