holtjma/fmlrc

Comment on % PacBio sequences corrected?

Closed this issue · 3 comments

Hi,

I know this will depend on dataset quality, but I will ask anyway. You compared quite extensively with the tool LoRDEC. In our (brief) experience LoRDEC is very quick and useful, but only keeps around 25-30% of the PacBio reads. In addition, many of the PacBio reads are highly truncated. Short PacBio reads are then not useful for assembly of complex plant genomes.

Maybe I missed it, but are there any stats on how your tool compares to LoRDEC for some of your test datasets, i.e. % of PacBio reads corrected, % excluded, % of initial read length corrected, etc.?

Thanks,
Colin

Hey Colin,

We've actually had a different experience with LoRDEC thus far: it seems to output almost all of the reads, as you'll see in the statistics below. Here is a table of the number of PacBio reads in the output for each of our datasets:

| Dataset | Raw | LoRDEC | FMLRC |
| --- | --- | --- | --- |
| E. coli k12 | 82783 | 82663 | 82783 |
| S. cerevisiae W303 | 216806 | 216490 | 216806 |
| P. falciparum 3d7 | 242715 | 242711 | 242715 |
| A. thaliana Ler-0 | 3758273 | 3758272 | 3758273 |

FMLRC will always output a read, and LoRDEC almost always outputs the reads for these datasets.
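For anyone who wants the read counts above as percentages, here is a minimal sketch. The counts are copied from the table. The names `READ_COUNTS` and `retained_percent` are made up for illustration and are not part of FMLRC or LoRDEC.

```python
# Read counts copied from the table above: (raw, LoRDEC output, FMLRC output).
# READ_COUNTS and retained_percent are hypothetical names used only for this sketch.
READ_COUNTS = {
    "E. coli k12": (82783, 82663, 82783),
    "S. cerevisiae W303": (216806, 216490, 216806),
    "P. falciparum 3d7": (242715, 242711, 242715),
    "A. thaliana Ler-0": (3758273, 3758272, 3758273),
}


def retained_percent(raw: int, output: int) -> float:
    """Return the percentage of raw reads that are present in a tool's output."""
    if raw <= 0:
        raise ValueError("raw read count must be positive")
    return 100.0 * output / raw


if __name__ == "__main__":
    for dataset, (raw, lordec, fmlrc) in READ_COUNTS.items():
        print(
            f"{dataset}: LoRDEC {retained_percent(raw, lordec):.3f}%, "
            f"FMLRC {retained_percent(raw, fmlrc):.3f}%"
        )
```

Note that this only measures how many reads appear in the output, not how many of them were actually modified by the corrector.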

We don't have any direct statistics on the % of reads corrected or the % of initial read length corrected, but we do have N50s of the PacBio reads for three of the datasets:

| Dataset | Raw | LoRDEC | FMLRC |
| --- | --- | --- | --- |
| E. coli k12 | 7566 | 7285 | 7064 |
| S. cerevisiae W303 | 8975 | 8792 | 8689 |
| A. thaliana Ler-0 | 7205 | 7050 | 6867 |
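If you want to compute the same kind of read N50 on your own raw and corrected files, a rough sketch follows. It uses the usual definition of N50 (the largest length L such that reads of length at least L cover at least half of the total bases). It is not necessarily the exact script used for the numbers above. The function names `fasta_read_lengths` and `n50` are illustrative, and the parser assumes plain FASTA input.

```python
from typing import Iterable, List


def fasta_read_lengths(lines: Iterable[str]) -> List[int]:
    """Collect per-read sequence lengths from FASTA-formatted lines."""
    lengths: List[int] = []
    current = None  # length of the read being parsed, None before the first header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous read, if there was one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of read lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one read length")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first read where the cumulative sum reaches half of all bases.
        if 2 * running >= total:
            return length
    return ordered[-1]  # not reached for non-negative lengths; kept for completeness
```

Usage would look like `n50(fasta_read_lengths(open("reads.fa")))`, where `reads.fa` is a placeholder file name. Running it on the raw reads and on each corrector's output gives a quick check of how much truncation a corrector introduces.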

The P. falciparum 3d7 dataset is excluded because the dataset is incredibly noisy, and we weren't able to assemble anything from its PacBio reads (raw or corrected with any method). While I don't have exact statistics for you, I recall from preliminary tests that LoRDEC and FMLRC were each able to make corrections to less than 40% of the PacBio reads for this dataset (I believe it was actually much lower than that, but I don't want to misreport on those stats). Despite this, LoRDEC is outputting almost all of the reads, and FMLRC always outputs all reads, but there is no guarantee that either tool was able to correct all of the outputted reads. This is especially true when there is a large amount of noise as in our P. falciparum dataset.

Does this help answer your questions?
Matt

Hi Matt,

Thanks, that's a great answer.

I find this quite surprising, but I guess it's down to the massive difference in repeat content between bacteria, small repeat-poor plants like Arabidopsis, and the large, highly repeat-rich plant genomes we are working on.

I am not surprised P. falciparum was noisy, as the GC content is incredibly low.

I will see if I can find time and server resources to try FMLRC on a subset of PacBio reads and report back.

Thanks,
Colin

closing due to inactivity