Comment on % PacBio sequences corrected?
Hi,
I know this will depend on the dataset quality, but I will ask anyway. You compared quite extensively with the tool LoRDEC. In our (brief) experience LoRDEC is very quick and useful, but only keeps around 25-30% of the PacBio reads. In addition, many of the PacBio reads it does keep are highly truncated. Short PacBio reads are then not useful for assembly of complex plant genomes.
Maybe I missed it, but are there any stats on how your tool compares to LoRDEC for some of your test datasets? For example: % of PacBio reads corrected, % excluded, % of initial read length corrected, etc.
Thanks,
Colin
Hey Colin,
We've actually had a different experience with LoRDEC thus far: it seems to output almost all of the reads, as you'll see in the statistics below. Here is a table of the number of PacBio reads in the output for each of our datasets:
Dataset | Raw | LoRDEC | FMLRC |
---|---|---|---|
E. coli k12 | 82783 | 82663 | 82783 |
S. cerevisiae W303 | 216806 | 216490 | 216806 |
P. falciparum 3d7 | 242715 | 242711 | 242715 |
A. thaliana Ler-0 | 3758273 | 3758272 | 3758273 |
FMLRC will always output a read, and LoRDEC almost always outputs the reads for these datasets.
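As a rough check on the counts above, the share of reads each tool keeps can be computed directly from the table. This is a minimal Python sketch; the `counts` dictionary and the `retained_pct` helper are illustrative names and not part of either tool.

```python
# Read counts copied from the table above: (raw, LoRDEC, FMLRC).
counts = {
    "E. coli k12": (82783, 82663, 82783),
    "S. cerevisiae W303": (216806, 216490, 216806),
    "P. falciparum 3d7": (242715, 242711, 242715),
    "A. thaliana Ler-0": (3758273, 3758272, 3758273),
}


def retained_pct(raw, output):
    """Percentage of raw reads that appear in a tool's output."""
    return 100.0 * output / raw


for dataset, (raw, lordec, fmlrc) in counts.items():
    print(f"{dataset}: LoRDEC {retained_pct(raw, lordec):.2f}%, "
          f"FMLRC {retained_pct(raw, fmlrc):.2f}%")
```

Note that this only measures how many reads are dropped. A read that appears in the output has not necessarily been corrected.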
We don't have any direct statistics on the % of reads corrected or the % of initial read length corrected, but we do have read-length N50s of the PacBio reads for three of the datasets:
Dataset | Raw | LoRDEC | FMLRC |
---|---|---|---|
E. coli k12 | 7566 | 7285 | 7064 |
S. cerevisiae W303 | 8975 | 8792 | 8689 |
A. thaliana Ler-0 | 7205 | 7050 | 6867 |
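For anyone who wants to reproduce this kind of comparison on their own data, a read-length N50 can be computed as sketched below. This is a minimal Python sketch using the usual N50 definition, not the script used to produce the table; `read_lengths` and `n50` are hypothetical helper names, and the input is assumed to be plain FASTA.

```python
def read_lengths(fasta_lines):
    """Return the length of each sequence in an iterable of FASTA lines."""
    lengths = []
    current = None  # None until the first header line is seen
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Largest length L such that reads of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


# Tiny in-memory example: two reads of 8 and 2 bases.
example = [">r1", "ACGT", "ACGT", ">r2", "AC"]
print(n50(read_lengths(example)))
```

Passing an open file handle for the raw reads and for each tool's output to `read_lengths`, then comparing the resulting `n50` values, gives a rough view of how much read length is lost to truncation.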
The P. falciparum 3d7 dataset is excluded because it is incredibly noisy, and we weren't able to assemble anything from its PacBio reads (raw or corrected with any method). While I don't have exact statistics for you, I recall from preliminary tests that LoRDEC and FMLRC were each able to make corrections to less than 40% of the PacBio reads for this dataset (I believe it was actually much lower than that, but I don't want to misreport those stats). Despite this, LoRDEC outputs almost all of the reads and FMLRC always outputs all of them, so a read appearing in the output is no guarantee that either tool was able to correct it. This is especially true when there is a large amount of noise, as in our P. falciparum dataset.
Does this help answer your questions?
Matt
Hi Matt,
Thanks, that's a great answer.
I find this quite surprising, but I guess it's down to the massive difference in repeat content between bacteria, small repeat-poor plants like Arabidopsis, and the large, highly repeat-rich plant genomes we are working on.
I am not surprised P. falciparum was noisy, as the GC content is incredibly low.
I will see if I can find time and server resources to try FMLRC on a subset of PacBio reads and report back.
Thanks,
Colin
Closing due to inactivity.