muellan/metacache

long reads

xgnusr opened this issue · 4 comments

can the use of long genomic reads improve the accuracy ?

tnx !!

In theory, the longer the reads are, the more accurate the results should become. This is certainly what we have seen for Illumina reads of different lengths.

The most crucial setting is -winlen, the "window length" which has a default of 128. This has the effect that even reads shorter than 128bp can be mapped accurately and the speed is still ok even if reads are a couple of times longer than that.

For very long reads (1000s or 10,000s of bp) the whole pipeline will still work, but mapping speed will decrease. So it could be beneficial to set -winlen to a (much) larger value when building a database for use with such reads. This will not only increase the mapping speed but also decrease the database's memory footprint (on disk and in RAM). This could however lead to lower accuracy, especially if the error rate of the reads is very high.

Since longer reads have usually more errors than Illumina reads, I would probably set -winlen so that even the shortest reads are devided into multiple windows. Something along the lines of -winlen 512 if your shortest reads are 1000bp long.

If you had calibration reads with ground truth data you could use those to optimize for the best tradeoff between speed+memory consumption and accuracy. If you are just interested in accuracy and speed and memory consumption don't matter, just stick to the default settings.

I am interested in the precision for the strain level ....
What values would be the most appropriate?
my N50 is 3500- 4000 bp

thank you !!

If you want to distinguish reads on strain level then mapping accuracy is of course extremely important.

I would first try the default settings which promise the highest accuracy and see if you can live with the mapping speed and memory consumption. As an added bonus you can process shorter reads with the same database without having to change anything.

If however, you plan to process many more such datasets in the future, it might be worth optimizing the speed.
I would try -winlen 1024 and look at the fraction of unmapped reads and reads mapped to the lower taxonomic levels as compared to the default setting of 128. In case they are roughly the same I would double the -winlen and continue until the results get worse. If 1024 is already too bad when compared to 128 I would try 512 and 256.

Just to give a little update on this issue. I've recently processed Oxford Nanopore datasets with varying read lengths of 100 up to 21000. The N50 was relatively low (in the 500s). Turned out that MetaCache's default settings yielded the best results.