jtamames/SqueezeMeta

Contradictory results after sequential and coassembly modes comparison

Closed this issue · 2 comments

Dear SqueezeMeta developers,

I'm back with a question that is not about a technical problem with SQM. For context, the results I got from the sequential and coassembly modes were similar but differed in some respects. In more detail: I ran coassembly mode on three similar samples, and I ran sequential mode on those same three samples plus two other, different samples. Comparing the KEGG pathways obtained from both modes, I noticed inconsistencies in the presence/absence of KO functions. I guess this happens because in coassembly mode SQM assembles the reads from all the metagenomes together and then assigns a functional profile to the resulting sequences, whereas in sequential mode SQM uses only the reads from each dataset independently. Is that right?

In the attached pictures you can see an example of this (one with the three samples used for coassembly mode, and the other with the five samples from sequential mode). The green, red and yellow colors correspond to the same samples in both pictures.

So, which results are the most reliable?

[Attached image: ko00910 pathview multi]
[Attached image: ko00910 pathview multi_Arroz]

Indeed. With coassembly, a single assembly is generated by pooling the reads from all the samples; that assembly is annotated, and then the reads from each sample are mapped back to it to estimate the abundance of features (contigs, ORFs, functions, taxa...) in each sample. In sequential mode, by contrast, assembly, annotation and mapping are done independently for each metagenome.

The primary difference is that a coassembly has a much greater sequencing depth, since you are pooling the reads from all the samples. For example, an organism that is at low abundance in your samples may not get assembled from a single metagenome (there are not enough reads!) but may yield contigs in a coassembly (since many more reads are now available).
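To make the depth argument concrete, here is a small back-of-the-envelope sketch in Python. Every number in it (genome size, read counts, and especially the `MIN_COVERAGE` cutoff) is a made-up assumption for illustration; real assemblers do not apply one fixed coverage threshold, and none of these values come from the samples in this issue.

```python
# Toy illustration: all numbers are hypothetical, not taken from this issue.

def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Expected mean coverage of a genome from the reads that originate from it."""
    return n_reads * read_length / genome_size

GENOME_SIZE = 4_000_000   # bp, hypothetical low-abundance organism
READ_LENGTH = 150         # bp
MIN_COVERAGE = 5.0        # assumed rough cutoff below which assembly tends to fail

# Number of reads coming from this organism in each sample (hypothetical).
reads_per_sample = {"green": 60_000, "red": 50_000, "yellow": 70_000}

# Sequential mode: each sample is assembled on its own.
per_sample = {
    sample: mean_coverage(n, READ_LENGTH, GENOME_SIZE)
    for sample, n in reads_per_sample.items()
}

# Coassembly mode: reads from all samples are pooled before assembly.
pooled = mean_coverage(sum(reads_per_sample.values()), READ_LENGTH, GENOME_SIZE)

for sample, cov in per_sample.items():
    status = "likely assembled" if cov >= MIN_COVERAGE else "likely missed"
    print(f"{sample}: {cov:.2f}x ({status})")
status = "likely assembled" if pooled >= MIN_COVERAGE else "likely missed"
print(f"pooled: {pooled:.2f}x ({status})")
```

With these toy numbers, each sample alone stays under the assumed cutoff while the pooled reads exceed it, which is the situation where a function appears only in the coassembly.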

This is generally what you see here (from a quick glance, but correct me if I'm wrong): no function was retrieved in the individual assemblies for those three samples and missed in the coassembly. Some new functions are detected in the coassembly (which makes sense), such as Hao or 1.7.2.6, but you can see that they are not very abundant. 1.7.7.2 is interesting because it was not assembled in the individual assembly of the "yellow" sample, yet it actually had good abundance there (as shown in the coassembly picture). Maybe this has to do with some behaviour of the assembler (e.g. splitting that gene across two different contigs so it was not detected, or discarding it for some other reason...), but in general the results seem to make sense.
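If you want to run this kind of check systematically rather than by eye, a minimal Python sketch could look like the following. It assumes you have already loaded, for one sample, a function-to-abundance mapping from each mode; the `compare_modes` helper, the function names and the abundance values are all hypothetical and are not part of SqueezeMeta or its output format.

```python
# Hypothetical helper: compares which functions are present in each mode
# for a single sample. Not part of SqueezeMeta.

def compare_modes(sequential: dict[str, float],
                  coassembly: dict[str, float]) -> dict[str, set[str]]:
    """Split functions into shared, coassembly-only and sequential-only sets."""
    seq_present = {f for f, abund in sequential.items() if abund > 0}
    co_present = {f for f, abund in coassembly.items() if abund > 0}
    return {
        "both": seq_present & co_present,
        "coassembly_only": co_present - seq_present,
        "sequential_only": seq_present - co_present,
    }

# Made-up abundances for one sample; function names are placeholders.
yellow_sequential = {"func_A": 120.0, "func_B": 45.0, "func_C": 0.0}
yellow_coassembly = {"func_A": 115.0, "func_B": 50.0, "func_C": 38.0, "func_D": 2.0}

result = compare_modes(yellow_sequential, yellow_coassembly)
for category in ("both", "coassembly_only", "sequential_only"):
    print(category, sorted(result[category]))

# For coassembly-only functions, the abundance hints at the likely cause:
# low values point to insufficient depth in the single-sample assembly,
# high values to something assembler-specific.
for func in sorted(result["coassembly_only"]):
    print(func, yellow_coassembly[func])
```

An empty `sequential_only` set matches the expectation above that the coassembly recovers everything the individual assemblies do. A coassembly-only function with high abundance in a sample (like `func_C` in this toy data) is the kind of case worth a closer look.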

That's right! Thank you for your support.