Interpretation of results
michoug opened this issue · 3 comments
Hi,
I tried your tool on one of my datasets. I identified viral contigs with VIBRANT, then ran vRhyme and compared the results obtained after dereplication, before and after generating vMAGs.
Here are the results before generating vMAGs:
checkv_quality  n    mean     sum       max
Complete        557  46179.5  25721993  373392
High-quality    413  44008.8  18175622  275626
Here are the results after generating vMAGs:
checkv_quality  n    mean     sum       max
Complete        437  48556.5  21219180  373392
High-quality    514  46641.6  23973794  387939
In both cases I checked the quality with CheckV and kept only the best-quality "viruses" (Complete and High-quality).
Here, n is the number of sequences, mean is the mean sequence length, sum is the total length of all sequences, and max is the length of the largest "virus".
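For anyone who wants to reproduce this kind of summary, here is a minimal Python sketch. It assumes the input is CheckV's `quality_summary.tsv` with its `contig_length` and `checkv_quality` columns. The function name `summarize_by_quality` is hypothetical, and this is not necessarily how the tables above were produced.

```python
import csv
from collections import defaultdict


def summarize_by_quality(tsv_path, keep=("Complete", "High-quality")):
    """Return n, mean, sum and max of sequence length per CheckV quality tier.

    Assumes a tab-separated file with 'contig_length' and 'checkv_quality'
    columns, as in CheckV's quality_summary.tsv.
    """
    lengths = defaultdict(list)
    with open(tsv_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            tier = row["checkv_quality"]
            if tier in keep:
                lengths[tier].append(int(row["contig_length"]))

    summary = {}
    for tier, values in lengths.items():
        total = sum(values)
        summary[tier] = {
            "n": len(values),
            "mean": round(total / len(values), 1),
            "sum": total,
            "max": max(values),
        }
    return summary
```

Running it once on the CheckV output for the dereplicated contigs and once on the output for bins plus unbinned contigs should give two tables that can be compared directly.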
So the average length is higher but the "contamination" is also higher?
Any input on these results?
Best
Greg
Hi,
I have a couple questions before I can give more thoughts on this.
- Are you estimating higher contamination due to the drop in complete genomes after binning, assuming the drop is due to complete genomes being incorrectly binned with other scaffolds?
- Are you running CheckV after binning on both the bins and unbinned contigs?
- Is this an aggregation of data from multiple samples binned, or binning multiple samples combined at one time? 557 complete genomes is a lot to get from one sample.
Hi,
Out of the 557 "complete" contigs, 137 were clustered with others in vRhyme.
Yes, I'm running CheckV after binning on both the bins and unbinned contigs.
And yes, it's an aggregation of data from multiple binned samples.
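To make the shift between the two tables explicit, here is a small Python sketch of the arithmetic. It uses only the counts posted in this thread, and the variable names are just for illustration. The last figure assumes that none of the unbinned Complete contigs changed quality tier between the two CheckV runs, which is an assumption, not something that was checked.

```python
# Counts and total lengths copied from the two tables in this thread.
before = {
    "Complete": {"n": 557, "sum": 25721993},
    "High-quality": {"n": 413, "sum": 18175622},
}
after = {
    "Complete": {"n": 437, "sum": 21219180},
    "High-quality": {"n": 514, "sum": 23973794},
}

# Per-tier change (after minus before) in sequence count and total length.
delta = {
    tier: {key: after[tier][key] - before[tier][key] for key in ("n", "sum")}
    for tier in before
}

# Combined change across both tiers.
total_delta = {
    key: sum(delta[tier][key] for tier in delta) for key in ("n", "sum")
}

# 137 of the 557 Complete contigs were binned with other sequences.
binned_complete = 137
unbinned_complete = before["Complete"]["n"] - binned_complete

# Complete entries after binning that are not accounted for by unbinned
# Complete contigs (assumes those contigs kept their Complete rating).
complete_from_elsewhere = after["Complete"]["n"] - unbinned_complete

print(delta)
print(total_delta)
print(unbinned_complete, complete_from_elsewhere)
```

With these numbers, Complete drops by 120 sequences while High-quality gains 101, and the combined total length grows by about 1.3 Mb. Under the stated assumption, 420 Complete contigs stayed unbinned and 17 of the Complete entries after binning would come from bins.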
Greg
Are these complex virome samples? I'm currently working on an updated v1.1.0 that should address some of these issues. In addition to updates that improve precision, I implemented a step to remove complete (circular) sequences before binning. The update should be available within the next 1-2 weeks.