broadinstitute/StrainGE

Can we run on non-complete genomes?

Closed this issue · 1 comments

Hello, I followed the tutorial for StrainGE.

Is it possible to run it on non-complete genomes or even MAGs? I think the mapping should be OK; however, at the analysis step in Python I run into problems because the genome has multiple contigs.

Can I simply aggregate the values per contig, e.g. sum the identical and total nucleotides?

In principle, it should work with drafty assemblies. In our experience, however, plasmids make everything harder, and in drafty assemblies it's harder to detect which contigs belong to a plasmid. Plasmids may throw off StrainGST, e.g., by causing it to report a reference only because a good chunk of a plasmid that happened to be present in that reference was detected. They may also throw off ANI estimates with StrainGR, because plasmids tend to be more diverse, which makes it harder to define a "same strain" threshold. Even with this in mind, you could aggregate the key metrics you're interested in across contigs by taking a weighted average (weighted by scaffold length).
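As a rough illustration of the length-weighted average described above, here is a minimal sketch in plain Python. The record layout and the names `contigs`, `length`, `metric`, and `weighted_average` are hypothetical and are not taken from StrainGE's output format; you would need to fill them in from whatever per-contig table you get out of your own analysis.

```python
def weighted_average(values, weights):
    """Return the average of `values`, weighted by `weights`."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("total weight is zero; nothing to average")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical per-contig records: the field names and numbers are
# made up for illustration and are NOT StrainGE column names.
contigs = [
    {"name": "contig_1", "length": 4_000_000, "metric": 0.9990},
    {"name": "contig_2", "length": 1_000_000, "metric": 0.9950},
]

# Genome-wide value: each contig contributes in proportion to its length.
genome_metric = weighted_average(
    [c["metric"] for c in contigs],
    [c["length"] for c in contigs],
)
print(f"{genome_metric:.4f}")
```

Note that summing the identical and total nucleotides across contigs and then dividing is the same as weighting each contig's ratio by its own total, which can differ from weighting by scaffold length if only part of a contig is covered. If you can identify likely plasmid contigs, you could also drop them from `contigs` before averaging.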