oschwengers/referenceseeker

[Question] Using ReferenceSeeker to cluster draft fungal genomes

Rob-murphys opened this issue · 10 comments

I am wanting to use ReferenceSeeker to cluster some draft reference genomes of a poorly defined and highly divergent genus into putative species.

To do this I have generated a database will all known genomes of this genus and then run all said genomes through referenceseeker.

Given the are draft genomes would it be best to take a subset of only homologous region. For example using BUSCO to identify common genes and then only using these genes in ReferenceSeeker?

Or would using a cutoff of Con. DNA effectively do a similar thing? My worry would be perhaps the closest genomes has is a bad draft genome so it lost through the Con. DNA cut off and using the above mentioned homologous region only approach would remedy this?

Secondly, given they are draft genomes, would this alter my interpretation of the output?

I want to build a network and I was planning on using Mash distance as the edges to this network where Con. DNA is above a certain threshold. I am unsure how ANI could be incorporated into this.

Lastly, I see many of the genomes are not returning themselves as the top match in the database. Would this be because I am using --unfiltered?

Does this seem valid?

Regarding your 1. question:
You can do this without restricting your genomes to homologous regions. Actually, the conserved DNA value does express the fraction of mutually common genome regions. The initial introduction of ANI values correlated DNA-DNA-hybridization values to ~95% ANI under the restriction that the conserved DNA is >= 69%. Subsequent broader genome clusterings proved a species boundary of 95% ANI (+- 2% dependent on the species). However, these values are valid for bacteria. Unfortunately, I do not have any experience or literature regarding the species boundaries for fungi.

Your 2. question:
To calculate ANI values, the input genome is split into DNA subsequences of ~ 1 kbp. Hence, draft assemblies should in theory not have a larger impact on ANI and conserved DNA values.

Last question: this indeed is unexpected. I have to look into this in more detail to answer your question appropriately.

@Lamm-a gentle ping
Is this still active? Otherwise, I'd close this for now.

@oschwengers I think it is safe to close as without knowing the species boundaries for fungi I guess I can't do what I intended to do with anything other than mash distances. I still think I will try with ANI and see what we get. However the divergence you brought up in my other post presents an issue for the conserved DNA values I think. However ifI am understanding the two are quite intertwined so we are unsure in using ANI is valid for this.

I think ANI values are in general a much more thorough comparison metric than Mash distances due to the fact, that by using Mash distances you normally only compare a tiny fraction of the genome. In contrast to this, by using ANI and conserved DNA metrics, you compare the entirety of the common, i.e. alignable, fraction of a genome pair. Hence, even though one might not know the particular ANI thresholds for fungi, I nevertheless think ANI and conserved DNA are the much better metrics for this sort of comparison. There are a couple of papers using ANI for taxonomic analysis in fungi, however there are rather limited due to the low number of available genomes.

As an alternative you could also look for percentage of conserved protein (POCP).
Best regards!

I though mash distance was kmers across the whole genome to draw and compare sketches? I will defiantly look more into using ANI and conserved DNA metrics though.

Given that I am clustering I don't think knowing the thresholds is much of an issue to be honesty.

Is there a way to make reference seeker output ALL mash ANI and Conserved DNA values for every genome in a dataset?

Not by default. Mash only takes into account s kmers of a given length k, both can be specified as parameters with default values set to 1,000 and 21, respectively. So by sticking to these default parameters, you only take into account ~21,000 bp (1,000 * 21) which is a fairly small fraction of a bacterial or fungi genome.
So if you use Mash, make sure to adjust these values. In contrast, ANI takes into account all common sequences of a give genome pair.

Ahh okay I didn't know that thank you! I will use both and compare the clustering :) But how can I ensure that Referenceseeker output all ANI values for all genomes in a database?

It's not possible to force RS to calculate the input genome to all DB genomes but you can use --unfiltered to deactivate all species related thresholds (ANI=95%, conDNA=69%) and to set the Mash prefilter to 0.3 which is very conservative.

You can also specifically set many filter threshold. Please just have a look at referenceseeker --help

I am not able to see where I can change mash prefilter in referenceseeker --help. Also I assume using --unfiltered means there is no point using the --ani and --conserved-dna as they are already being ignored because of --unfiltered?

yes you're right it's either or. So you either use --unfiltered or you set --ani or --conserved-dna values.