oschwengers/referenceseeker

Seemingly successful ReferenceSeeker run with no results.

Closed this issue · 8 comments

My reference seeker runs are seeming running fine as I get no error reporting but I also get no results printed to STDOUT apart from the headers:

#ID Mash Distance ANI Con. DNA Taxonomy ID Assembly Status Organism

Input:

referenceseeker --threads 5 $database $genome

This might be because there is no sufficiently-closely related reference genome available in the selected DB.

Please try executing ReferenceSeeker with the --unfiltered option. This will take a bit longer, but skips species specific thresholds for ANI and cDNA and additionally relaxes the Mash distance prefilter:
referenceseeker --unfiltered --verbose --threads 5 $database $genome

The database is constructed of same genus assemblies and each assembly also exists in the database as I am trying to cluster the set of assemblies into putative species. I will try the --unfiltered option, thanks.

This has worked. Would you advise I use EMBL over fasta for eukaryotic as ANI is meant to be based off coding regions?

I'm note quite sure if I got you right. ANI/cDNA values should not be biased towards coding regions and thus, they work perfectly fine on assembled sequences alone. So, the assembled sequences should work just fine.

However, the fact that assembled genomes (which are also stored within the database) do not appear in the results is rather odd. Could you maybe provide me one of these assemblies to have a deeper look on those?

Ah okay I misunderstood. I thought ANI was based on coding regions only.

Here is one of the sequences:
IC0026.fasta

Here is the output generated:


ReferenceSeeker v1.7.3
Options, parameters and arguments:
        db path: /home/projects/ku_00014/data/termitomyces/referenceSeeker_db/termitomyces
        genome path: /home/projects/ku_00014/data/termitomyces/sequences/IC0026.fasta
        tmp path: /scratch/31997292/tmpphzmkwcx
        unfiltered: True
        bidirectional: False
        ANI: 0.95
        conserved DNA: 0.69
        # CRG: 100
        # threads: 5

Estimate genome distances...
        screened 14 potential reference genome(s)

Compute ANIs...

#ID     Mash Distance   ANI     Con. DNA        Taxonomy ID     Assembly Status Organism
D45_scaffolds   0.05935 93.54   39.68   12908   contig  NA
GCA_003313785   0.10015 90.87   18.39   12908   contig  NA
GCA_003313075   0.17109 86.16   0.57    12908   contig  NA
GCA_003316525   0.16443 86.19   0.55    12908   contig  NA
D43_scaffolds   0.21019 85.04   0.26    12908   contig  NA
D58_scaffolds   0.20120 84.40   0.18    12908   contig  NA
D46_scaffolds   0.17768 84.39   0.11    12908   contig  NA
D42_scaffolds   0.18855 84.75   0.11    12908   contig  NA
D36_scaffolds   0.19126 84.04   0.09    12908   contig  NA
D53_scaffolds   0.19753 84.83   0.03    12908   contig  NA
D12_T123_scaffolds      0.20537 83.96   0.03    12908   contig  NA
D52_scaffolds   0.20120 83.91   0.02    12908   contig  NA
D13_T114_scaffolds      0.20120 83.94   0.01    12908   contig  NA
D59_scaffolds   0.22285 83.50   0.01    12908   contig  NA

Seems like most of the hits are fairly distant species and not closely related. The best hit D45_scaffolds has a fairly high ANI but a rather low conserved DNA. I have not much experience with fungi, but I'd certainly not consider them as the same species.

Regarding the missing self hit of IC0026.fasta:
I've just run some tests with it setting up a new databse only containing this genome. Based on this data, I cannot reproduce your empty results:

$ referenceseeker_db init --db fungi-test
created database directory
created database genome information file (db.tsv)
created database genome kmer sketch file (db.msh)
Successfully initialized empty database at /home/oliver/tmp/rs-gh-issue-21/fungi-test

$ referenceseeker_db import --db fungi-test/ --genome IC0026.fasta --organism "Foo barus strain baz"
Successfully imported genome (Foo barus strain baz/IC0026.fasta) into database (/home/oliver/tmp/rs-gh-issue-21/fungi-test)

$ referenceseeker fungi-test/ IC0026.fasta 
#ID	Mash Distance	ANI	Con. DNA	Taxonomy ID	Assembly Status	Organism
scaffold00001_RagTag	0.00000	99.99	98.10	12908	contig	Foo barus strain baz

Intrestingly all these are currently considerd the same species, but there is mounting evidence they are crazy divergent. I will try remake the database and see if I can get same species matching.

I would imagine the conserved DNA could be quite drastically effected by assembly quallity? Some of them are not exactly great.

Assembly quality will of course have an impact to some extent.

In bacteria, coding density is quite high and there's a fairly good consensus on species limits (~95% ANI, 69% conserved DNA). Unfortunately, I don not know if there is a likewise acceptable threshold for fungi and if so, what that might look like.

I'll close this for now as I think that the core issue is resolved. Please, do not hesitate to re-open it if required.