Ecogenomics/GTDBTk

mash_db

lfenske-93 opened this issue · 4 comments

Hi,

I tried to use the new version of GTDB-Tk in my workflow but got a little stuck:

gtdbtk classify_wf: error: one of the arguments --skip_ani_screen --mash_db is required

Do I need to download a specific mash_db, and if so, where do I find it?

Greetings,
Linda

zwets commented

It will be written if it doesn't exist, and reused in subsequent calls.

I moved it to the (read-only in our setup) GTDBTK_DATA_PATH after first creation, as presumably it won't change unless the database changes.
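A minimal sketch of the write-once, reuse-later behaviour described above, assuming --mash_db takes the path of a Mash sketch (.msh) file. The sketch path is hypothetical, and the command is only printed so the snippet runs without GTDB-Tk installed:

```shell
# Hypothetical location for the Mash sketch; choose any path you can write to.
MASH_DB="mash_db/gtdb_ref_sketch.msh"

# On the first run GTDB-Tk is expected to create the sketch at $MASH_DB;
# later runs that pass the same path should reuse it instead of rebuilding.
CMD="gtdbtk classify_wf --genome_dir genomes/ --out_dir classify_wf_out --mash_db $MASH_DB"

# Dry run: print the command. Replace `echo "$CMD"` with `$CMD` to execute it.
echo "$CMD"
```

Once the sketch exists it can be moved to a read-only location, as long as later calls point --mash_db at the new path.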

@aaronmussig wouldn't it be better to add this as a standard step in the database installation procedure, and do away with the --mash_db option altogether? It is somewhat confusing.

Hello,

Although it would be possible to ship the mash_db as a precompiled .msh file and make it part of the standard GTDB-Tk database installation procedure, we opted against this solution for a few reasons:

  • We recommend using the latest version of Mash (currently 2.2.2), but we are not sure that a sketch file generated with 2.2.2 will be compatible with Mash 2.2.3. Sketch files may not be compatible across versions.
  • Different users may want different Mash databases (with a different k-mer size, number of non-redundant hashes, or maximum p-value to keep), so a generic sketch file may not fit everyone's use of the database.
  • Having one user generate the sketch file may cause read/write conflicts when the GTDB-Tk database is placed in a shared environment.
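One way to work within these constraints is to keep the sketch in a per-user directory and tag the file name with the local Mash version, so that a sketch built by one Mash release is never silently reused by another. This is only a sketch under assumptions: the directory and file name are hypothetical, and it relies on `mash --version` printing the version string.

```shell
# Ask the local Mash binary for its version; fall back to "unknown"
# when mash is not on PATH or the call fails.
MASH_VERSION="$(mash --version 2>/dev/null || echo unknown)"

# Hypothetical per-user directory, kept outside the shared GTDB-Tk
# reference data to avoid read/write conflicts between users.
# Create it first (mkdir -p "$MASH_DB_DIR") before running GTDB-Tk.
MASH_DB_DIR="${HOME}/gtdbtk_mash"

# Version-tagged sketch path to pass to gtdbtk via --mash_db.
MASH_DB="${MASH_DB_DIR}/gtdb_ref_sketch_mash-${MASH_VERSION}.msh"
echo "--mash_db ${MASH_DB}"
```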

Ideally there should be a default path, or this (breaking) change to the CLI should be highlighted on the command line.

It will be written if it doesn't exist, and reused in subsequent calls.

I was just reading the docs on --mash_db, and this behaviour is currently not very clear there. I had to search the issue tracker and find this issue in order to understand --mash_db. The example at https://ecogenomics.github.io/GTDBTk/commands/classify_wf.html does not include the required parameter (--mash_db or --skip_ani_screen):

gtdbtk classify_wf --genome_dir <my_genomes> --out_dir <output_dir>

... or later in those docs:

gtdbtk classify_wf --genome_dir genomes/ --out_dir classify_wf_out --cpus 3
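For comparison, here is how those examples might look with the required argument added. The helper `ani_screen_args` is hypothetical (it is not part of GTDB-Tk): it emits --mash_db when a sketch path is given and --skip_ani_screen otherwise. The commands are echoed as a dry run rather than executed.

```shell
# Hypothetical helper: choose one of the two mutually required arguments.
# Note: the result is expanded unquoted below, so the path must not contain spaces.
ani_screen_args() {
    if [ -n "$1" ]; then
        printf '%s %s' "--mash_db" "$1"
    else
        printf '%s' "--skip_ani_screen"
    fi
}

# Variant 1: run the ANI screen, writing/reusing a sketch at a hypothetical path.
echo gtdbtk classify_wf --genome_dir genomes/ --out_dir classify_wf_out --cpus 3 \
    $(ani_screen_args "mash_db/gtdb_ref_sketch.msh")

# Variant 2: no sketch path given, so the ANI screen is skipped.
echo gtdbtk classify_wf --genome_dir genomes/ --out_dir classify_wf_out --cpus 3 \
    $(ani_screen_args "")
```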