mash_db
lfenske-93 opened this issue · 4 comments
Hi,
I am trying to use the new version of GTDB-Tk in my workflow but got a bit stuck:
gtdbtk classify_wf: error: one of the arguments --skip_ani_screen --mash_db is required
Do I need to download a specific mash_db, and if so, where do I find it?
Greetings,
Linda
It will be written if it doesn't exist, and reused in subsequent calls.
I moved it into GTDBTK_DATA_PATH (read-only in our setup) after it was first created, since presumably it won't change unless the database changes.
@aaronmussig wouldn't it be better to make this a standard step in the database installation procedure and do away with the --mash_db option altogether? It is somewhat confusing.
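As a rough sketch of the behaviour described above: pointing --mash_db at one fixed location means the first run writes the sketch there and later runs reuse it. The path, the file name gtdb_ref_sketch.msh, the classify wrapper and the DRY_RUN switch are illustrative assumptions, not GTDB-Tk defaults; only the flags themselves appear in this thread. The wrapper prints the command by default and executes it only when DRY_RUN=0.

```shell
# Placeholder location for the Mash sketch; pick any writable path.
# (Assumption: the file name is made up for this example.)
MASH_DB="${MASH_DB:-$HOME/gtdbtk_mash/gtdb_ref_sketch.msh}"

# Hypothetical wrapper: $1 = genome directory, $2 = output directory.
classify() {
    # Rebuild the positional parameters as the full command line.
    set -- gtdbtk classify_wf --genome_dir "$1" --out_dir "$2" --mash_db "$MASH_DB"
    # Always show what would be run.
    echo "$*"
    # Execute only when explicitly requested with DRY_RUN=0.
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Prints the command; set DRY_RUN=0 to actually run it.
classify genomes/ classify_wf_out
```

Because every call passes the same --mash_db path, the sketch should be created once and then picked up again on later runs, as described in the comment above.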
Hello,
Although it is possible to ship the mash_db as a precompiled .msh file as part of the standard GTDB-Tk database installation procedure, we opted against this solution for a few reasons:
- We recommend using the latest version of Mash (currently 2.2.2), but we are not sure a sketch file generated with 2.2.2 will be compatible with Mash 2.2.3. There may be backward incompatibilities.
- Different users may want different Mash databases (with a different k-mer size, number of non-redundant hashes, or maximum p-value to keep), so a generic sketch file may not fit everyone's use of the database.
- Having one user generate the sketch file may cause read/write conflicts when the GTDB-Tk database is placed in a shared environment.
Ideally there should be a default path, or this (breaking) change to the CLI should be highlighted on the command line.
It will be written if it doesn't exist, and reused in subsequent calls.
I was just reading the docs on --mash_db, and they are currently not very clear. I had to search through the issues and find this one in order to understand --mash_db. The example at https://ecogenomics.github.io/GTDBTk/commands/classify_wf.html does not include the required parameter (--mash_db or --skip_ani_screen):
gtdbtk classify_wf --genome_dir <my_genomes> --out_dir <output_dir>
... or later in those docs:
gtdbtk classify_wf --genome_dir genomes/ --out_dir classify_wf_out --cpus 3
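For comparison, here is a minimal sketch of how the second docs example might look once one of the two required options is added. The helper name ani_screen_args and the sketch path are hypothetical; the commands are only printed, not executed.

```shell
# Hypothetical helper: emits exactly one of the two options that
# classify_wf requires, depending on whether a sketch path is given.
ani_screen_args() {
    if [ -n "${1:-}" ]; then
        printf '%s\n' "--mash_db $1"
    else
        printf '%s\n' "--skip_ani_screen"
    fi
}

# The docs example completed both ways (placeholder sketch path).
echo "gtdbtk classify_wf --genome_dir genomes/ --out_dir classify_wf_out --cpus 3 $(ani_screen_args /path/to/gtdb_ref_sketch.msh)"
echo "gtdbtk classify_wf --genome_dir genomes/ --out_dir classify_wf_out --cpus 3 $(ani_screen_args)"
```

Either printed command satisfies the "one of the arguments --skip_ani_screen --mash_db is required" check quoted at the top of this issue.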