leylabmpi/Struo2

Guidance on NCBI custom genomes download

NienkeMekkes opened this issue · 4 comments

Dear authors,

Thank you for developing this tool. I have the following question:

I want to build a database from NCBI genomes. For example, I want to download the RefSeq genomes for 100 different species and then use Struo2 to build a Kraken2 database. What exactly is the expected input for the genome_download.R script? A table with a single column listing the NCBI assembly accessions for my 100 species of interest?

Is there a reason you want to use RefSeq instead of GTDB? Most of the genomes in RefSeq should be in GTDB, and GTDB provides a lot of metadata for each genome assembly (e.g., CheckM quality estimates and taxonomies inferred by many approaches).

Thank you for your fast reply. The main reason is that I have no experience working with GTDB and a lot more experience working with RefSeq downloads. But I see your point with regard to metadata! With GTDB, is it also possible to use my list of 100 species of interest to easily extract the genomes of these species?

Yes! The GTDB metadata includes species, genus, and other rank information for many taxonomies, so you can easily select genomes based on taxonomy.
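
As a rough illustration of that selection step, here is a minimal Python sketch. It is not part of Struo2. It assumes the metadata is a tab-separated table with an `accession` column and a `gtdb_taxonomy` column holding a semicolon-separated lineage ending in an `s__` species entry; check the column names against the GTDB release you download. The function names, the accessions, and the abbreviated lineages in the demo table are made up for the example.

```python
import csv
import io


def species_from_taxonomy(taxonomy):
    """Return the species name from a GTDB-style lineage string, or None."""
    for rank in taxonomy.split(";"):
        rank = rank.strip()
        if rank.startswith("s__"):
            # An empty "s__" entry means no species is assigned.
            return rank[3:] or None
    return None


def select_genomes(handle, wanted_species, taxonomy_col="gtdb_taxonomy"):
    """Return the metadata rows whose species is in wanted_species.

    handle is any open text file (or file-like object) containing a
    tab-separated table with a header line. The default column name is
    an assumption; adjust it to match your metadata file.
    """
    wanted = set(wanted_species)
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row for row in reader
        if species_from_taxonomy(row[taxonomy_col]) in wanted
    ]


# Tiny in-memory stand-in for a metadata table (made-up accessions,
# abbreviated lineages), used only to show the call.
demo_rows = [
    ["accession", "gtdb_taxonomy"],
    ["GB_GCA_000000001.1", "d__Bacteria;g__Escherichia;s__Escherichia coli"],
    ["RS_GCF_000000002.1", "d__Bacteria;g__Bacillus;s__Bacillus subtilis"],
    ["RS_GCF_000000003.1", "d__Bacteria;g__Escherichia;s__Escherichia coli"],
]
demo_tsv = "\n".join("\t".join(fields) for fields in demo_rows) + "\n"

selected = select_genomes(io.StringIO(demo_tsv), ["Escherichia coli"])
accessions = [row["accession"] for row in selected]
print(accessions)
```

For a real file, you would pass an open file handle instead of the `io.StringIO` object and supply your own list of 100 species names, spelled the way they appear in the taxonomy column you filter on.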

Perfect, thanks for the lightning-fast replies. I'm closing the issue for now.