Kraken(2) database based on the GTDB project
This page contains links, notes and scripts for the Kraken databases we produced from the GTDB database (http://gtdb.ecogenomic.org/). The Kraken database contains one genome per taxon recognized in the GTDB database. Because the GTDB taxonomy is based on genomic data, we expect this database to improve the performance of taxonomy-based classification algorithms.
Notes:
- Based on GTDB Release 03-RS86 (19th August 2018)
- Not the complete GTDB database: taxa without a genus designation have been omitted, leaving 10,994 of the 11,042 taxa in this version
- Reference sequences have been selected ad hoc, without regard for sequence quality. This will be addressed in future versions
gtdbk2_bacterial_v1.tar.gz md5sum: eda7855cb38a14b4222381ac5b27fe4b
https://drive.google.com/file/d/18E0W_ezNLAhxxZjjelQYwoLA_0oBm4Lo/view?usp=sharing
To download the database from your command line, run: sh scripts/google_download.sh
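After downloading, the archive can be compared against the md5sum listed above before it is unpacked. The following is a minimal sketch rather than part of the repository: the verify_md5 helper is hypothetical, and it assumes that GNU md5sum is installed and that the archive was saved in the current directory under the file name shown above.

```shell
# Hypothetical helper: compare a file's MD5 digest with an expected value.
# Assumes GNU md5sum is available (on macOS, `md5 -q` is the usual substitute).
verify_md5() {
    actual=$(md5sum "$1" | awk '{print $1}')
    if [ "$actual" = "$2" ]; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1 (got $actual, expected $2)" >&2
        return 1
    fi
}

# Assumption: the download was saved here under its original name.
archive=gtdbk2_bacterial_v1.tar.gz

# Only check and unpack if the archive is actually present.
if [ -f "$archive" ]; then
    verify_md5 "$archive" eda7855cb38a14b4222381ac5b27fe4b && tar -xzf "$archive"
fi
```

The archive is unpacked only when the digest matches, so a truncated or corrupted download is caught before Kraken tries to read it.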
To download all fasta files and build the database from source, run the provided script with the input file from the GTDB project. Next, use the --add-to-library and --build options of kraken-build to format the database. Example commands are below.
Requirements:
- Edirect - https://www.ncbi.nlm.nih.gov/books/NBK179288/
- Perl
- BioPerl - https://bioperl.org/INSTALL.html
- Kraken or Kraken2
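Before running the script, it may help to confirm that the tools listed above are installed. This is a minimal sketch, not part of the repository: the missing_tools helper is hypothetical, and the choice of esearch and efetch as the Edirect commands to look for is an assumption about what the script calls.

```shell
# Hypothetical helper: print each named command that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# esearch and efetch ship with Edirect; which of them the script
# actually uses is an assumption.
missing=$(missing_tools esearch efetch perl)
if [ -n "$missing" ]; then
    echo "Missing from PATH:" $missing >&2
fi

# Bio::SeqIO is a BioPerl module; loading it confirms BioPerl is installed.
if command -v perl >/dev/null 2>&1; then
    perl -MBio::SeqIO -e 1 2>/dev/null || echo "BioPerl not found" >&2
fi
```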
# Creates the directory library/gtdb and adds fasta files.
# Also creates a taxonomy folder, compatible with Kraken.
perl scripts/gtdbToTaxonomy.pl --infile data/gtdb.2018-12-10.tsv
# Format the database.
db=GTDB_Kraken
for i in library/gtdb/*.fna; do
  kraken-build --add-to-library "$i" --db "$db"
done
mv -v taxonomy "$db"
kraken-build --build --db "$db" --threads 8
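Because either Kraken or Kraken2 may be installed, the same steps could be wrapped so that they use whichever builder is found. This is a minimal sketch under stated assumptions: the pick_builder helper is hypothetical, and it relies on kraken2-build accepting the same --add-to-library, --build, --db and --threads options as kraken-build. Whether the generated taxonomy folder works unchanged with Kraken2 has not been verified here.

```shell
# Hypothetical helper: print the first Kraken build tool found on PATH.
pick_builder() {
    for tool in kraken2-build kraken-build; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            return 0
        fi
    done
    echo "neither kraken2-build nor kraken-build found on PATH" >&2
    return 1
}

db=GTDB_Kraken

# Only run the build when a builder and the fasta library are both present.
if builder=$(pick_builder) && [ -d library/gtdb ]; then
    for i in library/gtdb/*.fna; do
        "$builder" --add-to-library "$i" --db "$db"
    done
    mv -v taxonomy "$db"
    "$builder" --build --db "$db" --threads 8
fi
```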
You can optionally remove all intermediate fasta files and also run the clean utility in Kraken.
rm -rvf library
kraken-build --clean --db "$db"
- Error 429 (Too Many Requests): NCBI is throttling requests that carry no API key. An NCBI API key gives you a higher request limit of your own. Log into your NCBI profile, copy your key, and add it to your environment like so:
export NCBI_API_KEY=1fe2...
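A quick check before a long download can confirm that the key is visible to the current shell. This is a minimal sketch; the have_ncbi_key helper is hypothetical and only tests that the variable is set and non-empty, not that NCBI accepts the key.

```shell
# Hypothetical helper: succeed only if NCBI_API_KEY is set and non-empty.
have_ncbi_key() {
    [ -n "${NCBI_API_KEY:-}" ]
}

# Report the status without printing the key itself.
if have_ncbi_key; then
    echo "NCBI_API_KEY is set"
else
    echo "NCBI_API_KEY is not set; requests may hit Error 429" >&2
fi
```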
- Henk den Bakker @hcdenbakker
- Lee Katz @lskatz