Recently NCBI released new BLAST databases containing NCBI-selected Reference and Representative Genome nucleotide assemblies. These databases promise cleaner results and a smaller overall size than our current references.
NCBI selects representative genomes based on the following rules:
- the only genome sequenced
- the only complete genome for species
- multiple complete genomes or no complete genome
  - Type strain
  - First complete genome sequenced
  - Human Microbiome reference
  - Highest assembly quality
`build_NCBI_rep_genomes.sh` is a script for automatically downloading and building the databases.
Usage:
. scripts/build_NCBI_rep_genomes.sh [build_id]
Where `build_id` is a build identifier. A new directory will be created within `NCBI_rep_genomes` with the build ID. For example, the identifier used for BLAST databases dated Oct. 27, 2016 was `20161027`.
build date | build ID | latest |
---|---|---|
Feb. 20, 2019 | 20190220 | yes |
Oct. 27, 2016 | 20161027 | no |
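The build IDs listed above follow a `YYYYMMDD` pattern. A minimal sketch of deriving one from today's date, assuming that convention holds for new builds; the `is_build_id` helper is hypothetical and not part of the build scripts:

```shell
# Hypothetical helper: succeed only if the argument looks like YYYYMMDD.
is_build_id() {
    case "$1" in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# Derive a build ID from today's date (same form as 20161027 for Oct. 27, 2016).
build_id="$(date +%Y%m%d)"

if is_build_id "$build_id"; then
    echo "Using build ID: $build_id"
    # The build itself would then be started as shown in the usage line:
    # . scripts/build_NCBI_rep_genomes.sh "$build_id"
fi
```

The check only validates the shape of the identifier (eight digits), not that it is a real calendar date.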
The 16S microbial databases contain Bacterial and Archaeal 16S rRNA sequences from the NCBI RefSeq targeted loci project (BioProject 33175).
A brief description:
The 16S ribosomal RNA targeted loci project is the result of an international collaboration between a number of ribosomal RNA databases and NCBI to provide a curated and comprehensive set of complete and near full length Reference Sequence records for phylogenetic and evolutionary analyses. Sequences that represent the consensus of all contributing databases in both sequence content and taxonomic assignment are promoted to RefSeqs. All sequences will have the same project ID and can be found as such.
`build_NCBI_16SMicrobial.sh` is a script for automatically downloading and building the databases.
Usage:
. scripts/build_NCBI_16SMicrobial.sh [build_id]
Where `build_id` is a build identifier. A new directory will be created within `NCBI_16SMicrobial` with the build ID.
build date | build ID | latest |
---|---|---|
Aug. 6, 2016 | 20160806 | yes |
Contains protozoa representative genomes.
`build_NCBI_rep_protozoan.sh` is a script for automatically downloading and building the databases.
Usage:
. scripts/build_NCBI_rep_protozoan.sh [build_id]
Where `build_id` is a build identifier. A new directory will be created within `NCBI_rep_protozoa` with the build ID.
build date | build ID | latest |
---|---|---|
Nov. 28, 2017 | 20171128 | yes |
Contains fungal representative genomes.
`build_NCBI_rep_fungi.sh` is a script for automatically downloading and building the databases.
Usage:
. scripts/build_NCBI_rep_fungi.sh [build_id]
Where `build_id` is a build identifier. A new directory will be created within `NCBI_rep_fungi` with the build ID.
build date | build ID | latest |
---|---|---|
Nov. 28, 2017 | 20171128 | yes |
Contains the parasite genomes database from WormBase ParaSite (WBPS).
`build_wormbase_parasite.sh` is a script for automatically downloading and building the databases.
Usage:
. scripts/build_wormbase_parasite.sh [build_id]
Where `build_id` is a build identifier. A new directory will be created within `wormbase_parasite` with the build ID.
build date | build ID | latest | comment |
---|---|---|---|
Nov. 28, 2017 | 20171128 | no | WBPS9 |
Mar. 6, 2019 | 20190306 | yes | WBPS12 |
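Because each build lives in a directory named after its build ID, the newest build on disk can be found by sorting directory names. A minimal sketch, assuming build IDs are always `YYYYMMDD` so that lexical order matches date order; `latest_build` is a hypothetical helper, and the "latest" column in the table remains the authoritative record:

```shell
# Hypothetical helper: print the newest YYYYMMDD-named subdirectory
# of a database directory (prints nothing if there is none).
latest_build() {
    local db_dir="$1"
    local d
    for d in "$db_dir"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]; do
        # Skip the unexpanded glob and anything that is not a directory.
        if [ -d "$d" ]; then
            basename "$d"
        fi
    done | sort | tail -n 1
}

# Example (run from the directory that contains wormbase_parasite):
# latest_build wormbase_parasite
```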
The plasma database contains reference sequences for organisms that may appear in human plasma.
Usage:
. scripts/build_plasmaDB.sh [build_id]
Where `build_id` is a build identifier. Before building, the build directory should contain files with information about the sequences to be included in the database. These can be:
- Accessions: Each file contains a list of accession numbers, one per line. The filename should match `acc.*`. The name of the resulting database will be the filename with the `acc.` prefix and the extension removed. For example, given the file `acc.human_viruses.txt`, the build script will create a database named `human_viruses`.
- Sequences: Each file is a multi-FASTA file containing sequences from the same taxonomic group. The filename should match `*.fasta` and should begin with the taxonomy ID to be added to the sequence names. For example, given the file `11103.HCV_references.fasta`, the build script will add the taxon ID `11103` to all sequences and create a database named `HCV_references`.
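The naming rules above can be sketched with shell parameter expansion. This is only an illustration of the convention, not the code used by `build_plasmaDB.sh`; the helper names `db_name_from_file` and `taxid_from_fasta` are hypothetical:

```shell
# Hypothetical helper: database name for an accession list or FASTA input file.
db_name_from_file() {
    local base="${1##*/}"          # strip any leading directory
    case "$base" in
        acc.*)
            base="${base#acc.}"    # drop the "acc." prefix
            echo "${base%.*}"      # drop the extension
            ;;
        *.fasta)
            base="${base%.fasta}"  # drop the ".fasta" extension
            echo "${base#*.}"      # drop the leading taxonomy ID
            ;;
        *)
            return 1               # not a recognized input file
            ;;
    esac
}

# Hypothetical helper: taxonomy ID prefix of a FASTA input file.
taxid_from_fasta() {
    local base="${1##*/}"
    echo "${base%%.*}"             # keep everything before the first dot
}

db_name_from_file acc.human_viruses.txt        # prints human_viruses
db_name_from_file 11103.HCV_references.fasta   # prints HCV_references
taxid_from_fasta 11103.HCV_references.fasta    # prints 11103
```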
build date | build ID | latest | comment |
---|---|---|---|
Aug. 6, 2016 | 20160806 | no | |
Jan. 17, 2017 | 20170117 | yes | separate DB for HCV, HIV, and human viruses |
Databases built for Kraken; see the Kraken manual for details.
Usage:
Full Kraken databases are built using `scripts/build_kraken.sh`:
. scripts/build_kraken.sh [build_id]
This will create several `sbatch` jobs that will download and build the databases.
We have also downloaded MiniKrakenDB, which is a pre-built 4 GB database constructed from complete bacterial, archaeal, and viral genomes in RefSeq (as of Dec. 8, 2014).
wget http://ccb.jhu.edu/software/kraken/dl/minikraken.tgz
tar xzf minikraken.tgz
build date | build ID | latest | comment |
---|---|---|---|
Jun. 14, 2017 | 20170614 | yes | See below for available databases |
The Jun. 14, 2017 build contains the following databases:
Contents | Name | Size (of *.kdb) |
---|---|---|
MiniKrakenDB (Bacteria, Archaea, Viruses) | minikraken_20141208 | 3.4 GB |
Bacteria, Archaea | p | 66 GB |
Viruses | v | 1.6 GB |
Human | h | 28 GB |
Bacteria, Archaea, Viruses | p+v | 68 GB |
Bacteria, Archaea, Viruses, Human | p+h+v | 97 GB |
These databases are for Centrifuge and were downloaded from CCB:
mkdir -p [build_id] && cd [build_id]
wget ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p_compressed.tar.gz
wget ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p_compressed+h+v.tar.gz
wget ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p+h+v.tar.gz
wget ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/nt.tar.gz
tar xzf p_compressed.tar.gz
tar xzf p_compressed+h+v.tar.gz
tar xzf p+h+v.tar.gz
tar xzf nt.tar.gz
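The four downloads above follow one pattern, so they can also be expressed as a loop. A minimal sketch that only prints the URLs by default; the base URL and archive names are copied from the commands above, while `centrifuge_urls` is a hypothetical helper:

```shell
# Base URL and archive names are taken from the wget commands above.
base_url="ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data"

# Hypothetical helper: print the download URL for each archive.
centrifuge_urls() {
    local db
    for db in p_compressed p_compressed+h+v p+h+v nt; do
        echo "${base_url}/${db}.tar.gz"
    done
}

# Dry run: list what would be downloaded.
centrifuge_urls

# To actually download and unpack inside the build directory:
# centrifuge_urls | while read -r url; do
#     wget "$url" && tar xzf "${url##*/}"
# done
```

Keeping the download step commented out makes the snippet safe to run as a check before committing to the larger transfers.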
build date | build ID | latest | comment |
---|---|---|---|
Dec. 6, 2016 | 20161206 | yes | See below for available databases |
There are 4 databases included with the Dec. 6, 2016 build:
Contents | Name | Size |
---|---|---|
Bacteria, Archaea (compressed) | p_compressed | 4.4 GB |
Bacteria, Archaea, Viruses, Human (compressed) | p_compressed+h+v | 5.4 GB |
Bacteria, Archaea, Viruses, Human | p+h+v | 7.9 GB |
NCBI nucleotide non-redundant sequences | nt | 50 GB |
Contains sequences for plant markers: trnH-psbA, ITS2, and rbcL.
PLEASE NOTE:
The build script does not download these sequences from an FTP site, because none was available.
The source of each marker is documented in the script `build_plant_markers.sh`.
Usage:
. scripts/build_plant_markers.sh [build_id]
Where `build_id` is a build identifier. A new directory will be created within `plant_markers` with the build ID.
build date | build ID | latest | comment |
---|---|---|---|
Nov. 29, 2017 | 20171129 | yes | See below for available databases |
The Nov. 29, 2017 build contains the following databases:
Contents | Name | Size |
---|---|---|
trnH-psbA | trnH_psbA_marker_seqs | 11 MB |
ITS2 | ITS2_marker_seqs | 1.9 GB |
rbcL | rbcL_marker_seqs | 409 MB |