Authors:
Peter Menzel pmenzel@gmail.com
Anders Krogh krogh@binf.ku.dk
Kaiju is a program for the taxonomic classification of high-throughput sequencing reads, e.g., Illumina or Roche/454, from whole-genome sequencing of metagenomic DNA. Reads are directly assigned to taxa using the NCBI taxonomy and a reference database of protein sequences from microbial and viral genomes.
The program is described in Menzel, P. et al. (2016) Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat. Commun. 7:11257 (open access).
Kaiju can be installed locally (see below) or used via a web server.
See the release notes for all releases here.
Copyright (c) 2015-2018 Peter Menzel and Anders Krogh
Kaiju is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
Kaiju is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the file LICENSE for more details.
You should have received a copy of the GNU General Public License along with the source code. If not, see http://www.gnu.org/licenses/.
Kaiju can be downloaded directly from GitHub either as a compressed archive or using the git command line client:
git clone https://github.com/bioinformatics-centre/kaiju.git
This will create the directory kaiju in the current directory.
Kaiju is written in C/C++11 for Linux and does not depend on additional libraries. For compiling Kaiju and its associated programs, type:
cd kaiju/src
make
Afterwards, Kaiju's executable files are available in the kaiju/bin directory.
You can add this directory to your shell's $PATH variable or copy the files to a directory in your PATH.
Before classification of reads, Kaiju's database index needs to be built from the reference protein database. You can either create a local index based on the currently available data from GenBank, or download one of the indexes used by the Kaiju web server.
For creating a local index, the program makeDB.sh in the bin/ directory will download the reference genomes and taxonomy files from the NCBI FTP server, convert them into a protein database and construct Kaiju's index (the Burrows-Wheeler transform and the FM-index) in one go.
It is recommended to create a new directory for downloading the files and database construction, for example:
mkdir kaijudb
cd kaijudb
makeDB.sh [-r|-p|-v|-n|-e|-m]
The downloaded files are several GB in size. Therefore, the program should be run in a directory having at least 80 GB of free space.
There are several options for creating the reference database with protein sequences from different source databases:
makeDB.sh -r
Download only completely assembled and annotated reference genomes of Archaea and Bacteria from the NCBI RefSeq database.
Additionally, viral genomes from NCBI RefSeq can be added by using the option -v.
As of February 2018, this database contains ca. 33M protein sequences, which amounts to a requirement of 21GB RAM for running Kaiju.
makeDB.sh -p
Download the protein sequences belonging to the representative set of genomes from the proGenomes database.
This dataset generally covers a broader phylogenetic range compared to the RefSeq dataset, and is therefore recommended, especially for environmental samples.
Additionally, viral genomes from NCBI RefSeq can be added by using the option -v.
As of February 2018, this database contains ca. 19M protein sequences, which amounts to a requirement of 13GB RAM for running Kaiju.
makeDB.sh -v
This will only download viral genomes from NCBI RefSeq.
As of February 2018, this database contains ca. 310k protein sequences, which amounts to a requirement of 260MB RAM for running Kaiju.
makeDB.sh -n
Download the nr database that is used by NCBI BLAST and extract proteins belonging
to Archaea, Bacteria and Viruses.
makeDB.sh -e
Download the nr database as above, but additionally include proteins from fungi and microbial eukaryotes.
The complete taxon list for this option is in the file bin/taxonlist.tsv.
Because the nr database contains more proteins, more RAM is needed for index construction and for running Kaiju. As of February 2018, the nr database with option -e contains ca. 124M protein sequences, which amounts to a requirement of 68GB RAM for running Kaiju.
makeDB.sh -m
Download the protein sequences belonging to the genomes
from the MarRef and MarDB databases.
This dataset specifically covers marine metagenomics.
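The RAM figures quoted above can be put into a small lookup to see which database options fit a given machine. This is only an illustrative sketch: the function name options_fitting is hypothetical, the numbers are the February 2018 figures from the text (they will have grown since), and options -n and -m are left out because no figure is given for them.

```python
# RAM needed to run Kaiju, in GB, as stated in the text for February 2018.
# Assumption: the figure for -r is taken as given; the text does not say
# whether it includes the optional viral genomes.
RAM_GB = {"-r": 21, "-p": 13, "-v": 0.26, "-e": 68}


def options_fitting(available_gb):
    """Return the makeDB.sh options whose index fits into available_gb,
    largest database first."""
    fitting = [opt for opt, need in RAM_GB.items() if need <= available_gb]
    return sorted(fitting, key=RAM_GB.get, reverse=True)


if __name__ == "__main__":
    print(options_fitting(16))
```

Some headroom beyond the listed figure is advisable, since the operating system and other processes also need memory.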
When using option -r, makeDB.sh downloads and extracts 5 genomes from the NCBI FTP server in parallel. This number can be changed by modifying the appropriate variables at the beginning of the script.
By default, makeDB.sh uses 5 parallel threads for constructing the index, which can be changed by using the option -t. Note that a higher number of threads increases the memory usage during index construction, while reducing the number of threads decreases memory usage.
After makeDB.sh is finished, only the files kaiju_db.fmi (or kaiju_db_nr.fmi / kaiju_db_nr_euk.fmi), nodes.dmp, and names.dmp are needed to run Kaiju. The remaining files and the genomes/ directory containing the downloaded genomes can be deleted.
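Before deleting the remaining files, it may be worth checking that the three required files are really present. The helper below is a hypothetical sketch (missing_kaiju_files is not part of Kaiju); it only tests for the file names listed above.

```python
from pathlib import Path


def missing_kaiju_files(directory, index_name="kaiju_db.fmi"):
    """Return the names of required files that are absent from directory.

    index_name should be kaiju_db_nr.fmi or kaiju_db_nr_euk.fmi when the
    index was built with makeDB.sh -n or -e.
    """
    required = [index_name, "nodes.dmp", "names.dmp"]
    return [name for name in required if not (Path(directory) / name).is_file()]


if __name__ == "__main__":
    missing = missing_kaiju_files(".")
    print("missing:", ", ".join(missing) if missing else "nothing")
```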
It is also possible to make a custom database from a collection of protein sequences. The format needs to be a FASTA file in which the headers are the numeric NCBI taxon identifiers of the protein sequences, which can optionally be prefixed by another identifier (e.g. a counter) followed by an underscore, for example:
>1_1358
MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKRFISERGKILPRRVTGTSAKNQRKVVNAIKRARVMALLPFVAEDQN
>2_44689
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ
>3_352472
MKTKSSNNIKKIYYISSILVGIYLCWQIIIQIIFLMDNSIAILEAIGMVVFISVYSLAVAINGWILVGRMKKSSKKAQYEDFYKKMILKSKILLSTIIIVIIVVVVQDIVINFILPQNPQPYVYMIISNFIVGIADSFQMIMVIFVMGELSFKNYFKFKRIEKQKNHIVIGGSSLNSLPVSLPTVKSNESNESNTISINSENNNSKVSTDDTINNVM
>4_91061
MTNPFENDNYTYKVLKNEEGQYSLWPAFLDVPIGWNVVHKEASRNDCLQYVENNWEDLNPKSNQVGKKILVGKR
...
The taxon identifiers must be contained in the NCBI taxonomy files nodes.dmp and names.dmp.
Then, Kaiju's index is created using the programs mkbwt and mkfmi. For example, if the database FASTA file is called proteins.faa, then run:
mkbwt -n 5 -a ACDEFGHIKLMNPQRSTVWY -o proteins proteins.faa
mkfmi proteins
which creates the file proteins.fmi that is used by Kaiju. Note that the protein sequences may only contain the uppercase characters of the standard 20 amino acids; all other characters need to be removed.
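The format rules for a custom database can be checked before running mkbwt. The following is a minimal sketch, not part of Kaiju: the function check_fasta and its messages are hypothetical, and it assumes that any prefix before the last underscore in a header is acceptable.

```python
import re

# Header: optional "<prefix>_" followed by a numeric NCBI taxon identifier.
# Assumption: the prefix may contain any non-whitespace characters.
HEADER_RE = re.compile(r"^>(?:\S+_)?(\d+)$")
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def check_fasta(lines):
    """Return a list of (line_number, message) problems found in FASTA lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith(">"):
            if HEADER_RE.match(line) is None:
                problems.append((number, "header is not [prefix_]taxon_id"))
        else:
            # Anything outside the 20 uppercase amino acid letters is invalid.
            bad = sorted(set(line) - AMINO_ACIDS)
            if bad:
                problems.append((number, "invalid characters: " + "".join(bad)))
    return problems


if __name__ == "__main__":
    with open("proteins.faa") as handle:
        for number, message in check_fasta(handle):
            print(f"line {number}: {message}")
```

The check does not verify that the taxon identifiers occur in nodes.dmp and names.dmp; that would need a separate lookup.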
Kaiju requires at least three arguments:
kaiju -t nodes.dmp -f kaiju_db.fmi -i inputfile.fastq
If you chose options -n or -e in makeDB.sh, then use -f kaiju_db_nr.fmi or -f kaiju_db_nr_euk.fmi, respectively.
For paired-end reads use -i firstfile.fastq and -j secondfile.fastq.
The reads must be in the same order in both files. Kaiju will strip suffixes from the read names by deleting all characters after a / or space. The read names are then compared between the first and second file and an error is issued if they are not identical.
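The read name comparison can be sketched as follows. This is an illustration of the rule described above rather than Kaiju's own code; the function names are hypothetical, and it assumes that the / or space itself is removed together with everything after it.

```python
import re


def strip_read_suffix(name):
    """Drop everything from the first '/' or space onward."""
    return re.split(r"[/ ]", name, maxsplit=1)[0]


def mates_match(name1, name2):
    """True if both read names are identical after suffix stripping."""
    return strip_read_suffix(name1) == strip_read_suffix(name2)


if __name__ == "__main__":
    print(strip_read_suffix("read7/1"), mates_match("read7/1", "read7/2"))
```

Running such a check over both files before classification can reveal files that are out of order.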
Kaiju can read input files in FASTQ and FASTA format, which may also be gzip-compressed.
By default, Kaiju will print the output to the terminal (STDOUT).
The output can also be written to a file using the -o option:
kaiju -t nodes.dmp -f kaiju_db.fmi -i inputfile.fastq -o kaiju.out
Kaiju can use multiple parallel threads, which can be specified with the -z option, e.g. for using 25 parallel threads:
kaiju -z 25 -t nodes.dmp -f kaiju_db.fmi -i inputfile.fastq -o kaiju.out
The default run mode is MEM, which only considers exact matches.
To use Greedy mode, which allows mismatches, set the mode via the option -a and the number of allowed substitutions via the option -e:
kaiju -t nodes.dmp -f kaiju_db.fmi -i inputfile.fastq -a greedy -e 5
The cutoffs for minimum required match length and match score can be changed using the options -m (default: 11) and -s (default: 65).
In Greedy mode, the option -E can be used to filter matches by E-value, similar to blastp.
For example, a cutoff of 0.05 can be set by:
kaiju -t nodes.dmp -f kaiju_db.fmi -i inputfile.fastq -a greedy -e 5 -E 0.05
NB: The thresholds for minimum match length and score are still applied.
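When Kaiju is called from a script, the options shown so far can be assembled into an argument list. The function kaiju_command and its parameter names are hypothetical; only the flags -t, -f, -i, -j, -o, -z, -a, -e and -E come from the text above.

```python
def kaiju_command(nodes, index, reads, mates=None, output=None,
                  threads=None, greedy_mismatches=None, evalue=None):
    """Build the argument list for a kaiju call (nothing is executed here)."""
    cmd = ["kaiju", "-t", nodes, "-f", index, "-i", reads]
    if mates is not None:
        cmd += ["-j", mates]            # second file of paired-end reads
    if output is not None:
        cmd += ["-o", output]           # write to a file instead of STDOUT
    if threads is not None:
        cmd += ["-z", str(threads)]
    if greedy_mismatches is not None:
        cmd += ["-a", "greedy", "-e", str(greedy_mismatches)]
        if evalue is not None:
            cmd += ["-E", str(evalue)]
    elif evalue is not None:
        # The text describes -E for Greedy mode only.
        raise ValueError("-E is only meaningful together with Greedy mode")
    return cmd


if __name__ == "__main__":
    print(" ".join(kaiju_command("nodes.dmp", "kaiju_db.fmi", "inputfile.fastq",
                                 greedy_mismatches=5, evalue=0.05)))
```

The resulting list can be passed to subprocess.run(cmd, check=True), which avoids shell quoting problems with unusual file names.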
If the input sequences are already protein sequences, use option -p to disable translation of the input.
Option -x enables filtering of query sequences containing low-complexity regions by using the SEG algorithm from the blast+ package. It is enabled by default and can be disabled by the -X option. SEG filtering is always recommended in order to avoid false positive taxon assignments that are caused by spurious matches due to simple repeat patterns or other sequencing noise.
Kaiju will print one line for each read or read pair.
The default output format contains three columns separated by tabs.
Using the option -v enables the verbose output, which will print additional columns. In order, the columns are:
- either C or U, indicating whether the read is classified or unclassified.
- name of the read
- NCBI taxon identifier of the assigned taxon
- the length or score of the best match used for classification
- the taxon identifiers of all database sequences with the best match
- the accession numbers of all database sequences with the best match
- matching fragment sequence(s)
NB: Since the nr database aggregates multiple genes of identical sequences, only the first accession number for each sequence in the database is kept in Kaiju's database and therefore also in the output file.
The number of taxon identifiers (column 5) and accession numbers (column 6) is limited to 20 entries each in order to reduce large outputs produced by highly abundant protein sequences in nr, e.g. from HIV.
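For downstream processing, one output line can be split into named fields following the column order listed above. This is a sketch under assumptions: KaijuRecord and parse_kaiju_line are hypothetical names, column 4 is assumed to be an integer, and columns 5 to 7 are kept as raw strings because their internal separator is not specified here.

```python
from typing import NamedTuple, Optional


class KaijuRecord(NamedTuple):
    classified: bool                         # column 1: C or U
    read_name: str                           # column 2
    taxon_id: int                            # column 3
    score: Optional[int] = None              # column 4 (verbose only)
    match_taxon_ids: Optional[str] = None    # column 5, raw text
    match_accessions: Optional[str] = None   # column 6, raw text
    match_fragments: Optional[str] = None    # column 7, raw text


def parse_kaiju_line(line):
    """Parse one tab-separated Kaiju output line (default or verbose)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3 or fields[0] not in ("C", "U"):
        raise ValueError("not a Kaiju output line: %r" % line)
    score = int(fields[3]) if len(fields) > 3 else None
    extras = fields[4:7]
    extras += [None] * (3 - len(extras))     # pad missing verbose columns
    return KaijuRecord(fields[0] == "C", fields[1], int(fields[2]), score, *extras)


if __name__ == "__main__":
    with open("kaiju.out") as handle:
        classified = sum(parse_kaiju_line(line).classified for line in handle)
    print(classified, "classified reads")
```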
The accuracy of the classification depends both on the choice of the reference database and the chosen options when running Kaiju. These choices also affect the speed and memory usage of Kaiju.
For highest sensitivity, it is recommended to use the nr database (+eukaryotes) as a reference database because it is the most comprehensive set of protein sequences. Alternatively, use proGenomes over RefSeq for increased sensitivity.
Greedy run mode yields a higher sensitivity compared with MEM mode.
For fastest classification, use MEM mode and multiple parallel threads (-z); for lowest memory usage, use the proGenomes reference database. The number of parallel threads has little impact on memory usage.
Further, the choice of the minimum required match length (-m) in MEM mode or match score (-s) in Greedy mode governs the trade-off between sensitivity and precision of the classification. Please refer to the paper for a discussion on this topic.
The program kaiju2krona can be used to convert Kaiju's tab-separated output file into a tab-separated text file, which can be imported into Krona. It requires the nodes.dmp and names.dmp files from the NCBI taxonomy for mapping the taxon identifiers from Kaiju's output to the corresponding taxon names.
kaiju2krona -t nodes.dmp -n names.dmp -i kaiju.out -o kaiju.out.krona
The file kaiju.out.krona can then be imported into Krona and converted into an HTML file using Krona's ktImportText program:
ktImportText -o kaiju.out.html kaiju.out.krona
The program kaijuReport can convert Kaiju's tab-separated output file into a summary report file for a given taxonomic rank, e.g., genus. It requires the nodes.dmp and names.dmp files for mapping the taxon identifiers from Kaiju's output to the corresponding taxon names.
kaijuReport -t nodes.dmp -n names.dmp -i kaiju.out -r genus -o kaiju.out.summary
The program can also filter out taxa with low abundances, e.g. for only showing genera that comprise at least 1 percent of the total reads:
kaijuReport -t nodes.dmp -n names.dmp -i kaiju.out -r genus -m 1 -o kaiju.out.summary
or for showing genera comprising at least 1 percent of all classified reads:
kaijuReport -t nodes.dmp -n names.dmp -i kaiju.out -r genus -m 1 -u -o kaiju.out.summary
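The two thresholds differ only in the denominator of the percentage. The small calculation below illustrates this with made-up read counts; abundance_percent is a hypothetical helper, not a kaijuReport function.

```python
def abundance_percent(taxon_reads, classified_reads, unclassified_reads,
                      classified_only=False):
    """Percentage of reads assigned to a taxon.

    classified_only=False mirrors -m alone (share of all reads);
    classified_only=True mirrors -m together with -u (share of classified reads).
    """
    total = classified_reads
    if not classified_only:
        total += unclassified_reads
    if total == 0:
        return 0.0
    return 100.0 * taxon_reads / total


if __name__ == "__main__":
    # Made-up example: 8 reads for a genus, 600 classified, 400 unclassified.
    print(abundance_percent(8, 600, 400))
    print(abundance_percent(8, 600, 400, classified_only=True))
```

With these example counts the genus falls below 1 percent of all reads but above 1 percent of the classified reads, so it would be reported only when -u is added.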
Option -p will print the full taxon path instead of just the taxon name.
The program addTaxonNames appends the name that corresponds to the taxon id in Kaiju's output file as a last column to the output.
addTaxonNames -t nodes.dmp -n names.dmp -i kaiju.out -o kaiju.names.out
Option -u will omit unclassified reads.
Option -p will print the full taxon path instead of just the taxon name.
Option -r will print the path restricted to the specified ranks. For example, -r phylum,genus will append the names of phylum and genus to the end of each line.
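The basic name lookup can be sketched in a few lines. This is not the addTaxonNames implementation: the function names and the NA placeholder for unknown identifiers are assumptions, and only the default behaviour (no -u, -p or -r) is shown. The names.dmp layout used here (fields separated by a tab, a vertical bar and a tab, with the name class in the fourth field) is the standard NCBI taxonomy dump format.

```python
def load_scientific_names(names_dmp_lines):
    """Map taxon id -> scientific name from lines of an NCBI names.dmp file."""
    names = {}
    for line in names_dmp_lines:
        # Fields are separated by "\t|\t" and each line ends with "\t|".
        fields = line.rstrip("\t|\n").split("\t|\t")
        if len(fields) >= 4 and fields[3] == "scientific name":
            names[int(fields[0])] = fields[1]
    return names


def append_taxon_name(kaiju_line, names, missing="NA"):
    """Append the name of the taxon in column 3 as a new last column."""
    line = kaiju_line.rstrip("\n")
    taxon_id = int(line.split("\t")[2])
    return line + "\t" + names.get(taxon_id, missing)


if __name__ == "__main__":
    with open("names.dmp") as handle:
        names = load_scientific_names(handle)
    with open("kaiju.out") as handle:
        for line in handle:
            print(append_taxon_name(line, names))
```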
The program mergeOutputs can merge two tab-separated output files in the column format (see above) used by Kaiju and Kraken. Only the first three columns are used.
The files need to be sorted by the read name in the second column, for example by:
sort -k2,2 kaiju.out >kaiju.out.sort
sort -k2,2 kraken.out >kraken.out.sort
Then both files can be merged:
mergeOutputs -i kaiju.out.sort -j kraken.out.sort -o combined.out -v
The shell's process substitution can be used for sorting without creating intermediate files:
mergeOutputs -i <(sort -k2,2 kaiju.out) -j <(sort -k2,2 kraken.out) -o combined.out -v
The output file will be in the same column format as the input files (but only contain the first three columns) and it will have the same length as the input files (which also have to be of the same length). In the case of conflicting taxon identifiers for a classified read in both input files, mergeOutputs will use the identifier found in the first input file (specified by -i). This behavior can be changed using the -c option, which can take four possible values:
- 1: use taxon identifier from the first input file (default)
- 2: use taxon identifier from the second input file
- lca: use the least common ancestor of the taxon identifiers from both files
- lowest: use the lowest ranking of the two taxon identifiers if they are within the same lineage; otherwise use the LCA
Options lca and lowest require the path to the file nodes.dmp by using the -t option.
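The four conflict rules can be illustrated on a toy taxonomy. The sketch below is not the mergeOutputs source code: the function names, the parent dictionary and the taxon ids in the example are made up, and it assumes that both reads are classified and that the root node is its own parent, as in nodes.dmp.

```python
def lineage(taxon_id, parent):
    """Path from taxon_id up to the root (the node that is its own parent)."""
    path = [taxon_id]
    while parent[path[-1]] != path[-1]:
        path.append(parent[path[-1]])
    return path


def lca(a, b, parent):
    """Least common ancestor of two taxa in the parent map."""
    ancestors_of_a = set(lineage(a, parent))
    for node in lineage(b, parent):
        if node in ancestors_of_a:
            return node
    raise ValueError("taxa do not share a root")


def resolve(a, b, parent, mode="1"):
    """Pick one taxon id for a read classified as a in file 1 and b in file 2."""
    if mode not in ("1", "2", "lca", "lowest"):
        raise ValueError("unknown mode: " + mode)
    if a == b or mode == "1":
        return a
    if mode == "2":
        return b
    common = lca(a, b, parent)
    if mode == "lowest":
        # Same lineage: one taxon is an ancestor of the other, keep the deeper one.
        if common == a:
            return b
        if common == b:
            return a
    return common


if __name__ == "__main__":
    # Toy tree: 1 is the root, 2 its child, 10 and 11 children of 2, 100 child of 10.
    toy_parent = {1: 1, 2: 1, 10: 2, 11: 2, 100: 10}
    for mode in ("1", "2", "lca", "lowest"):
        print(mode, resolve(100, 10, toy_parent, mode), resolve(100, 11, toy_parent, mode))
```

In the toy tree, taxa 100 and 10 lie in one lineage, so lowest keeps 100, whereas 100 and 11 are in different branches and both lca and lowest fall back to their common ancestor 2.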
The programs kaijux and kaijup can be used for finding the best matching database sequence for each query sequence without taxonomic classification, i.e., they will just print the name of the database sequence. Thus, both programs do not use the nodes.dmp file containing the taxonomy, but only need the .fmi database file. While kaijux takes nucleotide sequences as input and translates them into the six reading frames like standard kaiju, kaijup takes protein sequences as input, which are directly searched in the database. All other parameters remain the same as in standard kaiju. In case of paired-end reads, both mates are searched independently.
To build an index for a custom database, all sequences need to be in a single FASTA file and may only contain the 20 letters from the standard protein alphabet ACDEFGHIKLMNPQRSTVWY.
For example, building the index (the Burrows-Wheeler transform and FM-index) from the file with the protein sequences proteins.faa is done in two steps by the programs mkbwt and mkfmi:
mkbwt -n 5 -a ACDEFGHIKLMNPQRSTVWY -o proteins proteins.faa
mkfmi proteins
This will create two intermediate files proteins.bwt and proteins.sa, and finally the file proteins.fmi, which is used by Kaiju.
The option -n for mkbwt specifies the number of parallel threads. The more threads are used, the higher the memory consumption becomes. The option -e for mkbwt specifies the exponent of the suffix array checkpoint distances and therefore determines the trade-off between the size of the suffix array and the speed of the search. The default value is 5.
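The effect of the exponent can be estimated with simple arithmetic. The sketch below assumes that the checkpoint distance is 2 raised to the value of -e; the text only speaks of an exponent, so the base 2 is an assumption, and both function names are hypothetical.

```python
def checkpoint_distance(exponent=5):
    """Assumed distance between stored suffix array entries: 2 ** exponent."""
    return 1 << exponent


def stored_entries(text_length, exponent=5):
    """Number of suffix array entries kept when every
    checkpoint_distance-th position (starting at 0) is stored."""
    distance = checkpoint_distance(exponent)
    return -(-text_length // distance)  # ceiling division


if __name__ == "__main__":
    for e in (3, 5, 7):
        print(e, checkpoint_distance(e), stored_entries(1_000_000, e))
```

Under this assumption, each increment of the exponent halves the number of stored entries, at the cost of more steps per lookup during the search.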