Kraken is a fast taxonomic classifier for metagenomics data. This project, kraken-hll, adds some additional functionality, most notably a unique k-mer count using the HyperLogLog algorithm. Spurious identifications due to sequence contamination in the dataset or database often attract many reads, but those reads usually cover only a small portion of the genome.
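To make the difference between read count and unique k-mer count concrete, here is a minimal Python sketch. It counts distinct k-mers exactly with a set, which is only a stand-in for the HyperLogLog estimate and is not KrakenHLL's implementation. The function names and the toy sequences are hypothetical.

```python
def kmers(seq, k):
    """Yield every k-mer (substring of length k) of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def unique_kmer_count(reads, k):
    """Exact number of distinct k-mers across all reads (a set, not HyperLogLog)."""
    seen = set()
    for read in reads:
        seen.update(kmers(read, k))
    return len(seen)


# Hypothetical toy data: ten copies of one read, as a contaminated
# region might produce.
repeated = ["ACGTACGTAC"] * 10

# Ten different reads taken from overlapping windows of a longer toy genome.
genome = "ACGGTCATTGCAAGCTTAGGCATCGATTACGGA"
spread = [genome[i:i + 10] for i in range(0, 20, 2)]

print(len(repeated), unique_kmer_count(repeated, 5))
print(len(spread), unique_kmer_count(spread, 5))
```

Both sets contain ten reads, but the repeated reads contribute far fewer distinct k-mers than the reads spread along the genome, which is the signal the unique k-mer count exposes.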
KrakenHLL computes the number of unique k-mers observed for each taxon, which makes it possible to filter out more false positives. Here's a small example of a classification against a viral database with k=25. Three species are identified by just one read: Enterobacteria phage BP-4795, Salmonella phage SEN22, and Sulfolobus monocaudavirus SMV1. Of those, the identification of Salmonella phage SEN22 is the strongest, as the read was matched with 116 k-mers that are unique to the sequence, while the match to Sulfolobus monocaudavirus SMV1 is based on only a single 25-mer.
99.0958 2192 2192 255510 272869 no rank 0 unclassified
0.904159 20 0 2361 2318 no rank 1 root
0.904159 20 0 2361 2318 superkingdom 10239 Viruses
0.904159 20 0 2361 2318 no rank 35237 dsDNA viruses, no RNA stage
0.768535 17 0 2074 2063 order 548681 Herpesvirales
0.768535 17 0 2074 2063 family 10292 Herpesviridae
0.768535 17 0 2074 2063 subfamily 10374 Gammaherpesvirinae
0.768535 17 0 2074 2063 genus 10375 Lymphocryptovirus
0.768535 17 16 2001 1987 species 10376 Human gammaherpesvirus 4
0.045208 1 1 4 4 sequence 1000041143 KC207814.1 Human herpesvirus 4 strain Mutu, complete genome
0.0904159 2 0 254 254 order 28883 Caudovirales
0.045208 1 0 28 28 family 10699 Siphoviridae
0.045208 1 0 28 28 genus 186765 Lambdavirus
0.045208 1 0 28 28 no rank 335795 unclassified Lambda-like viruses
0.045208 1 1 28 28 species 196242 Enterobacteria phage BP-4795
0.045208 1 0 116 116 family 10744 Podoviridae
0.045208 1 0 116 116 no rank 196895 unclassified Podoviridae
0.045208 1 0 116 116 no rank 1758253 Escherichia phage phi191 sensu lato
0.045208 1 1 116 116 species 1647458 Salmonella phage SEN22
0.045208 1 0 1 1 no rank 51368 unclassified dsDNA viruses
0.045208 1 1 1 1 species 1351702 Sulfolobus monocaudavirus SMV1
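A report like the one above can be post-filtered by k-mer count. The following Python sketch is one way to do it, under two assumptions that the text does not confirm: that the report file is tab-separated with the eight columns seen in the example, and that the fourth column is the k-mer count to filter on (in the one-read rows the fourth and fifth columns are equal, so the choice does not matter there). The names ReportRow, parse_report_line and well_supported are hypothetical.

```python
from typing import NamedTuple


class ReportRow(NamedTuple):
    percent: float
    reads: int
    tax_reads: int
    kmers: int   # assumed: k-mer count used for filtering (4th column)
    col5: int    # 5th column; its exact meaning is not stated in the text
    rank: str
    taxid: int
    name: str


def parse_report_line(line):
    """Parse one report line, assuming eight tab-separated columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 tab-separated fields, got {len(fields)}")
    pct, reads, tax_reads, kmers, col5, rank, taxid, name = fields
    return ReportRow(float(pct), int(reads), int(tax_reads), int(kmers),
                     int(col5), rank, int(taxid), name.strip())


def well_supported(rows, min_kmers, rank="species"):
    """Keep rows of the given rank with at least min_kmers k-mers."""
    return [r for r in rows if r.rank == rank and r.kmers >= min_kmers]


# The three one-read species from the example, re-typed with tabs.
sample = [
    "0.045208\t1\t1\t28\t28\tspecies\t196242\tEnterobacteria phage BP-4795",
    "0.045208\t1\t1\t116\t116\tspecies\t1647458\tSalmonella phage SEN22",
    "0.045208\t1\t1\t1\t1\tspecies\t1351702\tSulfolobus monocaudavirus SMV1",
]
rows = [parse_report_line(line) for line in sample]
kept = [r.name for r in well_supported(rows, min_kmers=10)]
print(kept)
```

With a threshold of 10 k-mers, the single-25-mer match to Sulfolobus monocaudavirus SMV1 is dropped while the other two species are kept. The threshold is illustrative, not a recommendation.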
For usage, see krakenhll --help. Note that you can use the same database as Kraken with one difference: instead of the files DB_DIR/taxonomy/nodes.dmp and DB_DIR/taxonomy/names.dmp that Kraken relies upon, kraken-hll needs the file DB_DIR/taxDB. This can be generated with the script build_taxdb:
KRAKEN_DIR/build_taxdb DB_DIR/taxonomy/names.dmp DB_DIR/taxonomy/nodes.dmp > DB_DIR/taxDB
The code behind the taxDB is based on k-SLAM.
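If the taxDB needs to be generated from a script rather than by hand, the build_taxdb call can be wrapped as in the Python sketch below. It only reproduces the command given in the text (names.dmp first, then nodes.dmp, standard output redirected to DB_DIR/taxDB). The helper names taxdb_command and build_taxdb_file are hypothetical.

```python
import subprocess
from pathlib import Path


def taxdb_command(kraken_dir, db_dir):
    """Return the build_taxdb argument list in the order given in the text."""
    taxonomy = Path(db_dir) / "taxonomy"
    return [
        str(Path(kraken_dir) / "build_taxdb"),
        str(taxonomy / "names.dmp"),
        str(taxonomy / "nodes.dmp"),
    ]


def build_taxdb_file(kraken_dir, db_dir):
    """Run build_taxdb and write its standard output to DB_DIR/taxDB."""
    out_path = Path(db_dir) / "taxDB"
    with open(out_path, "w") as out:
        # check=True raises CalledProcessError if build_taxdb exits non-zero.
        subprocess.run(taxdb_command(kraken_dir, db_dir), stdout=out, check=True)
    return out_path
```

Calling build_taxdb_file("KRAKEN_DIR", "DB_DIR") with real paths would be the equivalent of the shell redirect. Note that a failed run leaves a partial taxDB file behind.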
- Use krakenhll --report-file FILENAME ... to write the Kraken report to FILENAME.
- Use krakenhll --db DB1 --db DB2 --db DB3 ... to first attempt, for each k-mer, to assign it based on DB1, then DB2, then DB3. You can use this to prefer identifications based on DB1 (e.g. human and contaminant sequences), then DB2 (e.g. completed bacterial genomes), then DB3, etc. Note that this option is incompatible with krakenhll-build --generate-taxonomy-ids-for-sequences, since the taxDB must be exactly the same across the databases.
- Add the suffix .gz to output file names to generate gzipped output files.
- Use krakenhll-build --generate-taxonomy-ids-for-sequences ... to add pseudo-taxonomy IDs for each sequence header. An example of the result is in the output above: one read has been assigned specifically to KC207814.1 Human herpesvirus 4 strain Mutu, complete genome.
- The seqid2taxid.map file, which maps sequence IDs to taxonomy IDs, does NOT parse or require >gi|; instead, the sequence ID is the header up to just before the first space.
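The database priority described for multiple --db options can be pictured with a small Python sketch. Plain dictionaries stand in for the databases here, and this is an illustration of the lookup order only, not KrakenHLL's actual code. The function name assign_kmer and the toy k-mers are hypothetical.

```python
def assign_kmer(kmer, databases):
    """Return the taxonomy ID from the first database containing the k-mer.

    databases is an ordered list (DB1, DB2, DB3, ...); later databases are
    consulted only when all earlier ones lack the k-mer.
    """
    for db in databases:
        if kmer in db:
            return db[kmer]
    return None  # k-mer is in none of the databases


# Hypothetical toy databases mapping k-mers to taxonomy IDs.
db1 = {"ACGTA": 9606}                 # e.g. human and contaminant sequences
db2 = {"ACGTA": 562, "CGTAC": 562}    # e.g. completed bacterial genomes
db3 = {"GTACG": 10376}

for kmer in ["ACGTA", "CGTAC", "GTACG", "TTTTT"]:
    print(kmer, assign_kmer(kmer, [db1, db2, db3]))
```

The k-mer ACGTA is present in both DB1 and DB2, and the DB1 assignment wins because DB1 is listed first.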
By default, OSX links g++ to clang, which lacks OpenMP support. You can install g++ with Homebrew and use the -c option of install_krakenhll.sh to specify the Homebrew g++:
brew install gcc
./install_krakenhll.sh -c g++-8
Currently, KrakenHLL build depends on Jellyfish v1.1.11. To install Jellyfish alongside KrakenHLL, use the -j flag of the install_krakenhll.sh script. Alternatively, you can specify the Jellyfish path to krakenhll with krakenhll --jellyfish-bin /usr/bin/jellyfish1.
KrakenHLL supports building databases on subsets of the NCBI nucleotide collection nr/nt, which is best known as the standard database for BLASTn. On the command line, you can choose to extract all bacterial, viral, archaeal, protozoan, fungal and helminth sequences. The list of protozoan taxa is based on Kaiju's.
Example command line:
krakenhll-download --db DB --taxa "archaea,bacteria,viral,fungi,protozoa,helminths" --dust --exclude-environmental-taxa nt
To build a custom database with the NCBI taxonomy, first download the taxonomy files with
krakenhll-download --db DBDIR taxonomy
Then you can add the desired sequence files to the DBDIR/library directory:
cp SEQ1.fa SEQ2.fa DBDIR/library
KrakenHLL needs a sequence ID to taxonomy ID mapping for each sequence. These mappings can be provided in DBDIR/library/*.map files; KrakenHLL pools all .map files inside the library/ folder prior to database building. Format: three tab-separated fields that are, in order, the sequence ID (i.e. the sequence header without '>' up to the first space), the taxonomy ID and the genome or assembly name:
Strain1_Chr1_Seq <tab> 562 <tab> E. Coli Strain Foo
Strain1_Chr2_Seq <tab> 562 <tab> E. Coli Strain Foo
Strain1_Plasmid1_Seq <tab> 562 <tab> E. Coli Strain Foo
Strain2_Chr1_Seq <tab> 621 <tab> S. boydii Strain Bar
Strain2_Plasmid1_Seq <tab> 621 <tab> S. boydii Strain Bar
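A .map file in this format can be generated from FASTA headers. The Python sketch below assumes that every sequence in a given FASTA file belongs to the same taxonomy ID and genome, which will not hold for mixed files. The function names sequence_id and map_lines and the example headers are hypothetical.

```python
def sequence_id(header):
    """Sequence ID: the header without the leading '>' up to the first space."""
    header = header.rstrip("\n")
    if header.startswith(">"):
        header = header[1:]
    return header.split(" ", 1)[0]


def map_lines(fasta_lines, taxid, genome_name=None):
    """Yield one tab-separated .map line per FASTA header in fasta_lines.

    Assumes all sequences share the same taxonomy ID; the genome or
    assembly name is written as a third field only when it is given.
    """
    for line in fasta_lines:
        if line.startswith(">"):
            fields = [sequence_id(line), str(taxid)]
            if genome_name is not None:
                fields.append(genome_name)
            yield "\t".join(fields)


# Hypothetical FASTA content for one strain.
fasta = [
    ">Strain1_Chr1_Seq Escherichia coli chromosome 1\n",
    "ACGTACGT\n",
    ">Strain1_Plasmid1_Seq plasmid\n",
    "TTGACCAA\n",
]
for entry in map_lines(fasta, 562, "E. Coli Strain Foo"):
    print(entry)
```

The resulting lines can be written to a file ending in .map inside DBDIR/library.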
The third column is optional, and is used by KrakenHLL only when --taxids-for-genomes is specified for krakenhll-build, to add new nodes to the taxonomy tree for each genome. If you'd like to have the sequence identifiers in the taxonomy report, too, specify --taxids-for-sequences for krakenhll-build as well.
Finally, run krakenhll-build:
krakenhll-build --db DBDIR --taxids-for-genomes --taxids-for-sequences
Note that for custom databases with fewer sequences you might want to choose a smaller k (default: --kmer-len 31) and minimizer length (default: --minimizer-len 15).
When using custom taxonomies, please provide DBDIR/taxonomy/nodes.dmp and DBDIR/taxonomy/names.dmp in the format of the NCBI taxonomy dumps.
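For checking a hand-written custom taxonomy, the Python sketch below reads lines in the NCBI dump layout, where fields are separated by a tab, a vertical bar and a tab, and each line ends with a tab and a vertical bar. It reads only the first three columns of nodes.dmp (taxonomy ID, parent ID, rank) and the scientific names from names.dmp. Which columns KrakenHLL itself requires is not stated in the text, so treat that selection as an assumption. The function names and the abbreviated sample lines are hypothetical.

```python
def split_dmp_line(line):
    """Split one NCBI taxonomy dump line into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the line terminator
    return line.split("\t|\t")


def read_nodes(lines):
    """Map each taxonomy ID to (parent ID, rank) from nodes.dmp lines."""
    nodes = {}
    for line in lines:
        fields = split_dmp_line(line)
        nodes[int(fields[0])] = (int(fields[1]), fields[2])
    return nodes


def read_names(lines):
    """Map each taxonomy ID to its scientific name from names.dmp lines."""
    names = {}
    for line in lines:
        taxid, name, _unique_name, name_class = split_dmp_line(line)[:4]
        if name_class == "scientific name":
            names[int(taxid)] = name
    return names


# Abbreviated sample lines; real NCBI nodes.dmp lines carry more columns.
nodes_dmp = [
    "1\t|\t1\t|\tno rank\t|\n",
    "10239\t|\t1\t|\tsuperkingdom\t|\n",
]
names_dmp = [
    "1\t|\troot\t|\t\t|\tscientific name\t|\n",
    "10239\t|\tViruses\t|\t\t|\tscientific name\t|\n",
    "10239\t|\tVira\t|\t\t|\tsynonym\t|\n",
]
nodes = read_nodes(nodes_dmp)
names = read_names(names_dmp)

# Report taxa whose parent is missing from nodes.dmp.
orphans = [taxid for taxid, (parent, _rank) in nodes.items() if parent not in nodes]
print(nodes, names, orphans)
```

A quick consistency check like the orphan list can catch a node whose parent ID was never defined before the files are handed to the database build.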