Kmer-db

Kmer-db is a fast and memory-efficient tool for estimating evolutionary distances.

Installation
Usage
Examples
Datasets

1. Installation

Kmer-db comes with a set of precompiled binaries for Windows and Linux. The software can also be built from the sources, which are distributed as:

MAKE project (G++ 4.8.5 tested) for Linux and OS X,
Visual Studio 2015 solution for Windows.

zlib linking

Kmer-db uses zlib for handling gzipped inputs. Under Linux, the software is by default linked against the system-installed zlib. Due to issues with some library versions, a precompiled zlib is also present in the repository. To use it, modify the INTERNAL_ZLIB variable at the top of the makefile. Under Windows, the repository library is always used.

AVX and AVX2 support

Kmer-db by default takes advantage of AVX (required) and AVX2 (optional) CPU extensions. The pre-built binary determines the supported instructions at runtime, so the same binary works on CPUs with and without AVX2. However, one may encounter a problem when building Kmer-db on a CPU without AVX2. To prevent the use of AVX2, the program must be compiled with the NO_AVX2 symbolic constant defined. When building under Linux or OS X, the NO_AVX2 switch at the top of the makefile does this.
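Before building from source, it can be useful to check which of these extensions the build machine reports. The following is a minimal sketch, not part of Kmer-db, assuming a Linux system where CPU features are listed on the `flags` lines of /proc/cpuinfo; the function name `avx_status` is hypothetical.

```python
from pathlib import Path


def avx_status(cpuinfo_text):
    """Return (has_avx, has_avx2) from /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # Feature flags are listed on lines whose key is "flags".
        if key.strip() == "flags":
            flags.update(value.split())
    return "avx" in flags, "avx2" in flags


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")  # assumption: Linux only
    if cpuinfo.exists():
        has_avx, has_avx2 = avx_status(cpuinfo.read_text())
        print("AVX:", has_avx, "AVX2:", has_avx2)
```

If AVX2 is reported as missing, the NO_AVX2 switch described above is the relevant setting.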

2. Usage

kmer-db <mode> [options] <positional arguments>

Kmer-db operates in one of the following modes:

build - building a database from samples,
all2all - counting common k-mers - all samples in the database,
new2all - counting common k-mers - set of new samples versus database,
one2all - counting common k-mers - single sample versus database,
distance - calculating similarities/distances,
minhash - storing minhashed k-mers.

Common options:

-t <threads> - number of threads (default: number of available cores).

The meaning of other options and positional arguments depends on the selected mode.

2.1. Building a database

Construction of a k-mer database is an obligatory step for further analyses. The procedure accepts several input types:

compressed or uncompressed genomes:

kmer-db build [-k <kmer-length>] [-f <fraction>] [-multisample-fasta] <sample_list> <database>

KMC-generated k-mers:

kmer-db build -from-kmers [-f <fraction>] <sample_list> <database>

minhashed k-mers produced by minhash mode:

kmer-db build -from-minhash <sample_list> <database>

Parameters:

sample_list (input) - file containing list of samples in the following format:
```
sample1
sample2
sample3
...
```
By default, the tool requires compressed (.gz/.fna.gz/.fasta.gz) or uncompressed (.fna/.fasta) genome files for each sample (extensions are added automatically). When the -from-kmers switch is specified, the corresponding KMC-generated k-mer files (.kmc_pre and .kmc_suf) are required. If the -from-minhash switch is present, minhashed k-mer files (.minhash) must be generated by the minhash command prior to the database construction. Note that minhashing may also be done during the database construction by specifying the -f option.
database (output) - file with generated k-mer database.
-k <kmer-length> - length of k-mers (default: 18); ignored when -from-kmers or -from-minhash switch is specified.
-f <fraction> - fraction of all k-mers to be accepted by the minhash filter during database construction (default: 1); ignored when -from-minhash switch is present.
-multisample-fasta - each sequence in a genome FASTA file is treated as a separate sample.
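Because extensions are added automatically, the sample list holds names without them. The sketch below builds such a list from a directory of genome files. It is an illustration only: the helper names are hypothetical, and it assumes that bare file names are sufficient, that is, that kmer-db is run from the directory holding the genomes.

```python
from pathlib import Path

# Extensions named in the text above, longest first so that ".fna.gz"
# is stripped before the bare ".gz" is considered.
GENOME_EXTENSIONS = (".fasta.gz", ".fna.gz", ".fasta", ".fna", ".gz")


def sample_name(filename):
    """Strip a recognised genome extension; return None for other files."""
    for ext in GENOME_EXTENSIONS:
        if filename.endswith(ext) and len(filename) > len(ext):
            return filename[: -len(ext)]
    return None


def write_sample_list(genome_dir, list_path):
    """Write one extension-less sample name per line and return the names."""
    candidates = (sample_name(p.name) for p in Path(genome_dir).iterdir())
    # A set removes duplicates, e.g. when both x.fasta and x.gz exist.
    names = sorted({name for name in candidates if name})
    Path(list_path).write_text("\n".join(names) + "\n")
    return names
```

For example, `write_sample_list("genomes", "pathogens.list")` would produce a file that can be passed as `<sample_list>` to the build mode.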

2.2. Counting common k-mers

Counting common k-mers for all the samples in the database:

kmer-db all2all [-buffer <size_mb>] <database> <common_table>

Parameters:

database (input) - k-mer database file created by build mode,
common_table (output) - file containing table with common k-mer counts.
-buffer <size_mb> - size of the cache buffer in megabytes (default: 8); to maximize performance, use the L3 cache size for Intel CPUs and the L2 cache size for AMD CPUs.

Counting common k-mers between a set of new samples and all the samples in the database:

kmer-db new2all [-multisample-fasta | -from-kmers | -from-minhash] <database> <sample_list> <common_table>

Parameters:

database (input) - k-mer database file created by build mode.
sample_list (input) - file containing list of samples in one of the supported formats (see build mode); if samples are given as genomes (default) or k-mers (-from-kmers switch), the minhashing is done automatically with the same filter as in the database.
common_table (output) - file containing table with common k-mer counts.
-multisample-fasta / -from-kmers / -from-minhash - see build mode for details.

Counting common k-mers between a single sample file and all the samples in the database:

kmer-db one2all [-multisample-fasta|-from-kmers|-from-minhash] <database> <sample> <common_table>

The meaning of the parameters is the same as in new2all mode, but instead of a file with a sample list, a single sample file is used as the query.

Output format

Modes all2all, new2all, and one2all produce a comma-separated table with the numbers of common k-mers. The table has the following form:


| kmer-length: k fraction: f | db-samples | s₁ | s₂ | ... | s_n |
| --- | --- | --- | --- | --- | --- |
| query-samples | total-kmers | \|s₁\| | \|s₂\| | ... | \|s_n\| |
| q₁ | \|q₁\| | \|q₁ ∩ s₁\| | \|q₁ ∩ s₂\| | ... | \|q₁ ∩ s_n\| |
| q₂ | \|q₂\| | \|q₂ ∩ s₁\| | \|q₂ ∩ s₂\| | ... | \|q₂ ∩ s_n\| |
| ... | ... | ... | ... | ... | ... |
| q_m | \|q_m\| | \|q_m ∩ s₁\| | \|q_m ∩ s₂\| | ... | \|q_m ∩ s_n\| |

where:

k - k-mer length,
f - minhash fraction (1, when minhashing is disabled),
s₁, s₂, ..., s_n - database sample names,
q₁, q₂, ..., q_m - query sample names,
|a| - number of k-mers in sample a,
|a ∩ b| - number of k-mers common for samples a and b.

For performance reasons, all2all mode produces a lower triangular matrix.
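For downstream processing, the table can be read with a few lines of Python. This is a sketch based only on the layout described above; the function name `parse_common_table` is hypothetical, and it assumes that the first header cell reads `kmer-length: <k> fraction: <f>` and that cells missing from the triangular all2all output are either absent or empty.

```python
import csv
import io


def parse_common_table(text):
    """Parse the table into (k, fraction, db_sizes, query_sizes, common)."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    header, totals, body = rows[0], rows[1], rows[2:]

    # First cell is assumed to look like "kmer-length: 20 fraction: 0.1".
    fields = header[0].replace(":", " ").split()
    k, fraction = int(fields[1]), float(fields[3])

    # Skip empty names that a trailing comma would produce.
    db_names = [name for name in header[2:] if name]
    db_sizes = {name: int(size) for name, size in zip(db_names, totals[2:])}

    query_sizes, common = {}, {}
    for row in body:
        query, total, counts = row[0], int(row[1]), row[2:]
        query_sizes[query] = total
        # Cells may be absent or empty in the triangular all2all output.
        common[query] = {
            name: int(cell)
            for name, cell in zip(db_names, counts)
            if cell.strip()
        }
    return k, fraction, db_sizes, query_sizes, common
```

The returned dictionaries map sample names to |s|, |q|, and |q ∩ s|, matching the symbols in the legend above.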

2.3. Calculating similarities or distances

kmer-db distance [<measures>] <common_table>

Parameters:

common_table (input) - file containing table with numbers of common k-mers produced by all2all, new2all, or one2all mode.
measures - names of the similarity/distance measures to be calculated; one or more of the following: jaccard, min, max, cosine, mash. If no measure is specified, jaccard is used.
-phylip-out - store the output distance matrix in Phylip format.

This mode generates one file with a similarity/distance table for each selected measure. The name of each output file is the name of the input file with the measure name appended as an extension.
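To make the measure names concrete, the sketch below computes them from the three numbers the common table provides for a pair of samples: |a|, |b|, and |a ∩ b|. It uses the textbook definitions of these measures, which are assumed rather than confirmed to match the tool's implementation; the function name is hypothetical, and both samples are assumed to be non-empty.

```python
import math


def similarity_measures(size_a, size_b, common, k):
    """Textbook measures from k-mer set sizes |a|, |b| and |a ∩ b|."""
    union = size_a + size_b - common
    jaccard = common / union if union else 0.0
    measures = {
        "jaccard": jaccard,
        "min": common / min(size_a, size_b),
        "max": common / max(size_a, size_b),
        "cosine": common / math.sqrt(size_a * size_b),
    }
    # Mash distance formula: -ln(2j / (1 + j)) / k, written as
    # ln((1 + j) / (2j)) / k; it is infinite when nothing is shared.
    if jaccard > 0:
        measures["mash"] = math.log((1 + jaccard) / (2 * jaccard)) / k
    else:
        measures["mash"] = math.inf
    return measures
```

For two samples with 100 and 50 k-mers sharing 25 of them, the Jaccard index under this definition is 25 / (100 + 50 - 25) = 0.2.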

2.4. Storing minhashed k-mers

This is an optional analysis step which stores minhashed k-mers on the hard disk to be later consumed by build, new2all, or one2all modes with -from-minhash switch. It can be skipped if one wants to use all k-mers from samples for distance estimation or employs minhashing during database construction. Syntax:

kmer-db minhash [-k <kmer-length>] [-multisample-fasta] <fraction> <sample_list>

kmer-db minhash -from-kmers <fraction> <sample_list>

Parameters:

fraction (input) - fraction of all k-mers to be accepted by the minhash filter.
sample_list (input) - file containing list of samples in one of the supported formats (see build mode).
-k <kmer-length> - length of k-mers (default: 18); ignored when -from-kmers switch is specified.
-multisample-fasta / -from-kmers - see build mode for details.

For each sample from the list, a binary file with .minhash extension containing filtered k-mers is created.
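The idea of a fraction-based minhash filter can be illustrated as follows: every k-mer is hashed, and only k-mers whose hash falls into the lowest `fraction` of the hash range are kept. This is a conceptual sketch, not Kmer-db's implementation: the hash function (BLAKE2b from Python's hashlib), the string representation of k-mers, and the function names are all assumptions, and reverse complements are ignored.

```python
import hashlib


def kmers(sequence, k=18):
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i : i + k] for i in range(len(sequence) - k + 1)}


def hash64(kmer):
    """Deterministic 64-bit hash of a k-mer (stand-in for the tool's hash)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_filter(kmer_set, fraction):
    """Keep k-mers whose hash lies in the lowest `fraction` of the 64-bit range."""
    threshold = fraction * 2**64
    return {kmer for kmer in kmer_set if hash64(kmer) < threshold}
```

Because the same deterministic filter is applied to every sample, a k-mer shared by two samples is either kept in both or dropped from both, so filtering each sample separately and then intersecting gives the same result as filtering the intersection.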

3. Examples

Let pathogens.list be a file containing the names of samples (a .gz or .fasta genome file exists for each sample):

acinetobacter
klebsiella
e.coli
...

Calculating similarities/distances between all samples listed in pathogens.list using all 20-mers:

kmer-db build -k 20 pathogens.list pathogens.db
kmer-db all2all pathogens.db matrix.csv
kmer-db distance matrix.csv

Same as above, but using only 10% of 20-mers:

kmer-db build -k 20 -f 0.1 pathogens.list pathogens.db
kmer-db all2all pathogens.db matrix.csv
kmer-db distance matrix.csv

Calculating similarities/distances between samples listed in pathogens.list and salmonella using all 20-mers:

kmer-db build -k 20 pathogens.list pathogens.db
kmer-db one2all pathogens.db salmonella vector.csv
kmer-db distance vector.csv

Same as above, but using only 10% of 20-mers:

kmer-db build -k 20 -f 0.1 pathogens.list pathogens.db
kmer-db one2all pathogens.db salmonella vector.csv
kmer-db distance vector.csv
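The all-versus-all command sequences above can also be driven from a script. The sketch below assembles the three commands using only the options documented in the Usage section and runs them in order; the function names are hypothetical, and the kmer-db executable is assumed to be on the PATH.

```python
import subprocess


def pipeline_commands(sample_list, database, table, k=18, fraction=1.0, measures=()):
    """Return the build, all2all and distance commands as argument lists."""
    build = ["kmer-db", "build", "-k", str(k)]
    if fraction < 1.0:
        # -f enables minhashing during database construction.
        build += ["-f", str(fraction)]
    build += [sample_list, database]
    return [
        build,
        ["kmer-db", "all2all", database, table],
        ["kmer-db", "distance", *measures, table],
    ]


def run_pipeline(*args, **kwargs):
    """Run each step in order, stopping at the first failure."""
    for command in pipeline_commands(*args, **kwargs):
        subprocess.run(command, check=True)
```

Calling `run_pipeline("pathogens.list", "pathogens.db", "matrix.csv", k=20, fraction=0.1)` corresponds to the 10% example above.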

4. Datasets

The list of pathogens investigated in the Kmer-db study can be found here

Citing

Deorowicz, S., Gudyś, A., Długosz, M., Kokot, M., Danek, A. (2018) Kmer-db: instant evolutionary distance estimation, Bioinformatics
