Primary language: C++. License: GNU General Public License v3.0 (GPL-3.0).

Kmer-db

Kmer-db is a fast and memory-efficient tool for estimating evolutionary distances.

Table of contents

  1. Installation
  2. Usage
    1. Building a database
    2. Counting common k-mers
    3. Calculating similarities or distances
    4. Storing minhashed k-mers
  3. Examples
  4. Datasets

1. Installation

Kmer-db comes with a set of precompiled binaries for Windows and Linux. The software can also be built from the sources, which are distributed as:

  • MAKE project (G++ 4.8.5 tested) for Linux and OS X,
  • Visual Studio 2015 solution for Windows.

zlib linking

Kmer-db uses zlib for handling gzipped inputs. Under Linux, the software is by default linked against the system-installed zlib. Due to issues with some library versions, a precompiled zlib is also present in the repository. In order to use it, one needs to modify the INTERNAL_ZLIB variable at the top of the makefile. Under Windows, the repository library is always used.

AVX and AVX2 support

Kmer-db by default takes advantage of AVX (required) and AVX2 (optional) CPU extensions. The pre-built binary determines the supported instructions at runtime, so the same binary runs on CPUs with and without AVX2. However, one may encounter a problem when building Kmer-db on a CPU without AVX2. To prevent AVX2 from being used, the program must be compiled with the NO_AVX2 symbolic constant defined. When building under Linux or OS X, there is a NO_AVX2 switch at the top of the makefile which does the job.

2. Usage

kmer-db <mode> [options] <positional arguments>

Kmer-db operates in one of the following modes:

  • build - building a database from samples,
  • all2all - counting common k-mers - all samples in the database,
  • new2all - counting common k-mers - set of new samples versus database,
  • one2all - counting common k-mers - single sample versus database,
  • distance - calculating similarities/distances,
  • minhash - storing minhashed k-mers.

Common options:

  • -t <threads> - number of threads (default: number of available cores).

The meaning of other options and positional arguments depends on the selected mode.

2.1. Building a database

Constructing a k-mer database is an obligatory step for all further analyses. The procedure accepts several input types:

  • compressed or uncompressed genomes:

    kmer-db build [-k <kmer-length>] [-f <fraction>] [-multisample-fasta] <sample_list> <database>

  • KMC-generated k-mers:

    kmer-db build -from-kmers [-f <fraction>] <sample_list> <database>

  • minhashed k-mers produced by minhash mode:

    kmer-db build -from-minhash <sample_list> <database>

Parameters:

  • sample_list (input) - file containing list of samples in the following format:
    sample1
    sample2
    sample3
    ...
    
    By default, the tool requires compressed (.gz/.fna.gz/.fasta.gz) or uncompressed (.fna/.fasta) genome files for each sample (extensions are added automatically). When the -from-kmers switch is specified, the corresponding KMC-generated k-mer files (.kmc_pre and .kmc_suf) are required. If the -from-minhash switch is present, minhashed k-mer files (.minhash) must be generated by the minhash command prior to the database construction. Note that minhashing may also be done during the database construction by specifying the -f option.
  • database (output) - file with generated k-mer database.
  • -k <kmer-length> - length of k-mers (default: 18); ignored when -from-kmers or -from-minhash switch is specified.
  • -f <fraction> - fraction of all k-mers to be accepted by the minhash filter during database construction (default: 1); ignored when -from-minhash switch is present.
  • -multisample-fasta - each sequence in a genome FASTA file is treated as a separate sample.
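
Because genome file extensions are appended automatically, it can help to check before a build that every listed sample resolves to a file. The following is a minimal pre-flight sketch in Python and is not part of Kmer-db. The helper names find_genome_file and missing_samples are hypothetical, and the order in which the extensions are tried is an assumption made for illustration.

```python
from pathlib import Path

# Genome extensions named in this README; the lookup order is an assumption.
GENOME_EXTENSIONS = (".gz", ".fna.gz", ".fasta.gz", ".fna", ".fasta")


def find_genome_file(sample, directory="."):
    """Return the first existing genome file for `sample`, or None."""
    for ext in GENOME_EXTENSIONS:
        # Append the extension instead of replacing a suffix, so that
        # sample names containing dots (e.g. "e.coli") stay intact.
        candidate = Path(directory) / (sample + ext)
        if candidate.is_file():
            return candidate
    return None


def missing_samples(sample_list_path, directory="."):
    """Return the names from the sample list that have no genome file."""
    with open(sample_list_path) as handle:
        samples = [line.strip() for line in handle if line.strip()]
    return [s for s in samples if find_genome_file(s, directory) is None]
```

An empty result from missing_samples("pathogens.list") suggests that every sample has a matching genome file in the current directory.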

2.2. Counting common k-mers

Counting common k-mers for all the samples in the database:

kmer-db all2all [-buffer <size_mb>] <database> <common_table>

Parameters:

  • database (input) - k-mer database file created by build mode,
  • common_table (output) - file containing table with common k-mer counts.
  • -buffer <size_mb> - size of the cache buffer in megabytes; use L3 size for Intel CPUs and L2 for AMD to maximize performance (default: 8).

Counting common k-mers between a set of new samples and all the samples in the database:

kmer-db new2all [-multisample-fasta | -from-kmers | -from-minhash] <database> <sample_list> <common_table>

Parameters:

  • database (input) - k-mer database file created by build mode.
  • sample_list (input) - file containing list of samples in one of the supported formats (see build mode); if samples are given as genomes (default) or k-mers (-from-kmers switch), the minhashing is done automatically with the same filter as in the database.
  • common_table (output) - file containing table with common k-mer counts.
  • -multisample-fasta / -from-kmers / -from-minhash - see build mode for details.

Counting common k-mers between a single sample file and all the samples in the database:

kmer-db one2all [-multisample-fasta|-from-kmers|-from-minhash] <database> <sample> <common_table>

The meaning of the parameters is the same as in new2all mode, but instead of a file with a sample list, a single sample file is used as the query.

Output format

Modes all2all, new2all, and one2all produce a comma-separated table with the numbers of common k-mers. The table has the following form:

kmer-length: k fraction: f db-samples s1 s2 ... sn
query-samples total-kmers |s1| |s2| ... |sn|
q1 |q1| |q1 ∩ s1| |q1 ∩ s2| ... |q1 ∩ sn|
q2 |q2| |q2 ∩ s1| |q2 ∩ s2| ... |q2 ∩ sn|
... ... ... ... ... ...
qm |qm| |qm ∩ s1| |qm ∩ s2| ... |qm ∩ sn|

where:

  • k - k-mer length,
  • f - minhash fraction (1, when minhashing is disabled),
  • s1, s2, ..., sn - database sample names,
  • q1, q2, ..., qm - query sample names,
  • |a| - number of k-mers in sample a,
  • |a ∩ b| - number of k-mers common to samples a and b.
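
To make the notation concrete, here is a small Python sketch that computes |a|, |b|, and |a ∩ b| for two toy sequences. It treats a sample as the set of distinct length-k substrings on the forward strand only. How Kmer-db handles reverse complements or repeated k-mers is not described here, so both simplifications are assumptions.

```python
def kmer_set(sequence, k):
    """Return the set of distinct k-mers (forward strand only)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


a = kmer_set("ACGTACGT", 3)  # ACG, CGT, GTA, TAC
b = kmer_set("CGTACC", 3)    # CGT, GTA, TAC, ACC

size_a = len(a)              # |a|
size_b = len(b)              # |b|
common = len(a & b)          # |a ∩ b|
```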

For performance reasons, all2all mode produces a lower triangular matrix.
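
If a full symmetric matrix is needed downstream, the lower triangle can be mirrored. Below is a minimal Python sketch. It assumes the counts have already been read into ragged rows, where row i holds the values for columns 0, 1, ... in order and has at most i + 1 entries. The function name mirror_lower_triangle is hypothetical, and cells missing from the input are left as None.

```python
def mirror_lower_triangle(rows):
    """Expand ragged lower-triangular rows into a full symmetric matrix."""
    n = len(rows)
    full = [[None] * n for _ in range(n)]
    for i, row in enumerate(rows):
        # Only columns 0..i belong to the lower triangle of row i.
        for j, value in enumerate(row[:i + 1]):
            full[i][j] = value
            full[j][i] = value
    return full
```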

2.3. Calculating similarities or distances

kmer-db distance [<measures>] <common_table>

Parameters:

  • common_table (input) - file containing table with numbers of common k-mers produced by all2all, new2all, or one2all mode.
  • measures - names of the similarity/distance measures to be calculated, can be one or several of the following: jaccard, min, max, cosine, mash. If measures are not specified, jaccard is used by default.
  • -phylip-out - store output distance matrix in a Phylip format.

This mode generates a file with a similarity/distance table for each selected measure. The output file name is formed by appending the measure name as an extension to the input file name.
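
The measure names are not defined in this document. As a rough guide, the Python sketch below computes them from |a|, |b|, and |a ∩ b| using textbook definitions: the Jaccard index, the intersection divided by the smaller or larger sample size, the cosine similarity of sets, and the Mash distance formula. Whether Kmer-db's implementation matches these formulas exactly is an assumption, and the function name similarity_measures is hypothetical.

```python
import math


def similarity_measures(size_a, size_b, common, k):
    """Compute set-based measures from k-mer counts (non-empty samples)."""
    union = size_a + size_b - common
    jaccard = common / union
    if jaccard > 0:
        # Mash distance: -(1/k) * ln(2j / (1 + j))
        mash = -math.log(2 * jaccard / (1 + jaccard)) / k
    else:
        # The formula diverges when no k-mers are shared.
        mash = math.inf
    return {
        "jaccard": jaccard,
        "min": common / min(size_a, size_b),
        "max": common / max(size_a, size_b),
        "cosine": common / math.sqrt(size_a * size_b),
        "mash": mash,
    }
```

All of these except mash are similarities (higher means closer). The Mash formula yields a distance, which is zero for identical k-mer sets.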

2.4. Storing minhashed k-mers

This is an optional analysis step which stores minhashed k-mers on the hard disk to be later consumed by build, new2all, or one2all modes with -from-minhash switch. It can be skipped if one wants to use all k-mers from samples for distance estimation or employs minhashing during database construction. Syntax:

kmer-db minhash [-k <kmer-length>] [-multisample-fasta] <fraction> <sample_list>

kmer-db minhash -from-kmers <fraction> <sample_list>

Parameters:

  • fraction (input) - fraction of all k-mers to be accepted by the minhash filter.
  • sample_list (input) - file containing list of samples in one of the supported formats (see build mode).
  • -k <kmer-length> - length of k-mers (default: 18); ignored when -from-kmers switch is specified.
  • -multisample-fasta / -from-kmers - see build mode for details.

For each sample from the list, a binary file with .minhash extension containing filtered k-mers is created.
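
The idea of a fraction-based filter can be sketched in Python as follows: hash each k-mer to a fixed range and keep it only if the hash falls below fraction times the range size. This illustrates the general technique and is not Kmer-db's implementation. The hash function (BLAKE2b truncated to 64 bits) and the names accepts and minhash_filter are assumptions.

```python
import hashlib


def accepts(kmer, fraction):
    """Accept `kmer` when its 64-bit hash is below fraction * 2**64."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    value = int.from_bytes(digest, "big")
    return value < fraction * 2**64


def minhash_filter(kmers, fraction):
    """Keep roughly `fraction` of the distinct k-mers."""
    return {kmer for kmer in kmers if accepts(kmer, fraction)}
```

Because the decision depends only on the k-mer itself, a k-mer kept in one sample is kept in every other sample filtered with the same fraction, so common k-mers remain comparable after filtering.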

3. Examples

Let pathogens.list be the file containing names of samples (a .gz or .fasta genome file exists for each sample):

acinetobacter
klebsiella
e.coli
...

Calculating similarities/distances between all samples listed in pathogens.list using all 20-mers:

kmer-db build -k 20 pathogens.list pathogens.db
kmer-db all2all pathogens.db matrix.csv
kmer-db distance matrix.csv

Same as above, but using only 10% of 20-mers:

kmer-db build -k 20 -f 0.1 pathogens.list pathogens.db
kmer-db all2all pathogens.db matrix.csv
kmer-db distance matrix.csv

Calculating similarities/distances between samples listed in pathogens.list and salmonella using all 20-mers:

kmer-db build -k 20 pathogens.list pathogens.db
kmer-db one2all pathogens.db salmonella vector.csv
kmer-db distance vector.csv

Same as above, but using only 10% of 20-mers:

kmer-db build -k 20 -f 0.1 pathogens.list pathogens.db
kmer-db one2all pathogens.db salmonella vector.csv
kmer-db distance vector.csv
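
For scripted runs, the all-versus-all workflow above can be assembled programmatically. The Python sketch below only builds the argument lists from the flags documented in this README. The function names pipeline_commands and run_pipeline are hypothetical, and running the commands assumes a kmer-db executable is available on the PATH.

```python
import subprocess


def pipeline_commands(sample_list, database, table, k=None, fraction=None):
    """Build argument lists for the build, all2all and distance steps."""
    build = ["kmer-db", "build"]
    if k is not None:
        build += ["-k", str(k)]
    if fraction is not None:
        build += ["-f", str(fraction)]
    build += [sample_list, database]
    return [
        build,
        ["kmer-db", "all2all", database, table],
        ["kmer-db", "distance", table],
    ]


def run_pipeline(commands):
    """Run each step in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

For example, pipeline_commands("pathogens.list", "pathogens.db", "matrix.csv", k=20, fraction=0.1) reproduces the three commands of the 10% example for all samples.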

4. Datasets

The list of the pathogens investigated in the Kmer-db study can be found here

Citing

Deorowicz, S., Gudyś, A., Długosz, M., Kokot, M., Danek, A. (2018) Kmer-db: instant evolutionary distance estimation, Bioinformatics