Kmer-db is a fast and memory-efficient tool for large-scale k-mer analyses (indexing, querying, estimating evolutionary relationships, etc.).

Primary language: C++. License: GNU General Public License v3.0 (GPL-3.0).

Kmer-db

Quick start

git clone --recurse-submodules https://github.com/refresh-bio/kmer-db
cd kmer-db && make

INPUT=./test/virus
OUTPUT=./output
mkdir -p $OUTPUT

# build a database from all 18-mers (default) contained in a set of sequences
./kmer-db build $INPUT/seqs.part1.list $OUTPUT/k18.db

# establish numbers of common k-mers between new sequences and the database
./kmer-db new2all $OUTPUT/k18.db $INPUT/seqs.part2.list $OUTPUT/n2a.csv

# calculate jaccard index from common k-mers
./kmer-db distance jaccard $OUTPUT/n2a.csv $OUTPUT/n2a.jaccard

# extend the database with new sequences
./kmer-db build -extend $INPUT/seqs.part2.list $OUTPUT/k18.db

# establish numbers of common k-mers between all sequences in the database
./kmer-db all2all $OUTPUT/k18.db $OUTPUT/a2a.csv

# build a database from 10% of 25-mers using 16 threads
./kmer-db build -k 25 -f 0.1 -t 16 $INPUT/seqs.part1.list $OUTPUT/k25.db

# establish number of common 25-mers between a single sequence and the database
# (minhash filtering that retains 10% of MT159713 k-mers is done automatically prior to the comparison)
./kmer-db one2all $OUTPUT/k25.db $INPUT/data/MT159713.fasta $OUTPUT/MT159713.csv

# build two partial databases
./kmer-db build $INPUT/seqs.part1.list  $OUTPUT/k18.parts1.db
./kmer-db build $INPUT/seqs.part2.list  $OUTPUT/k18.parts2.db

# establish numbers of common k-mers between all sequences in the databases,
# computations are done in the sparse mode, the output matrix is also sparse
echo $OUTPUT/k18.parts1.db > $OUTPUT/db.list
echo $OUTPUT/k18.parts2.db >> $OUTPUT/db.list
./kmer-db all2all-parts $OUTPUT/db.list $OUTPUT/k18.parts.csv

Table of contents

  1. Installation
  2. Usage
    1. Building a database
    2. Counting common k-mers
    3. Calculating similarities or distances
    4. Storing minhashed k-mers
  3. Datasets

1. Installation

Kmer-db comes with a set of precompiled binaries for Linux, macOS, and Windows. The software is also available on Bioconda:

conda install -c bioconda kmer-db

For detailed instructions on how to set up Bioconda, please refer to the Bioconda manual. Kmer-db can also be built from the sources distributed as:

  • MAKE project (C++20-compatible compiler required, e.g., g++-11) for Linux and macOS,
  • Visual Studio 2022 solution for Windows.

Vector extensions

Kmer-db can be built for x86-64 and ARM64 8 architectures (including Apple Mx chips based on the ARM64 8.4 core) and takes advantage of AVX2 (x86-64) and NEON (ARM) CPU extensions. The default target platform is x86-64 with AVX2 extensions. This can be changed by setting the PLATFORM variable for make:

make PLATFORM=none    # unspecified platform, no extensions
make PLATFORM=sse2    # x86-64 with SSE2 
make PLATFORM=avx     # x86-64 with AVX 
make PLATFORM=avx2    # x86-64 with AVX2 (default)
make PLATFORM=native  # x86-64 with AVX2 and native architecture
make PLATFORM=arm8    # ARM64 8 with NEON  
make PLATFORM=m1      # ARM64 8.4 (especially Apple M1) with NEON 

Note that x86-64 binaries determine the supported extensions at runtime, which makes them backwards-compatible. For instance, the AVX executable will also work on an SSE-only platform, but with limited performance.

2. Usage

kmer-db <mode> [options] <positional arguments>

Kmer-db operates in one of the following modes:

  • build - building a database from samples,
  • all2all - counting common k-mers - all samples in the database,
  • all2all-sp - counting common k-mers - all samples in the database (sparse computation),
  • all2all-parts - counting common k-mers - all samples in the database parts (sparse computation),
  • new2all - counting common k-mers - set of new samples versus database,
  • one2all - counting common k-mers - single sample versus database,
  • distance - calculating similarities/distances,
  • minhash - storing minhashed k-mers.

Common options:

  • -t <threads> - number of threads (default: number of available cores).

The meaning of other options and positional arguments depends on the selected mode.

2.1. Building a database

Construction of a k-mer database is an obligatory step for further analyses. The procedure accepts several input types:

  • compressed or uncompressed genomes/reads:

    kmer-db build [-k <kmer-length>] [-f <fraction>] [-multisample-fasta] [-extend] [-t <threads>] <sample_list> <database>

  • KMC-generated k-mers:

    kmer-db build -from-kmers [-f <fraction>] [-extend] [-t <threads>] <sample_list> <database>

  • minhashed k-mers produced by minhash mode:

    kmer-db build -from-minhash [-extend] [-t <threads>] <sample_list> <database>

Parameters:

  • sample_list (input) - file containing list of samples in the following format:
    sample_file_1
    sample_file_2
    sample_file_3
    ...
    
    By default, the tool requires uncompressed or compressed FASTA files for each sample. If a file on the list cannot be found, the package tries adding the following extensions: fna, fasta, gz, fna.gz, fasta.gz. When the -from-kmers switch is specified, the corresponding KMC-generated k-mer files (.kmc_pre and .kmc_suf) are required. If the -from-minhash switch is present, minhashed k-mer files (.minhash) must be generated by the minhash command prior to the database construction. Note that minhashing may also be done during the database construction by specifying the -f option.
  • database (output) - file with generated k-mer database,
  • -k <kmer-length> - length of k-mers (default: 18); ignored when -from-kmers or -from-minhash switch is specified,
  • -f <fraction> - fraction of all k-mers to be accepted by the minhash filter during database construction (default: 1); ignored when -from-minhash switch is present,
  • -multisample-fasta - each sequence in a FASTA file is treated as a separate sample,
  • -extend - extend the existing database with new samples,
  • -t <threads> - number of threads (default: number of available cores).
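
As described above, a sample list is a plain text file with one sample path per line. A minimal Python sketch for generating such a file from a directory is shown below. The helper name `write_sample_list`, the suffix tuple, and the directory layout are illustrative assumptions and are not part of Kmer-db.

```python
from pathlib import Path

def write_sample_list(sample_dir, list_path,
                      suffixes=(".fasta", ".fna", ".fasta.gz", ".fna.gz")):
    """Write one sample path per line (hypothetical helper, not part of Kmer-db)."""
    # Collect regular files whose names end with one of the assumed suffixes;
    # sorting makes the sample order reproducible between runs.
    samples = sorted(
        p for p in Path(sample_dir).iterdir()
        if p.is_file() and p.name.endswith(suffixes)
    )
    Path(list_path).write_text("".join(f"{p}\n" for p in samples))
    return [str(p) for p in samples]
```

The resulting file can then be passed as <sample_list> to the build mode.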

2.2. Counting common k-mers

Samples in the database against each other:

Dense computations - recommended when the distance matrix contains few zeros. Output can be stored in the dense or sparse form (-sparse switch):

kmer-db all2all [-buffer <size_mb>] [-t <threads>] [-sparse [-min [<criterion>:]<value>]* [-max [<criterion>:]<value>]* ] <database> <common_table>

Sparse computations - recommended when the distance matrix contains many zeros. Output matrix is always in the sparse form:

kmer-db all2all-sp [-buffer <size_mb>] [-t <threads>] [-min [<criterion>:]<value>]* [-max [<criterion>:]<value>]* [-sample-rows [<criterion>:]<count>] <database> <common_table>

Sparse computations, partial databases - use when the distance matrix contains many zeros and there are multiple partial databases. Output matrix is always in the sparse form:

kmer-db all2all-parts [-buffer <size_mb>] [-t <threads>] [-min [<criterion>:]<value>]* [-max [<criterion>:]<value>]* [-sample-rows [<criterion>:]<count>] <db_list> <common_table>

Parameters:

  • database (input) - k-mer database file created by build mode,
  • db_list (input) - file containing list of database files created by build mode,
  • common_table (output) - file containing table with common k-mer counts,
  • -buffer <size_mb> - size of cache buffer in megabytes; use L3 size for Intel CPUs and L2 for AMD for best performance; default: 8,
  • -t <threads> - number of threads (default: number of available cores),
  • -sparse - stores output matrix in a sparse form (always on in all2all-sp and all2all-parts modes),
  • -min [<criterion>:]<value> - retains elements with criterion greater than or equal to value (see details below),
  • -max [<criterion>:]<value> - retains elements with criterion lower than or equal to value (see details below),
  • -sample-rows [<criterion>:]<count> - retains count elements in every row using one of the strategies: (i) random selection (no criterion); (ii) the best elements with respect to criterion.

criterion can be num-kmers (number of common k-mers) or one of the distance/similarity measures: jaccard, min, max, cosine, mash, ani, ani-shorter (see 2.3 for definitions). No criterion indicates num-kmers (filtering) or random elements selection (sampling). Multiple filters can be combined.
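
The filtering semantics can be illustrated with a short Python sketch. It assumes that combined filters act conjunctively, i.e., an element is retained only if it satisfies every -min and every -max filter; the text above does not state this explicitly. The function `passes_filters` is hypothetical and only mirrors the described thresholds.

```python
def passes_filters(values, min_filters=(), max_filters=()):
    """Decide whether one matrix element is retained.

    values      - dict mapping a criterion name (e.g. "num-kmers", "jaccard")
                  to its value for this element
    min_filters - iterable of (criterion, threshold) pairs, as in -min
    max_filters - iterable of (criterion, threshold) pairs, as in -max

    Assumption: all filters must hold at the same time.
    """
    return (all(values[c] >= t for c, t in min_filters)
            and all(values[c] <= t for c, t in max_filters))
```

For example, an element with 25 common k-mers and a Jaccard index of 0.4 passes `min_filters=[("num-kmers", 10), ("jaccard", 0.3)]` but is rejected once `max_filters=[("jaccard", 0.2)]` is added.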

New samples against the database:

kmer-db new2all [-multisample-fasta | -from-kmers | -from-minhash] [-t <threads>] [-sparse [-min [<criterion>:]<value>]* [-max [<criterion>:]<value>]* ] <database> <sample_list> <common_table>

Parameters:

  • database (input) - k-mer database file created by build mode,
  • sample_list (input) - file containing list of samples in one of the supported formats (see build mode); if samples are given as genomes (default) or k-mers (-from-kmers switch), the minhashing is done automatically with the same filter as in the database,
  • common_table (output) - file containing table with common k-mer counts,
  • -multisample-fasta / -from-kmers / -from-minhash - see build mode for details,
  • -t <threads> - number of threads (default: number of available cores),
  • -sparse - stores output matrix in a sparse form,
  • -min [<criterion>:]<value> - retains elements with criterion greater than or equal to value (see details below),
  • -max [<criterion>:]<value> - retains elements with criterion lower than or equal to value (see details below).

criterion can be num-kmers (number of common k-mers) or one of the distance/similarity measures: jaccard, min, max, cosine, mash, ani, ani-shorter (see 2.3 for definitions). No criterion indicates num-kmers. Multiple filters can be combined.

Single sample against the database:

kmer-db one2all [-from-kmers | -from-minhash] [-t <threads>] <database> <sample> <common_table>

The meaning of the parameters is the same as in new2all mode, but instead of a file with a sample list, a single sample file is used as a query.

Output format

Modes all2all, all2all-sp, all2all-parts, new2all, and one2all produce a comma-separated table with numbers of common k-mers. For all2all, new2all, and one2all modes, the table is by default stored in a dense form:

kmer-length: k fraction: f   db-samples    s1          s2          ...   sn
query-samples                total-kmers   |s1|        |s2|        ...   |sn|
q1                           |q1|          |q1 ∩ s1|   |q1 ∩ s2|   ...   |q1 ∩ sn|
q2                           |q2|          |q2 ∩ s1|   |q2 ∩ s2|   ...   |q2 ∩ sn|
...                          ...           ...         ...         ...   ...
qm                           |qm|          |qm ∩ s1|   |qm ∩ s2|   ...   |qm ∩ sn|

where:

  • k - k-mer length,
  • f - minhash fraction (1, when minhashing is disabled),
  • s1, s2, ..., sn - database sample names,
  • q1, q2, ..., qm - query sample names,
  • |a| - number of k-mers in sample a,
  • |a ∩ b| - number of k-mers common for samples a and b.
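
To make the dense layout concrete, here is a minimal Python sketch that reads such a table. It assumes that the first two cells of every row are labels (or the query name and its k-mer count) and that the remaining cells correspond to database samples; trailing empty cells are tolerated. The function `parse_dense_table` is hypothetical, and the exact cell boundaries of real output files should be checked against actual Kmer-db output.

```python
import csv
import io

def parse_dense_table(text):
    """Parse a dense common-k-mer table given as comma-separated text."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        cells = [c.strip() for c in row]
        while cells and cells[-1] == "":   # drop trailing empty cells
            cells.pop()
        if cells:
            rows.append(cells)
    # Row 0: two label cells, then database sample names s1..sn.
    db_names = rows[0][2:]
    # Row 1: two label cells, then k-mer counts |s1|..|sn|.
    db_totals = dict(zip(db_names, (int(x) for x in rows[1][2:])))
    # Remaining rows: query name, |q|, then |q ∩ s1|..|q ∩ sn|.
    queries = {}
    for name, total, *counts in rows[2:]:
        queries[name] = (int(total), dict(zip(db_names, map(int, counts))))
    return db_totals, queries
```

Each query maps to a pair: its own k-mer count and a dictionary of common k-mer counts keyed by database sample name.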

When the -sparse switch is specified or the all2all-sp or all2all-parts mode is used, the table is stored in a sparse form. In particular, zeros are omitted while non-zero elements are represented as pairs (column_id: value) with 1-based column indexing. Thus, rows may have different numbers of elements, e.g.:

kmer-length: k fraction: f   db-samples    s1                  s2                  ...                 sn
query-samples                total-kmers   |s1|                |s2|                ...                 |sn|
q1                           |q1|          i11: |q1 ∩ si11|    i12: |q1 ∩ si12|
q2                           |q2|          i21: |q2 ∩ si21|    i22: |q2 ∩ si22|    i23: |q2 ∩ si23|
q3                           |q3|
...                          ...           ...
qm                           |qm|          im1: |qm ∩ sim1|

For performance reasons, all2all, all2all-sp, and all2all-parts modes produce a lower triangular matrix.
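
A sparse row can be expanded back to a dense one as in the following Python sketch. It assumes that each non-zero element occupies one comma-separated cell of the form `column_id: value` and that values are integer k-mer counts; `parse_sparse_row` and `to_dense` are hypothetical helper names.

```python
def parse_sparse_row(line):
    """Split one sparse row into (query name, |q|, {column_id: value})."""
    cells = [c.strip() for c in line.split(",") if c.strip()]
    name, total = cells[0], int(cells[1])
    entries = {}
    for cell in cells[2:]:
        col, value = cell.split(":")
        entries[int(col)] = int(value)     # column ids are 1-based
    return name, total, entries

def to_dense(entries, n_samples):
    """Expand sparse entries to a list of length n_samples; omitted cells are 0."""
    return [entries.get(j, 0) for j in range(1, n_samples + 1)]
```

A row that lists only the query name and its k-mer count yields an empty dictionary, which expands to a row of zeros.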

2.3. Calculating similarities or distances

kmer-db distance <measure> [-phylip-out] [-sparse [-min [<criterion>:]<value>]* [-max [<criterion>:]<value>]* ] <common_table> <output_table>

Parameters:

  • measure - name of the similarity/distance measure to be calculated, can be one of the following:
    • jaccard: $J(q,s) = |q \cap s| / |q \cup s|$,
    • min: $\min(q,s) = |q \cap s| / \min(|q|,|s|)$,
    • max: $\max(q,s) = |q \cap s| / \max(|q|,|s|)$,
    • cosine: $\cos(q,s) = |q \cap s| / \sqrt{|q| \cdot |s|}$,
    • mash (Mash distance): $\textrm{Mash}(q,s) = -\frac{1}{k}\ln\frac{2 \cdot J(q,s)}{1 + J(q,s)}$,
    • ani (average nucleotide identity): $\textrm{ANI}(q,s) = 1 - \textrm{Mash}(q,s)$,
    • ani-shorter - same as ani but with min used instead of jaccard.
  • common_table (input) - file containing table with numbers of common k-mers produced by all2all, new2all, or one2all mode (both, dense and sparse matrices are supported),
  • output_table (output) - file containing table with calculated distance measure,
  • -phylip-out - store output distance matrix in a Phylip format,
  • -sparse - outputs a sparse matrix (only for dense input matrices - sparse inputs always produce sparse outputs),
  • -min [<criterion>:]<value> - retains elements with criterion greater than or equal to value (see details below),
  • -max [<criterion>:]<value> - retains elements with criterion lower than or equal to value (see details below).

criterion can be num-kmers (number of common k-mers) or one of the distance/similarity measures: jaccard, min, max, cosine, mash, ani, ani-shorter (see the definitions above). If no criterion is specified, the measure argument is used by default. Multiple filters can be combined.
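
The measures defined above depend only on three counts from the common table (|q|, |s|, and the number of common k-mers) plus the k-mer length k. The following Python sketch evaluates them for a single pair, using the identity |q ∪ s| = |q| + |s| - |q ∩ s|. The function `measures` is a hypothetical illustration and not Kmer-db's implementation. Returning infinity for the Mash distance when there are no common k-mers is a choice of this sketch, and the tool may report that case differently.

```python
import math

def measures(common, q_total, s_total, k):
    """Compute the similarity/distance measures for one (q, s) pair."""
    def mash_from(sim):
        # Mash-style transform of a similarity value in (0, 1].
        if sim == 0:
            return math.inf                # sketch-specific choice
        return -math.log(2 * sim / (1 + sim)) / k

    union = q_total + s_total - common
    jaccard = common / union
    min_sim = common / min(q_total, s_total)
    mash = mash_from(jaccard)
    return {
        "jaccard": jaccard,
        "min": min_sim,
        "max": common / max(q_total, s_total),
        "cosine": common / math.sqrt(q_total * s_total),
        "mash": mash,
        "ani": 1 - mash,
        "ani-shorter": 1 - mash_from(min_sim),   # ani with min instead of jaccard
    }
```

For two samples of 100 k-mers each that share 50 of them, the Jaccard index is 50 / 150 = 1/3, so with k = 18 the Mash distance equals ln(2) / 18.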

2.4. Storing minhashed k-mers

This is an optional analysis step which stores minhashed k-mers on the hard disk to be later consumed by build, new2all, or one2all modes with -from-minhash switch. It can be skipped if one wants to use all k-mers from samples for distance estimation or employs minhashing during database construction. Syntax:

kmer-db minhash [-f <fraction>] [-k <kmer-length>] [-multisample-fasta] <sample_list>

kmer-db minhash -from-kmers [-f <fraction>] <sample_list>

Parameters:

  • sample_list (input) - file containing list of samples in one of the supported formats (see build mode),
  • -f <fraction> - fraction of all k-mers to be accepted by the minhash filter (default: 0.01),
  • -k <kmer-length> - length of k-mers (default: 18; maximum: 30); ignored when -from-kmers switch is specified,
  • -multisample-fasta / -from-kmers - see build mode for details.

For each sample from the list, a binary file with .minhash extension containing filtered k-mers is created.
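
The text does not specify how the minhash filter selects k-mers. One common way to accept a fixed fraction of k-mers is to hash each k-mer and keep it when the hash falls below a threshold proportional to the fraction. The Python sketch below illustrates that general idea only; the hash function (BLAKE2b), the helper names, and the lack of reverse-complement handling are assumptions, and this is not necessarily how Kmer-db works internally.

```python
import hashlib

def kmers(seq, k):
    """Return the set of all k-mers of a sequence (no canonical form)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def accepted(kmer, fraction):
    """Accept a k-mer if its 64-bit hash lies in the lowest `fraction` of the range."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") < fraction * 2**64

def minhash_filter(kmer_set, fraction):
    """Keep only the k-mers accepted by the hash threshold."""
    return {x for x in kmer_set if accepted(x, fraction)}
```

Because the decision depends only on the k-mer itself, a k-mer shared by two samples is kept in both or dropped from both, so counts of common k-mers remain comparable when the same fraction is used for all samples.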

3. Datasets

The list of pathogens investigated in the Kmer-db study can be found here.

Citing

Deorowicz, S., Gudyś, A., Długosz, M., Kokot, M., Danek, A. (2019) Kmer-db: instant evolutionary distance estimation, Bioinformatics, 35(1): 133–136