/ntHits

Identifying repeats in high-throughput sequencing data

Primary LanguageC++MIT LicenseMIT

ntHits

Logo

ntHits is a tool for efficiently counting and filtering k-mers based on their frequencies.

Dependencies

  • C++ compiler with c++17 and OpenMP support
  • Meson
  • btllib (>=1.7.2)
  • Catch2, only for running tests

ntHits uses argparse for command-line argument parsing which is built-in as a submodule (no further installation required).

Installation

NOTE: If you are installing btllib from its source, run its ./compile script and add the following environment variables:

export CPPFLAGS="-isystem /path/to/btllib/install/include $CPPFLAGS"
export LDFLAGS="-L/path/to/btllib/install/lib -lbtllib $LDFLAGS"

Download the latest release and run the following command in the project's root directory to create a buildsystem in the build folder:

meson setup build

Then, cd into the build folder and compile ntHits using:

ninja

This will generate two binary files in the build folder: nthits for generating the desired data structure containing the k-mers and if possible, their counts; and nthits-bfq for querying the output if it's a (counting) Bloom filter.

Usage

Usage: nthits --frequencies VAR [--min-count VAR] [--max-count VAR] [--kmer-length VAR] [-h] [--error-rate VAR] [--seeds VAR] [--threads VAR] [--solid] [--long-mode] --out-file VAR out_type files

Filters k-mers based on counts (cmin <= count <= cmax) in input files

Positional arguments:
  out_type              Output format: Bloom filter 'bf', counting Bloom filter ('cbf'), or table ('table') [required]
  files                 Input files [nargs: 0 or more] [required]

Optional arguments:
  -f, --frequencies     Frequency histogram file (e.g. from ntCard) [required]
  -cmin, --min-count    Minimum k-mer count (>=1), ignored if using --solid [default: 1]
  -cmax, --max-count    Maximum k-mer count (<=254) [default: 254]
  -k, --kmer-length     k-mer length, ignored if using spaced seeds (-s) [default: 64]
  -h, --num-hashes      Number of hashes to generate per k-mer/spaced seed [default: 3]
  -p, --error-rate      Target Bloom filter error rate [default: 0.0001]
  -s, --seeds           If specified, use spaced seeds (separate with commas, e.g. 10101,11011) 
  -t, --threads         Number of parallel threads [default: 4]
  --solid               Automatically tune 'cmin' to filter out erroneous k-mers 
  --long-mode           Optimize data reader for long sequences (>5kbp) 
  -v                    Level of details printed to stdout (-v: normal, -vv detailed) 
  -o, --out-file        Output file's name [required]

Copyright 2022 Canada's Michael Smith Genome Science Centre

If the output data structure is a Bloom filter (or CBF), it can be queried by either using the nthits-bfq tool, or using btllib's API.

ntHits Bloom Filter Query Tool

Usage: nthits-bfq [-h] [--cbf] [--seeds VAR] [--silent] bf_path

Query tool for ntHits' output Bloom filter

Positional arguments:
  bf_path       Input Bloom filter file [required]

Optional arguments:
  -h, --help    shows help message and exits 
  -v, --version prints version information and exits 
  --cbf         Treat input file as a counting Bloom filter and output k-mer counts 
  -s, --seeds   Spaced seed patterns separated with commas (e.g. 10101,11011) 
  --silent      Don't print logs to stdout 

Copyright 2022 Canada's Michael Smith Genome Science Centre

btllib's Bloom Filter API

C++ example:

#include <btllib/bloom_filter.hpp>
#include <btllib/counting_bloom_filter.hpp>
#include <string>

int main() {
  btllib::KmerBloomFilter bf(path_to_bloom_filter);
  // or btllib::KmerCountingBloomFilter8 
  std::string kmer = "AGCTATCAGTCGA";
  std::cout << bf.contains(kmer) << std::endl;
  return 0;
}

Python example:

import btllib

bf = btllib.KmerBloomFilter(path_to_bloom_filter)
# or btllib.KmerCountingBloomFilter8
kmer = "AGCTATCAGTCGA"
print(bf.contains(kmer))

If using spaced seeds, btllib's BloomFilter and CountingBloomFilter classes should be used instead. In this case, refer to btllib's docs and examples to query the Bloom filters using hashes generated from a SeedNtHash object.