Rust crate to convert any DNA sequence to an iterator of its k-min-mers. It is useful for rust-mdbg.
This library performs a variant of Run Length Encoding (called Homopolymer Compression) followed by fast rolling hash of DNA sequences. Its purpose is to very quickly convert any DNA sequence into an ordered list of hashes. This then allows to sketch the DNA sequence and do various computations faster, on the sketch rather the initial nucleotides.
For a primer on k-min-mers and minimizer-space representations, see: https://www.cell.com/cell-systems/fulltext/S2405-4712(21)00332-X?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS240547122100332X%3Fshowall%3Dtrue
-
Uses NtHash1 behind the scenes.
-
Allows fast iteration in HPC-space but reports positions in original sequence space.
-
SIMD implementation for Homopolymer Compression
-
SIMD implementation for 32-bit NtHash1 and some 31-bit variant of NtHash2 (modified from bcgsc/ntHash#9)
Very good, thanks for asking 😉 Per thread, scalar ntHash is around 100-200 MB/sec, SIMD ntHash around 1 GB/sec, HPC alone around 4 GB/sec
cargo run # to run a small test in main.rs
cargo run -- [filename.fasta] [nb_threads] # to run a bigger test using an input file (also main.rs)
cargo test # run some tests
cargo bench # run some throughput benchmarks