This repository contains code for the manuscript Efficient minimizer orders for large values of k using minimum decycling sets
c++ source code for Decycling-set-based minimizers is in src
, provided as header files:
minimzers.h
: decycling set based minimizers (up to k=63)
big_minimzers.h
: decycling set based minimizers for k longer than 63. This uses the GMP bignum library for larger integer representations. You must install the GMP library and link it when compiling your code to use these minimizers.
test_minimizers.cpp
: an example of how to use minimizers in your code. It can be used to compute the density for an input sequence or random sequence.
g++ -Wall -O3 -o test_minimizer test_minimizers.cpp -std=c++2a -lgmp -lgmpxx
Note that -lgmp -lgmpxx
are only necessary for big_minimizers
Command line options for test_minimizer:
Usage: ./test_minimizer -k <k> -L <L> -r/--reps -l/--seqlen -m/--method -seq/--seqfile -o/--outfile [-set/--setfile]
k: k-mer length
L: window length
reps: # of randomized repeats
seqlen: Length of random sequence to generate.
If input sequence is read from file and length is non-zero, this length will be sampled from the input.
method: Order to use. Options: 'random', 'docks', 'pasha', 'uhs', 'decycling', 'double', 'set'
seqfile: The sequence is read from this file, or if it doesn't exist, the random sequence will be saved to this file
outfile: File to append output statistics to.
setfile: The file with the set to use. Required when method is 'set', 'uhs', 'pasha', or 'docks'
results
contains tables of raw results reported in the paper
scripts/all_plots.py
contains plotting code to generate the paper figures from the results tables
Efficient minimizer orders for large values of k using minimum decycling sets
David Pellow, Lianrong Pu, Baris Ekim, Lior Kotlar, Bonnie Berger, Ron Shamir, Yaron Orenstein
Genome Research https://genome.cshlp.org/content/33/7/1154.short