Skmer is a fast tool for estimating distances between genomes from low-coverage sequencing reads (genome-skims), without needing any assembly or alignment step. The paper that describes the methods and tests Skmer on simulated short reads and on SRA datasets from previous sequencing experiments is available online (open access).
Skmer is a command-line tool implemented in Python. It runs Jellyfish and Mash internally to efficiently compute the k-mer profiles of genome-skims and their intersections, and estimates genomic distances by correcting for the effects of low coverage and sequencing error. Skmer also depends on seqtk for some FASTQ/FASTA processing.
On 64-bit Linux and Mac OSX, you can install Skmer from the bioconda channel using the conda package manager.
- Install Miniconda (you can skip this if you already have either Miniconda or Anaconda installed).
- Add the bioconda channel by running the following commands in your terminal (order matters):
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
- Run the following command to install Skmer (and all dependencies):
conda install skmer
Alternatively, and for all other operating systems and architectures, you can download the GitHub repository and install Skmer using the setup script.
- You need to have Python 2.7 or later installed.
- Install Jellyfish (v2.2.6 or later), Mash (v1.1 or later), and seqtk (v1.3), and add the paths to their binaries to the system path (so that you can, e.g., run jellyfish --version, mash --version, and seqtk successfully in the terminal).
- Clone the GitHub repository by running (or you can download the repo)
git clone https://github.com/shahab-sarmashghi/Skmer.git
- Change to the Skmer directory and run
python setup.py install
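Before running the setup script, it can help to confirm that the three external tools are discoverable on the system path. The following is a minimal Python sketch and not part of Skmer itself: the helper name find_missing_tools is hypothetical, it needs Python 3.3 or later for shutil.which, and it only checks that an executable with each name exists, not that its version meets the requirements listed above.

```python
import shutil

# External tools that Skmer calls, as listed in the installation notes.
REQUIRED_TOOLS = ("jellyfish", "mash", "seqtk")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

An empty list means every tool was found; any name that is returned still needs to be installed or added to the path.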
Skmer has three sub-commands: reference, distance, and query.
The reference sub-command takes the path to a directory of FASTQ/FASTA files (one uncompressed .fastq/.fq/.fa/.fna/.fasta file per sample) and creates a reference library containing the estimates of sequencing parameters as well as the Mash sketch for each sample. If the input is an assembled sequence (determined by the length of its sequences), the correction for low coverage and sequencing error is not applied to that sample. All corrected pairwise genomic distances are then estimated and written to a file. For a test run, change to the data directory under your Skmer installation directory and run
skmer reference ref_dir -p 4
The genome-skims and assemblies in the ref_dir directory are processed (using 4 cores in parallel), and a reference library, named library, is created in the working directory. You can specify a custom name (and hence path) for your library using the -l option:
skmer reference ref_dir -l custom_library_name
The default k-mer size is 31, which is the maximum length allowed by Mash, and can be changed using the -k option. We do not recommend using k-mers shorter than ~21, as k-mers without any shared evolutionary history start to look similar purely by chance. The sketch size can also be changed from its default value of 10000000 using the -s option. Decreasing the sketch size reduces the size of the library on disk, but also compromises the accuracy of distance estimation. By default, the corrected pairwise distances are written to the file ref-dist-mat.txt in the working directory. The output prefix can be changed using the -o option:
skmer reference ref_dir -o output_prefix
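The advice against small k-mer sizes can be illustrated with a back-of-the-envelope calculation. The sketch below rests on a simplified model that is not taken from Skmer: a genome is treated as a uniformly random DNA sequence with genome_length k-mer positions, and a fixed k-mer matches each position independently with probability 4^-k, so the chance of at least one random occurrence is 1 - (1 - 4^-k)^L. The function name and the 1 Gbp genome size are illustrative assumptions.

```python
import math


def random_hit_probability(k, genome_length):
    """Chance that a fixed k-mer occurs at least once in a random genome.

    Assumes a uniform, independent base model (an illustration only):
    each of the `genome_length` positions matches with probability 4**-k.
    """
    per_position = 4.0 ** -k
    # 1 - (1 - p)**L, computed with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(genome_length * math.log1p(-per_position))


if __name__ == "__main__":
    genome_length = 10 ** 9  # hypothetical 1 Gbp genome
    for k in (11, 15, 21, 31):
        p = random_hit_probability(k, genome_length)
        print("k=%d: P(random hit) ~ %.3g" % (k, p))
```

Under this model the probability of a chance match falls off quickly as k grows, which is the intuition behind keeping k at roughly 21 or above.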
If the distances are going to be used to build phylogenies, we recommend using the -t flag. In this case, the estimated distances are transformed to phylogenetic distances using the Jukes-Cantor model of substitution. Run skmer reference -h for the complete list of arguments and options.
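For reference, the textbook Jukes-Cantor (JC69) correction maps a proportion of differing sites d to an expected number of substitutions per site, -3/4 ln(1 - 4/3 d). The sketch below implements that standard formula; it is an assumption that the -t flag applies exactly this form, and the function name jukes_cantor is hypothetical rather than part of Skmer.

```python
import math


def jukes_cantor(distance):
    """Standard JC69 correction: -3/4 * ln(1 - 4/3 * d).

    `distance` is the proportion of differing sites, with 0 <= d < 0.75.
    """
    if not 0.0 <= distance < 0.75:
        raise ValueError("JC69 is defined only for 0 <= d < 0.75")
    # log1p keeps precision when the distance is very small.
    return -0.75 * math.log1p(-4.0 * distance / 3.0)


if __name__ == "__main__":
    for d in (0.01, 0.05, 0.1, 0.3):
        print("d=%.2f -> JC distance %.4f" % (d, jukes_cantor(d)))
```

The corrected value is always at least as large as the input, and the gap widens as the distance grows, because multiple substitutions at the same site become more likely.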
The distance sub-command computes all pairwise distances for a processed library. Its main use is to compute distances when combining already processed libraries; otherwise, the reference command already outputs the distances when it processes the input files. For example, in the data directory, assuming that you have already run reference and compiled the reference library named library, try
skmer distance library -t -o jc-dist-mat
The distances between all samples in the library are computed and, after applying the Jukes-Cantor transformation, the mutation rates are written to jc-dist-mat.txt, which can later be used to build a distance-based phylogeny. To see the help message, run skmer distance -h.
The query sub-command processes a query genome-skim or assembly and outputs the list of reference samples sorted by their distance to the query. Optionally, the query can be added to the reference library. To test it, assuming that you have already run reference and compiled the reference library named library, run the following in the data directory:
skmer query qry.fastq library
The sorted list of reference species and their distances from the query is written to dist-qry.txt. You can change the output prefix from dist to something else using the -o option:
skmer query qry.fastq library -o output_prefix
If you want to add the processed query to the reference library and include it as a reference for future comparisons, use the -a flag. To see the complete list of inputs and options, run skmer query -h.
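When the sub-commands are driven from a script, it can be convenient to assemble the command lines programmatically. The following is a small sketch using only the flags described in this document (-l, -o, -p, -t for reference; -o, -a for query). The helper names reference_cmd and query_cmd are hypothetical and not part of Skmer.

```python
def reference_cmd(ref_dir, library=None, prefix=None, cores=None, jc=False):
    """Build the argument list for `skmer reference`."""
    cmd = ["skmer", "reference", ref_dir]
    if library is not None:
        cmd += ["-l", library]      # custom library name
    if prefix is not None:
        cmd += ["-o", prefix]       # output prefix for the distance matrix
    if cores is not None:
        cmd += ["-p", str(cores)]   # number of cores
    if jc:
        cmd.append("-t")            # Jukes-Cantor transformation
    return cmd


def query_cmd(query_file, library, prefix=None, add_to_library=False):
    """Build the argument list for `skmer query`."""
    cmd = ["skmer", "query", query_file, library]
    if prefix is not None:
        cmd += ["-o", prefix]       # output prefix (default is dist)
    if add_to_library:
        cmd.append("-a")            # add the query to the reference library
    return cmd


if __name__ == "__main__":
    # Print the commands from the test run in the data directory.
    print(" ".join(reference_cmd("ref_dir", cores=4)))
    print(" ".join(query_cmd("qry.fastq", "library", add_to_library=True)))
```

With Skmer installed, each returned list can be passed to subprocess.check_call to run the command from the data directory.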