This repository contains a Snakemake pipeline for benchmarking KAGE and other genotypers. Benchmarks can be run on either real (experimental) or simulated data. Running all the experiments takes 2-3 days using 16 CPU cores for each genotyper, as some of the genotypers require 10+ hours to run. However, running all genotypers on a small simulated dataset takes less than an hour.
The branch v0.0.1 is a frozen snapshot of the code used for the experiments in the KAGE manuscript. The Conda yml files in that branch specify which software versions were used.
Before you start, you will need both Snakemake (to run the benchmarking pipeline) and Conda (to get all the correct dependencies). Follow the instructions to install Snakemake if you don't have Snakemake already.
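A quick way to confirm that both prerequisites are reachable before starting is sketched below. It assumes the executables are named `snakemake` and `conda` and are on your PATH; the helper `missing_tools` is a hypothetical name used for illustration and is not part of this repository.

```python
import shutil


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Assumed executable names; adjust if your installation uses different ones.
missing = missing_tools(["snakemake", "conda"])
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("Snakemake and Conda were both found on PATH.")
```

If a tool is reported as missing, install it (or activate the environment that provides it) before running the pipeline.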
git clone https://github.com/ivargr/genotyping-benchmarking
NOTE: All dependencies will automatically be installed by Conda when you run the pipeline, except for PanGenie (which is currently not available on Conda). You will need to install PanGenie manually, and edit config.yaml to specify the installation path of PanGenie.
The default is to use 16 CPU cores for each method, and 40 CPU-cores to create indexes etc. If you want to change this, edit config.yaml before running.
All dependencies are handled by Conda (meaning you should use --use-conda with Snakemake), except Python dependencies, as we believe it is nice to have some control over these by installing them with your chosen Python interpreter. Thus, first install the Python requirements into your chosen virtual environment:
pip install -r python_requirements.txt
Simply run the following command. It runs all the genotypers on a small simulated dataset (specified in config.yaml) and creates a table with the results.
snakemake -s simulated_experiment.smk --use-conda
If everything runs without errors, a file figure11.html containing the following result table will be generated:
+------------+---------------+------------------+-----------+-------------+----------------+---------+---------+--------------+
|            | Indels recall | Indels precision | Indels F1 | SNPs recall | SNPs precision | SNPs F1 | Runtime | Memory usage |
+------------+---------------+------------------+-----------+-------------+----------------+---------+---------+--------------+
| KAGE       | 0.692         | 0.692            | 0.692     | 0.842       | 0.889          | 0.865   | 0 min   | 3 GB         |
| PanGenie   | 0.769         | 0.714            | 0.741     | 0.789       | 0.714          | 0.750   | 2 min   | 48 GB        |
| Bayestyper | 0.231         | 1.000            | 0.375     | 0.158       | 1.000          | 0.273   | 1 min   | 3 GB         |
| Malva      | 0.462         | 0.500            | 0.480     | 0.684       | 0.867          | 0.765   | 1 min   | 42 GB        |
| Graphtyper | 0.077         | 1.000            | 0.143     | 0.158       | 0.167          | 0.162   | 0 min   | 0 GB         |
| GATK       | 0.154         | 0.125            | 0.138     | 0.263       | 0.714          | 0.385   | 0 min   | 2 GB         |
+------------+---------------+------------------+-----------+-------------+----------------+---------+---------+--------------+
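To help read the table, the sketch below shows how an F1 column relates to the recall and precision columns, assuming F1 here is the standard harmonic mean of the two (the pipeline's own scoring code is not shown in this README). The function name `f1_score` is illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision) pairs copied from the SNP columns of the result table.
snp_scores = {"KAGE": (0.842, 0.889), "Bayestyper": (0.158, 1.000)}
for method, (recall, precision) in snp_scores.items():
    print(f"{method}: SNPs F1 = {f1_score(precision, recall):.3f}")
```

Because the table entries are already rounded to three decimals, values recomputed this way can differ from the reported F1 in the last digit.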
Note: Running all the experiments will take several days and requires a lot of RAM. To run only a subset of the methods, edit figures.smk.
snakemake