dna-proto workflow (Snakemake)

This workflow analyses genome re-sequencing experiments and has two modes. The de-novo mode confirms sample relationships directly from the raw sequencing reads using kwip and mash. The varcall mode aligns reads to one or several reference genomes and then detects variants. Reads can be aligned with bwa and/or NextGenMap, and variants can be called with FreeBayes and/or bcftools mpileup. In the authors' assessment, these are currently the best-performing tools for re-sequencing large plant genomes. Between read alignment and variant calling, PCR duplicates are flagged with samtools markdup and indels are realigned with abra2. If a genome annotation is available, variants are annotated with snpEff.

Authors

Norman Warthmann
Marcos Conde
Kevin Murray*

*Core functionality of this workflow is based on PaneucalyptShortReads

Usage

1. Create a new GitHub repository in your GitHub account using this workflow as a template.
2. Clone your newly created repository to the local system where you want to perform the analysis.
3. Set up the software dependencies.
4. Configure the workflow for your needs and input files.
5. Run the workflow.
6. Archive your workflow to document your work and make it easy to reproduce.

Some pointers for setting up, configuring, and running the workflow are given below; for details, please consult the documentation.

Setup

An easy way to set up the dependencies is conda.

Get the Miniconda Python 3 distribution:

$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh

Create an environment with the required software:

NOTE: The conda environment name used in these examples is dna-proto.

$ conda env create --name dna-proto --file envs/condaenv.yml

Activate the environment:

$ conda activate dna-proto

Additional useful conda commands are here.

Check config and metadata

The utils/ directory contains scripts that list the metadata and configuration parameters.

python utils/check_metadata.py
python utils/check_config.py

Visualising the workflow

You can inspect the workflow in graphical form by rendering its directed acyclic graph (DAG) of jobs.

snakemake --dag -npr -j -1 | dot -Tsvg > dag.svg
eog dag.svg

Dry run of the workflow

Before running the workflow, perform a dry run and confirm that it will do what you intend.

snakemake -npr

Data

Main directory content:

.
├── envs
├── genomes_and_annotations
├── metadata
├── output
├── rules
├── scripts
├── utils
├── config.yml
├── Snakefile
└── snpEff.config

NOTE: The workflow generates the output directory and some of the files in the metadata directory.

You will need to configure the workflow for your specific project. For details, see the documentation. The following files and directories will need editing:

Snakefile
genomes_and_annotations/
metadata/
config.yml
snpEff.config
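
Before editing, it can help to confirm that the files and directories listed above are present in your clone. The following is a minimal sketch of such a check. It is not one of the scripts shipped in utils/, the names REQUIRED_PATHS and missing_paths are hypothetical, and it tests only for existence, not for valid content.

```python
from pathlib import Path

# Paths this README lists as needing project-specific edits.
REQUIRED_PATHS = [
    "Snakefile",
    "genomes_and_annotations",
    "metadata",
    "config.yml",
    "snpEff.config",
]


def missing_paths(root=".", required=REQUIRED_PATHS):
    """Return the entries of `required` that do not exist under `root`.

    Hypothetical helper for illustration only: it checks existence,
    not whether the files are correctly configured.
    """
    base = Path(root)
    return [name for name in required if not (base / name).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        raise SystemExit("Missing: " + ", ".join(missing))
    print("All expected files and directories are present.")
```

Run from the top-level directory of the repository, the script exits with an error that names any missing entry.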

You can download example data for testing the workflow.

warthmann/dna-proto-RC1