Snakemake workflow: ENCODE Demo Workflow
This workflow is under development.
This is a Snakemake workflow that implements the previous encode-demo-pipeline. By convention, all ENCODE-DCC pipelines under snakemake-workflows are named in the format encode-<name>-workflow. Here you will find:
- Snakefile: the main entrypoint for the workflow
- config.yaml: the configuration file that stores variables, the idea being that an implementation of the pipeline can customize it. You can read more about configuration files here.
- schemas: most of the configuration here is provided in YAML, and we need a way to validate those files. This folder contains JSON schemas used to validate them.
- scripts: supplementary scripts for the pipeline
- rules: rule files that are included in the Snakefile and provide a human-friendly way to organize your steps. For example, the main task here is trimming, so a trimming.smk file is included.
- envs: environment definitions used at runtime (typically conda environment files)
The pipeline demonstrates using the Trimmomatic software to trim input FASTQs. The output includes the trimmed FASTQ and a plot of FASTQ quality scores before and after trimming. For simplicity this demo supports only single-end FASTQs; however, since we use the trimmomatic snakemake-wrapper, it would be fairly easy to adapt it for paired-end reads.
Importantly, this pipeline is not intended to run directly on your host. For complete reproducibility, we use containers. While some think that using conda is enough for reproducibility, I do not.
Authors
- Vanessa Sochat (@vsoch)
Development
How does it work?
Snakefile
The main Snakefile is the entrypoint to your workflow. At the top there is a "rule all" section (akin to "make all" in a Makefile) that lists, or expands patterns for, the files that should be produced after running the workflow.
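As a hedged sketch (the exact targets are illustrative, and the sample identifiers come from the samples table described below), such a target rule might look like:

rule all:
    input:
        # Trimmed FASTQs plus before/after quality score plots
        expand("data/trimmed/trimmed.file{sample}.fastq.gz", sample=samples.index),
        expand("data/file{sample}_untrimmed_file{sample}_trimmed_quality_scores.png", sample=samples.index)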
Rules
Rules are the other tasks in the workflow, organized in the rules folder. Each rules file is included in the Snakefile so that its rules can be run. The logic of running things comes down to looking at the inputs and outputs of rules, and running them in the order needed to generate the final expected outputs.
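To illustrate this organization (the rule name and script path below are hypothetical placeholders, not copied from this repository), the Snakefile includes a rules file, and each rule there declares its inputs and outputs so Snakemake can chain them:

# In the Snakefile: pull in rules from the rules folder
include: "rules/trimming.smk"

# In rules/trimming.smk: a rule declares inputs, outputs, and an action
rule plot_quality_scores:
    input:
        untrimmed="data/reads/file{sample}.fastq.gz",
        trimmed="data/trimmed/trimmed.file{sample}.fastq.gz"
    output:
        "data/file{sample}_untrimmed_file{sample}_trimmed_quality_scores.png"
    script:
        "../scripts/plot_quality_scores.py"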
samples.tsv
While we could write input data as variables in config.yaml, when the number of inputs gets very large this becomes problematic. Instead, we can use a samples.tsv file that records our input data. For example, the file here contains:
sample condition
1 untreated
2 treated
The numbers "1" and "2" are variables that we want to carry around the pipeline to refer to inputs and outputs. For example, our inputs are named accordingly:
$ ls data/reads/
file1.fastq.gz file2.fastq.gz
And we would read in the table of samples (1, 2) in the first rules like so:
import pandas
samples = pandas.read_csv("samples.tsv", sep="\t").set_index("sample", drop=False)
Notice that we set the index to "sample" to correspond with the column with 1 and 2. We could then feed this index (including both 1 and 2) into a variable to populate an input:
sample=samples.index
For this example, we don't actually use the second column (condition), but it's provided as an example (we could!).
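If we did want to use it, one hedged possibility (the helper function below is hypothetical, not part of this repository) would be to look the condition back up from the table inside a rule via an input or params function:

def get_condition(wildcards):
    # Look up the condition column for the current sample wildcard
    return samples.loc[int(wildcards.sample), "condition"]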
Validation
All of the content in schemas is for validation. For example, after reading in the samples in the section above, we might want to validate that variable, ensuring that fields are defined, and of a particular type. Snakemake provides a function to make it easy for us to do this:
from snakemake.utils import validate
validate(samples, schema="../schemas/samples.schema.yaml")
And so logically, we also do this in the first run of our rules. You wouldn't want to proceed with a pipeline if there is missing data.
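The same pattern can be applied to the configuration file. As a sketch (the schema filename here is an assumption about the repository layout):

configfile: "config.yaml"
# Hypothetical: validate the loaded config dictionary against its own schema
validate(config, schema="schemas/config.schema.yaml")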
Snakemake Wrappers
In rules files you'll notice that a lot of the sections have an attribute called "wrapper." These are snakemake-wrappers that you can literally copy and paste into one of your files, making it easy to develop workflows. For example, this unique resource identifier:
...
    wrapper:
        "0.36.0/bio/trimmomatic/se"
corresponds to the single-end trimmomatic wrapper from release 0.36.0 of the snakemake-wrappers repository. If you are thinking ahead, you are correct that it would be highly useful to develop snakemake-wrappers for tasks that you see repeating more than once, or that you want to make easy for other researchers to use.
Usage
Step 1: Download the Repository
If you simply want to use this workflow, first clone the repository:
$ git clone https://www.github.com/vsoch/encode-demo-workflow
Step 2: Configure workflow
If you want to change configuration options, edit the config.yaml. Otherwise, we will demonstrate usage with the defaults already built into the container.
Step 3: Execute workflow
Singularity
For this example, we will use Singularity, as it's most likely you will want to run this on HPC.
You can pull the container:
$ singularity pull docker://quay.io/vanessa/encode-demo-workflow
And first test the default configuration by performing a dry-run via
$ singularity run encode-demo-workflow_latest.sif -n
And then run the workflow:
$ singularity run encode-demo-workflow_latest.sif
The output files will be in data/trimmed/, assuming that you run the workflow from the repository with the local files bound. You can also generate a report! In the example below, we generate an "index.html" to display the report on the master branch via GitHub Pages:
$ singularity run encode-demo-workflow_latest.sif --report index.html
And then clean up:
$ singularity run encode-demo-workflow_latest.sif --delete-all-output
If you want to run this with a job manager, just put the entire command in a single script. The native --cluster commands don't work with this approach.
Docker
Executing the same workflow using Docker gives us more isolation from the host. We can first do a dry-run with the defaults provided in the container:
$ docker run quay.io/vanessa/encode-demo-workflow -n
To run the workflow (entirely in the container):
$ docker run quay.io/vanessa/encode-demo-workflow
To bind output to the host:
$ docker run -v $PWD/data/trimmed:/code/data/trimmed quay.io/vanessa/encode-demo-workflow
or bind the entire present working directory to be used in the container:
$ docker run -v $PWD:/code quay.io/vanessa/encode-demo-workflow
and for a report:
$ docker run -v $PWD:/code quay.io/vanessa/encode-demo-workflow --report /code/index.html
The generated report uses the root of the repository as the web root, so you will want to name it index.html to render on GitHub Pages.
Development
If you want to build the container used in the examples above, there is a Dockerfile that serves as the base for both Singularity and Docker. The difference from the main ENCODE container is that we don't install Python 2, and we create an alias for trimmomatic so that the Snakemake wrapper will work. If you need to build the Docker container (a pre-built image should already be provided for the user in a container registry):
$ docker build -f docker/Dockerfile -t quay.io/vanessa/encode-demo-workflow .
And then we pull to the host via Singularity or Docker.
The expectation is that the container includes all dependencies for the workflow,
including a trimmomatic binary. For this particular container, we create an
executable that forwards the command to the .jar (Java) file.
See the Snakemake documentation for further details.
Locally
Execute the workflow locally via
$ snakemake --use-conda --cores $N
using $N cores. Running it in a cluster environment should also be possible, but has not been tested yet.
Step 4: Investigate results
I consider it a bug that report generation cannot currently be done using the container, so for now this is not supported. But typically, after successful execution, you can create a self-contained interactive HTML report with all results via:
$ snakemake --report report.html
The command we would want to work is:
$ snakemake --use-singularity --report report.html
This report can, e.g., be forwarded to your collaborators.
Step 5: Clean Up
If you want to clean up output (to try a different backend, or otherwise just remove the files) you can do:
$ snakemake --delete-temp-output
$ snakemake --delete-all-output
Building DAG of jobs...
Deleting data/trimmed/trimmed.file1.fastq.gz
Deleting data/trimmed/trimmed.file2.fastq.gz
Deleting data/file1_untrimmed_file1_trimmed_quality_scores.png
Deleting data/file2_untrimmed_file2_trimmed_quality_scores.png
And then have a clean slate to start again.
Advanced
The following recipe provides established best practices for running and extending this workflow in a reproducible way.
- Fork the repo to a personal or lab account.
- Clone the fork to the desired working directory for the concrete project/run on your machine.
- Create a new branch (the project-branch) within the clone and switch to it. The branch will contain any project-specific modifications (e.g. to configuration, but also to code).
- Modify the config, and any necessary sheets (and probably the workflow) as needed.
- Commit any changes and push the project-branch to your fork on GitHub.
- Run the analysis.
- Optional: Merge back any valuable and generalizable changes to the upstream repo via a pull request. This would be greatly appreciated.
- Optional: Push results (plots/tables) to the remote branch on your fork.
- Optional: Create a self-contained workflow archive for publication along with the paper (snakemake --archive).
- Optional: Delete the local clone/workdir to free space.
Testing
Test cases are in the subfolder .test. They are automatically executed via continuous integration with Travis CI.