ENCODE DNase-seq pipeline

Introduction

This is the uniform processing pipeline for ENCODE DNase-seq experiments. It uses WDL and Caper (a wrapper over Cromwell) to specify and run the processing steps, and Docker to ensure a reproducible and portable compute environment. The same inputs should produce the same outputs regardless of the computational platform (local, cloud, etc.). The benefits of a reproducible workflow include processed results that are comparable across ENCODE experiments and an easy way for outside users to process their own DNase-seq data for comparison with ENCODE data. See ENCODE's ChIP-seq and ATAC-seq pipelines for other examples of reproducible workflows.

Requirements

Follow the instructions for installing Caper.
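
Caper is distributed as a Python package, so a typical installation (a minimal sketch, assuming a pip-based setup; see the Caper documentation for configuring platform-specific backends) looks like:

pip install caper
caper init local  # writes a default configuration; other backends (gcp, aws, slurm, ...) are available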

Quickstart

Assuming the requirements are installed, you must create a JSON file specifying the inputs of the pipeline. These include the raw FASTQs from the DNase-seq experiment as well as reference files specific to an assembly (e.g. GRCh38) and read length (e.g. 76bp). There are three sections in the input JSON:

{
    "dnase.replicates": [],
    "dnase.references": {},
    "dnase.machine_sizes": {}
}

dnase.replicates takes a list of biological replicates. Here's an example of how to specify the input for ENCSR693UHT, which has a single replicate with two pairs of paired-end FASTQs (R1 and R2 denote the two reads of each pair):

"dnase.replicates": [
    {
        "accession": "ENCSR693UHT",
        "number": 1,
        "read_length": 76,
        "adapters": {
            "sequence_R1": "AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCTCTCTACATCTCGTATGCCGTCTTCTGCTTG",
            "sequence_R2": "AATGATACGGCGACCACCGAGATCTACACCTCTCTACACACTCTTTCCCTACACGACGCTCTTCCGATCT"
        },
        "pe_fastqs": [
            {
                "R1": "s3://encode-public/2019/08/21/af5c625b-b188-4b17-9e78-5463a60d0336/ENCFF201UOC.fastq.gz",
                "R2": "s3://encode-public/2019/08/21/a1ecf95f-713c-4890-95a1-88014b69ef2f/ENCFF149BGS.fastq.gz"
            },
            {
                "R1": "s3://encode-public/2019/08/21/766fecd1-258e-4a92-b725-7fdb6aa77a07/ENCFF752UBC.fastq.gz",
                "R2": "s3://encode-public/2019/08/21/87400548-896f-4b1e-921f-73127a9db3e2/ENCFF559DWT.fastq.gz"
            }
        ]
    }
]

Here's how to specify the input for a 36bp single-end experiment with one replicate like ENCSR420RWU:

"dnase.replicates": [
    {
        "accession": "ENCSR420RWU",
        "number": 1,
        "read_length": 36,
        "se_fastqs": [
            "s3://encode-public/2014/07/04/580edf6f-10a4-4b42-b4a2-4f64a9a87901/ENCFF002DYE.fastq.gz",
            "s3://encode-public/2014/07/04/f123c3c1-3542-4eda-b2d0-7d789f95cea0/ENCFF002DYD.fastq.gz"
        ]
    }
]

Note that adapters aren't required for single-end experiments.

Here's how to specify the input for a 36bp mixed single-end/paired-end experiment with two replicates like ENCSR426IEA:

"dnase.replicates": [
    {
        "accession": "ENCSR426IEA",
        "number": 1,
        "read_length": 36,
        "adapters": {
            "sequence_R1": "AGATCGGAAGAGCACACGTCTGAACTCCAGTCACTGTTGCATCTCGTATGCCGTCTTCTGCTTG",
            "sequence_R2": "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT"
        },
        "pe_fastqs": [
             {
                 "R1": "s3://encode-public/2014/09/27/8a44e7ad-7586-49bd-ad62-6933981fdb17/ENCFF178WLB.fastq.gz",
                 "R2": "s3://encode-public/2014/09/27/2059b738-4e97-4553-a93b-3a112fc34969/ENCFF645INN.fastq.gz"
             },
             {
                 "R1": "s3://encode-public/2014/09/27/cd94b937-c67b-4b54-8b2e-13ab75ad0608/ENCFF377AOZ.fastq.gz",
                 "R2": "s3://encode-public/2014/09/27/0699c990-5072-42d2-9d70-10711853f487/ENCFF536KHM.fastq.gz"
             },
             {
                 "R1": "s3://encode-public/2014/09/27/86a8a345-7c10-4789-9e7a-ca73fc7160e3/ENCFF665FKP.fastq.gz",
                 "R2": "s3://encode-public/2014/09/27/d0631a63-ea91-4c48-a2ec-c93fe4cd72a4/ENCFF580MND.fastq.gz"
             },
             {
                 "R1": "s3://encode-public/2014/09/27/a60a7b84-d9da-48a7-be29-dbaa7a0094be/ENCFF901FSW.fastq.gz",
                 "R2": "s3://encode-public/2014/09/27/96561da6-f942-492b-909f-94a1ad48cea6/ENCFF719IMH.fastq.gz"
             }
         ],
         "se_fastqs": [
             "s3://encode-public/2014/09/27/323b2f10-1ea9-44e6-b366-e6dc0cfd7006/ENCFF302JOP.fastq.gz"
         ]
    },
    {
        "accession": "ENCSR426IEA",
        "number": 2,
        "read_length": 36,
        "adapters": {
            "sequence_R1": "AGATCGGAAGAGCACACGTCTGAACTCCAGTCACATTCCTATCTCGTATGCCGTCTTCTGCTTG",
            "sequence_R2": "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT"
        },
        "pe_fastqs": [
            {
                "R1": "s3://encode-public/2014/09/27/a4c088cb-654f-4bb8-8d74-20301711b67b/ENCFF266WWA.fastq.gz",
                "R2": "s3://encode-public/2014/09/27/0760ec09-c0e9-4c5b-a970-52c17d062e2a/ENCFF314WHE.fastq.gz"
            }
        ],
        "se_fastqs": [
            "s3://encode-public/2014/09/27/9cf1e90b-e2a6-41dc-bbb6-2aaaefcb9d83/ENCFF355JSW.fastq.gz"
        ]
    }
]

In some cases it is necessary to run preseq lc_extrap in defects mode. The option is false by default and can be set separately for each replicate. Here's how to specify the input for a 36bp single-end experiment with one replicate, like ENCSR420RWU, with defects mode enabled:

"dnase.replicates": [
    {
        "preseq_defects_mode": true,
        "accession": "ENCSR420RWU",
        "number": 1,
        "read_length": 36,
        "se_fastqs": [
            "s3://encode-public/2014/07/04/580edf6f-10a4-4b42-b4a2-4f64a9a87901/ENCFF002DYE.fastq.gz",
            "s3://encode-public/2014/07/04/f123c3c1-3542-4eda-b2d0-7d789f95cea0/ENCFF002DYD.fastq.gz"
        ]
    }
]

The dnase_no_footprints.wdl workflow can be used to skip the footprinting part of the pipeline. It takes the same input JSON as the dnase.wdl workflow and otherwise produces identical outputs.
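
For example, submitting the no-footprints workflow with Caper looks the same as submitting dnase.wdl (a minimal sketch; the input JSON path here is hypothetical):

caper submit dnase_no_footprints.wdl --inputs gs://dnase/json/ENCSR693UHT.json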

dnase.references takes read-length-specialized reference files for the Hotspot1/Hotspot2 steps, as well as the standard genomic indices required by the pipeline:

"dnase.references": {
    "genome_name": "GRCh38",
    "indexed_fasta": {
        "fasta": "gs://dnase/ref/GRCh38.fa",
        "fai": "gs://dnase/ref/GRCh38.fa.fai"
    },
    "bwa_index": {
        "bwt": "gs://dnase/ref/GRCh38.fa.bwt",
        "fasta": "gs://dnase/ref/GRCh38.fa",
        "pac": "gs://dnase/ref/GRCh38.fa.pac",
        "ann": "gs://dnase/ref/GRCh38.fa.ann",
        "amb": "gs://dnase/ref/GRCh38.fa.amb",
        "sa": "gs://dnase/ref/GRCh38.fa.sa"
    },
    "nuclear_chroms": "gs://dnase/ref/nuclear_chroms.txt",
    "narrow_peak_auto_sql": "gs://dnase/ref/bigbed/narrowPeak.as",
    "hotspot1": {
        "chrom_info": "gs://dnase/ref/GRCh38.chromInfo.bed",
        "mappable_regions": "gs://dnase/ref/36/GRCh38.K36.mappable_only.bed"
    },
    "hotspot2": {
        "chrom_sizes": "gs://dnase/ref/GRCh38.chrom_sizes.bed",
        "center_sites": "gs://dnase/ref/36/GRCh38.K36.center_sites.n100.starch",
        "mappable_regions": "gs://dnase/ref/36/GRCh38.K36.mappable_only.bed"
    },
    "bias_model": "gs://dnase/ref/vierstra_et_al.txt"
}

In most cases references should already exist for a given assembly/read length, though the references.wdl workflow can be used to generate most of these for a new assembly/read length. Note that it is important for genome_name to correspond to the prefix of your references/indices. You can also specify a txt.gz file for references that are plain text, and a tar.gz archive for references that consist of multiple files:

"dnase.references": {
    "genome_name": "GRCh38",
    "indexed_fasta_tar_gz": "gs://dnase/ref/tarballs/GRCh38/indexed_fasta.tar.gz",
    "bwa_index_tar_gz": "gs://dnase/ref/tarballs/GRCh38/bwa_index.tar.gz",
    "nuclear_chroms_gz": "gs://dnase/ref/nuclear_chroms.txt.gz",
    "narrow_peak_auto_sql": "gs://dnase/ref/bigbed/narrowPeak.as",
    "hotspot1_tar_gz": "gs://dnase/ref/tarballs/GRCh38/76/hotspot1.tar.gz",
    "hotspot2_tar_gz": "gs://dnase/ref/tarballs/GRCh38/76/hotspot2.tar.gz",
    "bias_model_gz": "gs://dnase/ref/vierstra_et_al.txt.gz"
}
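
If you do need to build references for a new assembly/read length, references.wdl can be submitted with Caper just like the main pipeline (a minimal sketch; the input JSON path is hypothetical and its contents depend on the inputs that references.wdl expects):

caper submit references.wdl --inputs gs://dnase/json/GRCh38_76bp_references.json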

dnase.machine_sizes specifies the desired compute size for the steps of the pipeline where custom instances can be specified:

"dnase.machine_sizes": {
    "align": "large2x",
    "concatenate": "medium",
    "convert": "medium",
    "filter": "large",
    "footprint": "medium2x",
    "mark": "large",
    "merge": "large",
    "normalize": "medium2x",
    "peaks": "medium2x",
    "qc": "medium2x",
    "score": "medium",
    "trim": "large2x",
    "unpack": "medium"
}

The sizes are defined in wdl/runtimes.json. For example, medium corresponds to these resources:

"medium": {
    "cpu": 2,
    "memory_gb": 13,
    "disks": "local-disk 200 SSD"
}

Note that dnase.machine_sizes is optional; if it is not specified in the input JSON, default values from wdl/default_machine_sizes.json will be used.

You can use the 76bp dnase_template.json or the 36bp dnase_template.json for GRCh38 to avoid having to fill in the references and machine_sizes sections.

Steps

The pipeline can be roughly split into these parts:

  • Concatenate, trim, and align FASTQs - The raw FASTQs are concatenated, trimmed for adapters and length, and aligned with BWA (single-end and paired-end data are kept separate).
  • Merge, mark, and filter BAMs - The aligned PE/SE BAMs are merged and duplicates are marked. Low-quality and non-nuclear reads are filtered out.
  • Call hotspots and peaks and get SPOT score - The nuclear BAM is passed to Hotspot1 and Hotspot2 to call peaks and compute SPOT scores.
  • Call footprints - A footprint model is fit, and statistical deviations from expected cleavage rates within hotspot regions are called and thresholded.
  • Calculate and gather QC - Quality metrics such as samtools flagstats are collected.
  • Normalize and convert files - The density starch from Hotspot2 is normalized and converted to a bigWig, and the peak starches are converted to bed and bigBed formats.

Running with Caper

Once the input JSON has been specified and placed somewhere accessible (e.g. in a Google Cloud bucket), you can launch the workflow:

caper submit dnase.wdl --inputs gs://dnase/json/ENCSR426IEA.json

You can check on the status of the workflow with:

caper list --hide-subworkflow

Outputs

Files

unfiltered bam
nuclear bam
normalized density bw
five percent allcalls bed gz
five percent allcalls bigbed
five percent narrowpeaks bed gz
five percent narrowpeaks bigbed
tenth of one percent narrowpeaks bed gz
tenth of one percent narrowpeaks bigbed
tenth of one percent peaks starch (summit information)
one percent footprints bed gz
one percent footprints bigbed

Quality Metrics

unfiltered_bam
    trimstats
    stats
    flagstats
    bamcounts
nuclear_bam
    stats
    flagstats
    hotspot1 spot score
    duplication metrics
    preseq
    preseq targets
    insert size metrics
    insert size info
    insertsize histogram pdf
peaks
    hotspot2 spot score
    five percent allcalls count
    five percent hotspots count
    five percent narrowpeaks count
    tenth of one percent narrowpeaks count
footprints
    dispersion model
    one percent footprints count
(Note that insert size metrics and trimstats are only included for PE data.)

Development and Repository Organization

The wdl folder is split up into:

  • tasks
  • subworkflows
  • workflows
  • structs

Tasks wrap generic command-line utilities. Subworkflows use one or more tasks to create DNase-seq-specific utilities. Workflows combine multiple subworkflows into a conceptually useful unit of work for consumption by the end user. Structs define the abstract groups of parameters and files that should be passed around together.

The tests folder is split up into:

  • unit
  • integration
  • functional

In general, unit tests ensure that tasks properly wrap command-line utilities: we only check the syntax of the command that is run, not its output. Integration tests ensure proper combination of tasks in subworkflows: we use test input data and compare actual output with expected output. Functional tests are similar to integration tests but call workflow WDLs directly.

Testing

All tasks, subworkflows, and workflows are tested with unit, integration, and functional tests, respectively. We use pytest-workflow and Caper to run Cromwell and compare outputs. Usually we can compare md5sums using the pytest-workflow yml. For example, in tests/integration/test_samtools.yml the test_build_fasta_index.wdl is run, and we assert that chr22.fa.fai is returned with a certain md5sum:

- name: test_build_fasta_index
  tags:
    - integration
  command: >-
      tests/caper_run.sh \
      tests/integration/wdl/test_build_fasta_index.wdl \
      tests/integration/json/test_build_fasta_index.json
  stdout:
    contains:
      - "samtools faidx chr22.fa"
  files:
      - path: "test-output/chr22.fa.fai"
        md5sum: d89b1fd70e1fe59b329a8346a59ffc03
      - path: "test-output/chr22.fa"
        md5sum: 69a750d5ed9490e0d1e707a43368ba86

However, for binary files like BAM and starch that have nondeterministic md5sums, we write custom comparisons (kept in the .py files) that first convert the outputs to plain text. These are run automatically by pytest-workflow after the outputs have been generated. For example, in tests/integration/test_samtools.yml the test_sort_bam_by_coordinate.wdl is run and we assert that sorted.bam is returned, but we don't check its md5sum:

- name: test_sort_bam_by_coordinate
  tags:
    - integration
  command: >-
      tests/caper_run.sh \
      tests/integration/wdl/test_sort_bam_by_coordinate.wdl \
      tests/integration/json/test_sort_bam_by_coordinate.json
  stdout:
    contains:
      - "samtools sort"
      - "-l 0"
      - "-@ 1"
      - "out.bam"
      - "sorted.bam"
    must_not_contain:
      - "-n"
  files:
      - path: "test-output/sorted.bam"

Then, in tests/integration/test_samtools.py, a Python function linked to the test_sort_bam_by_coordinate test compares the generated sorted.bam with the expected sorted.bam as SAM files:

import pytest

from pathlib import Path


@pytest.mark.workflow(name='test_sort_bam_by_coordinate')
def test_sort_bam_by_coordinate_bams_match(test_data_dir, workflow_dir, bams_match):
    actual_bam_path = workflow_dir / Path('test-output/sorted.bam')
    expected_bam_path = test_data_dir / Path('dnase/alignment/sorted.bam')
    assert bams_match(actual_bam_path.as_posix(), expected_bam_path.as_posix())

Here test_data_dir is a pytest fixture aliasing the folder where we keep our expected data (i.e. tests/data), workflow_dir is a fixture aliasing the directory in which pytest-workflow runs the WDL and collects the outputs, and bams_match is a fixture aliasing a function that opens both BAMs as SAMs and compares them:

import pysam


def compare_bams_as_sams(bam_path1, bam_path2):
    # pysam.view returns the alignment records as SAM-formatted text
    # (without the header by default), so the two outputs can be
    # compared as plain strings.
    sam1 = pysam.view(bam_path1)
    sam2 = pysam.view(bam_path2)
    return sam1 == sam2

The tests can be run by calling pytest in the parent dnase-seq-pipeline folder.
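
For example (a minimal sketch, assuming pytest-workflow's --tag option for filtering on the tags defined in the yml files):

# run the full suite from the repository root
pytest

# or run a single level of tests by tag
pytest --tag unit
pytest --tag integration
pytest --tag functional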

Test dependencies:

The script that runs Caper expects some environment variables: