Terra/WDL DNA methylation workflows

This repository https://github.com/aryeelab/dna-methylation-tools contains a suite of tools to conduct DNA methylation data analysis. It is maintained by Divy Kangeyan at the Aryee Lab

This platform contains publicly accessible cloud-based preprocessing and quality control pipelines that go from raw data to CpG-level methylation estimates. The technologies covered include Whole Genome Bisulfite Sequencing (WGBS), Reduced Representation Bisulfite Sequencing (RRBS) and Hybrid Selection (capture) Bisulfite Sequencing (HSBS). Leveraging the Terra platform allows users to:

Ensure cross-platform reproducibility of analyses
Achieve scalability to large whole genome datasets with 100GB+ of raw data per sample, and to single-cell datasets with thousands of cells
Provide access to best-practice analysis pipelines
Enable integration and comparison between user-provided data and publicly available data (e.g. TCGA)

Workflow steps

Analysis should be run in two successive processes:

Per-sample preprocessing
Aggregation and quality control analysis

Getting started

If you plan to use the WDL workflows in your local computing environment, then this repository should be cloned. If you are using Terra, then you should either clone the Terra workspace aryee-lab/dna-methylation or if you already have your own workspace you can import the method configuration of your interest from aryee-lab/dna-methylation workspace.

Uploading the files to Terra

FASTQ files and target coverage files can be uploaded to Terra using gsutil (https://pypi.org/project/gsutil/)

Data model

Participant file

Before running the processes, you need to generate participants file and participant_set file. Each line in participant file specify a single sample.

entity:participant_id column specifies the sample name

bs_fastq1 & bs_fastq2 specify the paired fastq files.

If one sample has multiple FASTQ files for different lanes or runs they can be added in the same column with comma separation.

Participant Set file

Each line in participant_set file specify a set of samples that should be aggregated and analyzed together.

membership:participant_set_id specify the name of the participant set.

participant_id column specify the names of the participants in the set.

Both of these files are tab separated text files. Examples of these files are shown in Terra_imports subdirectory.

Alignment and methylation calling

In order to perform alignment and methylation calling choose bismark_rrbs, bismark_wgbs or bismark_hsbs method configuration with appropriate reference genome suffix. As the name indicates bismark_rrbs is for samples that are generated from Reduced Representation Bisulfite Sequencing (RRBS) with Mspl digestion and bismark_wgbs is for data generated from Whole Genome Bisulfite Sequencing (WGBS). bismark_hsbs is for data generated from Hybrid Selection Bisulfite Sequencing (HSBS). These worflows can also combine fastq files from multiple lanes if the samples are sequenced in such a way.

Upload the fastq files to the Google cloud bucket
Upload additional files such as target coverage bed file for HSBS sequencing
In the Terra workspace choose bismark_rrbs, bismark_wgbs or bismark_hsbs method configuration with appropriate reference genome suffix
Change other parameters according to preference
Press Launch Analysis in upper right hand corner
Choose the participants from the list of files
Click Launch

If you are interested in conducting alignment and methylation calling for an entire participant set. Choose the participant set and in the box named Define expression type this.participants

You can observe the status of the job by going to Monitor tab

Alignment and methylation calling specific parameters

r1_fastq & r2_fastq

Paired end FASTQ files

samplename

base name for a sample

genome_index

Reference genome to conduct bisulfite specicfic alignment

n_bp_trim_read1 & n_bp_trim_read1

Number of bases to trim in read 1 and read 2. This is only specified for wgbs and hsbs workflows. rrbs workflow uses the --rrbs option from trimGalore.

chrom_sizes

chrom_sizes file in order to generate the BIGWIG file

Aggregation and Quality Control Analysis

After the alignment and methylation calling each sample will have their methylation information and metadata stored in HDF5 format

In order to aggregate all of them and obtain the quality control report

Choose aggregate_bismark_output method configuration with appropriate reference genome suffix
Change other parameters according to preference
Save it and press Launch Analysis
Since root entity type in aggregation step is participant set, you will choose participant_set with participants of your interest
Finally click Launch

To check the results from any of the work flows, go to Monitor tab, click View in the Status columns and then click the Workflow ID in the bottom of the page.

Aggregation specific parameters

in_pe_reports_files

Report files from the bismark alignment

in_covgz_files

coverage output file

in_mbias_files

M-bias file

BSGenome_targz

BSGenome targz file

BSGenome_package

Name of the BSGenome package

Genome_build

Name of the genome build, e.g. mm10, hg19 etc.

General parameters

cpu

Number of CPUs

disks

Name of the disk (Depends on the disk size)

memory

Memory requirement

multicore

Number of cores to use, especially during alignment

preemptible

Preemptible option is to use preemptible virtual machine, if it is set to 0 the workflow runs uninterrupted. If it is set to any integer other than zero it may be interpreted that many times. However preemtible option reduces the computing cost significantly.

WDL specifications for the workflows

The per-sample preprocessing WDL workflow input parameters consist primarily of input file names and the Bismark genome index to use. A subset of the outputs from this step (Mbias files and methylated and unmethylated read coverage tables) are passed on as inputs to the subsequent aggregation and QC WDL workflow which in turn produces a QC report, aggregated methylation tables and a Bioconductor (bsseq) object. All tasks specify runtime parameters including disk size, CPU and memory requirements, and the docker image version to be executed.

scmeth

Quality control analysis is conducted via the bioconductor package scmeth. QC analysis can be done independently with this package given that you have the appropriate objects.

Development and local execution

Setup

In order to push/pull images from the Google container registry you must first authenticate with Google:

gcloud auth login

You then need to run a one-time setup to configure docker to use Google authentication:

gcloud auth configure-docker

Clone this repository

git clone aryeelab/dna-methylation-tools
cd dna-methylation-tools

If you want to run the examples below, download this small genome index and chrom.sizes file for mm10_chr 19:

cd testdata
wget https://storage.googleapis.com/aryeelab/bismark-index/mm10_chr19/bismark_mm10_chr19.tar.gz
wget https://storage.googleapis.com/aryeelab/chrom.sizes/mm10_chr19.chrom.sizes

Running the WDL workflow in Cromwell

Note: On Mac, cromwell can be installed with the following command

brew install cromwell

Users can also use the following command in order to install cromwell from the docker image in this repo.

docker build Docker/cromwell

Edit the file paths for FASTQ files, genome index, chrom.sizes file and monitoring script in the json file according to your directory paths to run the test sample.

{
  "call_bismark_pool.r1_fastq": "[INSERT ABSOLUTE PATH]/dna-methylation-tools/testdata/small_01_R1.fastq.gz",
  "call_bismark_pool.r2_fastq": "[INSERT ABSOLUTE PATH]/dna-methylation-tools/testdata/small_01_R2.fastq.gz",
  "call_bismark_pool.chrom_sizes": "[INSERT ABSOLUTE PATH]/dna-methylation-tools/mm10.chrom.sizes",
  "call_bismark_pool.monitoring_script": "[INSERT ABSOLUTE PATH]/dna-methylation-tools/monitor.sh",
  "call_bismark_pool.genome_index": "[INSERT ABSOLUTE PATH]/dna-methylation-tools/bismark_mm10_chr19.tar.gz"
    
}

WGBS, Paired-end reads

cromwell run -i ../pipelines/bismark_wgbs/bismark_wgbs.json ../pipelines/bismark_wgbs/bismark_wgbs.wdl

Aggregation workflow

This step is only necessary if you want to combine multiple samples into a single Bioconductor bsseq object.

cromwell run ../pipelines/aggregate_bismark_output/aggregate_bismark_output.wdl -i ../pipelines/aggregate_bismark_output/aggregate_bismark.json

Cost & Time Analysis in Terra

We estimated the time and cost for Whole Genome Bisulfite Sequencing (WGBS) sample from TCGA that contains approximately 290 million reads.

Also following plot shows the time & cost estimates for single cell samples in non-preemptible machine

For 10 sample set: samples were sequenced with a median of 741,214 reads (range of 414,443 to 1,825,672)

For 25 sample set: samples were sequenced with a median of 1,044,720 reads (range of 10,507 to 3,644,360)

Plots below are for same sample set but in preemptible machine

We also conducted similar analysis for the aggregation step

TCGA data

We have preprocessed and made available 47 WGBS samples available from TCGA. The raw (FASTQ) data, processed data and workflows are made available in a Terra workspace (See https://portal.firecloud.org/#workspaces/aryee-merkin/TCGA_WGBS_hg19). Processed data can also be found in bsseq format in tcgaWGBSData.hg19 package (https://bioconductor.org/packages/release/data/experiment/html/tcgaWGBSData.hg19.html).

Questions and Comments

Please use the Github Issues tracker with any issue you face with the platform. Any specific questions or comments, contact me at divyswar01@g.harvard.edu

aryeelab/bisulfite-seq-tools