Last updated: July 31, 2020. Please check our preprint on bioRxiv and our published paper in Bioinformatics.
- Overview
- Biology significance of CloudASM
- Main steps in the CloudASM pipeline
- Bioinformatics packages used in CloudASM
- If you are new to Google Cloud Computing
- Installation
- How to use the pipeline
- Test the pipeline
- Prepare the fastq files to be analyzed
- Re-run failed jobs
CloudASM is a turnkey pipeline designed to call allele-specific CpG methylation in whole methylomes. It is designed to run on Google Cloud Platform (GCP).
This pipeline takes zipped fastq files as input and outputs a table of all single nucleotide polymorphisms (SNPs) with allele-specific methylation. Below, we show an example of the output table:
chr | snp_id | snp_pos | asm_snp | asm_region_inf | asm_region_sup | nb_ref_reads | nb_alt_reads | asm_region_effect | wilcoxon_corr_pvalue | nb_cpg | nb_sig_cpg | nb_pos_sig_cpg | nb_neg_sig_cpg | nb_consec_pos_sig_asm | nb_consec_neg_sig_asm |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | rs1009940 | 9278142 | true | 9278133 | 9278266 | 15 | 15 | -0.494 | 0.00322 | 5 | 4 | 1 | 3 | 0 | 2 |
1 | rs10127939 | 161518333 | true | 161518283 | 161518415 | 24 | 17 | -0.297 | 0.03277 | 6 | 3 | 0 | 3 | 0 | 2 |
1 | rs10157041 | 161397958 | true | 161397867 | 161397946 | 21 | 35 | -0.426 | 0.00275 | 12 | 8 | 0 | 8 | 0 | 6 |
Here are the explanations of the different parameters in this table:

`chr`
: the chromosome number where the SNP is located.

`snp_id`
: the unique identifier of the SNP that was evaluated for ASM.

`snp_pos`
: coordinate of the SNP.

`asm_snp`
: whether the SNP has ASM or not.

`asm_region_inf`
: position of the CpG with significant ASM for the SNP `snp_id` and the smallest coordinate.

`asm_region_sup`
: position of the CpG with significant ASM for the SNP `snp_id` and the largest coordinate.

`nb_ref_reads`
: number of genomic segments that cover the REF allele of the SNP.

`nb_alt_reads`
: number of genomic segments that cover the ALT allele of the SNP.

`asm_region_effect`
: the difference in fractional methylation between the ALT and REF genomic segments.

`wilcoxon_corr_pvalue`
: Wilcoxon p-value of the `asm_region_effect`, corrected for multiple testing (across all SNPs).

`nb_cpg`
: number of CpGs with at least 5x coverage on both the REF and ALT genomic segments.

`nb_sig_cpg`
: number of CpGs with significant ASM.

`nb_pos_sig_cpg`
: number of CpGs with significant ASM where the difference in fractional methylation between ALT and REF is positive.

`nb_neg_sig_cpg`
: number of CpGs with significant ASM where the difference in fractional methylation between ALT and REF is negative.

`nb_consec_pos_sig_asm`
: number of consecutive CpGs with significant ASM where the difference in fractional methylation between ALT and REF is positive.

`nb_consec_neg_sig_asm`
: number of consecutive CpGs with significant ASM where the difference in fractional methylation between ALT and REF is negative.
See below for how these look in an example.
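Because the final table is produced in BigQuery, you can explore it with the `bq` CLI once a run completes. A minimal sketch, assuming a hypothetical dataset `cloudasm` and table `asm_snp` (your names will depend on the GCP parameters you set):

```bash
# List the SNPs flagged with ASM, strongest regional effects first.
# Dataset and table names are placeholders; use the ones from your run.
bq query --use_legacy_sql=false \
  'SELECT chr, snp_id, snp_pos, asm_region_effect, wilcoxon_corr_pvalue
   FROM `cloudasm.asm_snp`
   WHERE asm_snp = true
   ORDER BY ABS(asm_region_effect) DESC
   LIMIT 10'
```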
Our laboratory has long-standing expertise in studying allele-specific methylation. To make sure our pipeline avoids outputting false positives, we have implemented the following steps for stringency:
- we filter out reads where the confidence in the SNP nucleotide is lower than 30 (variable `SNP_SCORE`)
- we remove CpGs from the context file where the C or the G overlaps with a SNP found in the unfiltered list of variants identified by BisSNP
- we do not consider for ASM the SNPs that are not within 500 bp of at least one CpG
As described in the figure above, to catch a "true positive" phenomenon of allele-specific methylation, we use two types of calculation:

- single CpG-level ASM ("CpG ASM"), where we estimate whether there is ASM on a CpG that is covered at least 5x on both alleles. We use a cut-off p-value of 0.05 on an exact Fisher's test.
- ASM over a region delimited by the CpGs that are found on the same reads as the SNP. The region is delimited by two CpGs showing ASM, must contain at least 3 CpGs (variable `CPG_PER_asm_region`), must have at least 2 consecutive CpGs with ASM in the same direction (variable `CONSECUTIVE_CPG`), and must have an effect size of at least 20% between the two alleles (variable `ASM_REGION_EFFECT`). The effect size is the difference in methylation percentage between the two alleles, calculated across all CpGs in the asm_region. We use a cut-off p-value of 0.05 on a Wilcoxon test.

All these variables can be adjusted by the user.
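For instance, the stringency thresholds described above could be set near the top of `master.sh` along these lines (an illustrative sketch, not the script's actual contents; `P_VALUE` is a hypothetical name for the 0.05 cut-off):

```bash
# ASM parameters (illustrative values matching the defaults described above).
SNP_SCORE="30"            # minimum confidence score at the SNP nucleotide
CPG_PER_asm_region="3"    # minimum number of CpGs in an asm_region
CONSECUTIVE_CPG="2"       # minimum consecutive CpGs with ASM in the same direction
ASM_REGION_EFFECT="0.2"   # minimum effect size (20%) between the two alleles
P_VALUE="0.05"            # hypothetical: cut-off for the Fisher and Wilcoxon tests
```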
The pipeline follows these steps:
- Create a bisulfite-converted genome from the reference genome that was chosen
- Unzip fastq files and split them into smaller files ("chards").
- Trim each pair of fastq files
- Align each pair of fastq files
- Merge BAM files per chromosome and prepare them for variant and net methylation calling.
- Net methylation calling
- Re-calibration of BAM files
- Variant calling
- Allele-specific methylation calling
Note that the user can choose the reference genome used to align the bisulfite-converted reads. The script automatically fetches the correct database of variants for the reference genome that is selected.
- bowtie2-2.1.0
- samtools-0.1.19
- bedtools2
- Bismark_v0.19.0
- cutadapt-1.4.2
- TrimGalore-0.4.5
- picard-tools-1.97
- jre1.7.0_25
- BisSNP-0.82.2
All these packages are included in the publicly-available Docker image `gcr.io/hackensack-tyco/wgbs-asm`. Note that we need to use a specific version of Java to be able to run BisSNP.
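If you want to inspect these tools locally, you can pull the image with Docker (assuming Docker is installed; whether the image ships a shell entrypoint is an assumption):

```bash
# Pull the public CloudASM image and open a shell inside it.
docker pull gcr.io/hackensack-tyco/wgbs-asm
docker run --rm -it gcr.io/hackensack-tyco/wgbs-asm /bin/bash
```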
Google Cloud Platform (GCP) is a cloud computing service, just like Amazon Web Services and Microsoft Azure.
To be able to use CloudASM, you need to create an account on https://cloud.google.com/ and set up how you want to be billed (e.g. with a credit card). As of January 2020, GCP offers $300 in credits for opening an account -- which is enough to test this pipeline.
GCP offers a suite of services. CloudASM uses Compute Engine, where virtual machines are created and used for CloudASM's tasks; Cloud Storage, where data is stored before and after being processed by the virtual machines; and BigQuery, where the aligned reads, variants, and methylation states are analyzed jointly to estimate ASM.
Once you open an account on GCP, you need to create a "project" within GCP and choose which geographical "region" and "zone" you want to request resources from. It is recommended to pick a region and zone near your location. Every job you submit within your project will pull resources from the region and zone you chose.
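For example, with the gcloud CLI installed, the project, region, and zone can be set once and reused by every job (the project ID and location below are placeholders):

```bash
# Point the gcloud CLI at your project and a region/zone near you.
# "my-cloudasm-project" and us-central1(-a) are placeholders.
gcloud config set project my-cloudasm-project
gcloud config set compute/region us-central1
gcloud config set compute/zone us-central1-a
```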
Very basically, data is stored in "buckets" in the Cloud Storage module. When CloudASM executes a batch of jobs, virtual machines (also called "instances") are created in the Compute Engine module to execute the jobs. They download the data from the bucket, obtain the relevant script from CloudASM, execute it, and upload the script's output back to the bucket (or to BigQuery). Depending on the job, CloudASM requests instances with 2-16 CPUs, attaches larger disks to these instances (30-400 GB), and executes the commands on one of the Docker images we built or that were already available.
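Under the hood, each such request corresponds to a dsub job submission. A minimal sketch of what one might look like (bucket, script, and project names are placeholders, not the actual commands from `master.sh`):

```bash
# Illustrative dsub job: 16 CPUs, a 400 GB disk, and the CloudASM image.
# All names and paths here are placeholders.
dsub \
  --provider google-v2 \
  --project my-cloudasm-project \
  --regions us-central1 \
  --machine-type n1-standard-16 \
  --disk-size 400 \
  --preemptible \
  --image gcr.io/hackensack-tyco/wgbs-asm \
  --logging gs://my-bucket/logging/ \
  --script align_fastq.sh
```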
When you start running CloudASM on more than one sample, the pipeline manager dsub will queue jobs within a batch if your project does not have enough resources to run them all at once. When you start running CloudASM, you may want to go to GCP's Quotas page and monitor which resource quotas you need to increase.
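You can also check quota limits and current usage from the command line (assuming the gcloud CLI and the region chosen above):

```bash
# Lists, among other things, CPU and disk quotas with their current usage.
gcloud compute regions describe us-central1
```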
To run CloudASM, you need to install GCP's pipeline manager, "dsub". We recommend the installation method where the dsub repository is cloned and used in combination with a Python virtual environment. You can also install dsub via conda, but that package was not created by the dsub authors, so it may not be the latest version of dsub.
The current version of CloudASM was validated with dsub 0.3.7.
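A minimal sketch of that recommended setup, following the instructions in dsub's repository (directory locations are up to you):

```bash
# Create and activate a virtual environment for dsub.
python3 -m venv dsub_libs
source dsub_libs/bin/activate

# Clone the dsub repository and install it into the virtual environment.
git clone https://github.com/DataBiosphere/dsub.git
cd dsub
python setup.py install
```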
- Prepare the zipped fastq files to be analyzed (see below how).
- Clone this repository on your computer.
- Customize the library parameters, the ASM parameters, and the GCP parameters in `master.sh`.
- Launch a virtual environment by typing `python3 -m venv dsub_libs` and then `source dsub_libs/bin/activate` (it does not matter in which directory you are located).
- Copy, sequentially, all instructions from `master.sh` into the terminal, block by block (a "block" is a set of instructions included in between two headlines).
- Before moving on to the next block of instructions, re-run the jobs that failed if you use preemptible machines (the failure rate is about 5-10% when using preemptible machines). See the sketch below for how to do that.
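For example, you can list the failed jobs in a batch with dsub's companion tool dstat (project and job names below are placeholders):

```bash
# List failed jobs from the current batch so they can be resubmitted.
# Project and job names are placeholders.
dstat \
  --provider google-v2 \
  --project my-cloudasm-project \
  --names 'align-fastq' \
  --status 'FAILURE' \
  --full
```

Resubmitting is then a matter of re-running the corresponding dsub block from `master.sh` for the shards that failed.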
We prepared a small dataset (~24MB of zipped fastq files) for anyone to test CloudASM without incurring large costs. The dataset was prepared by recombining bisulfite-converted reads overlapping the region between positions 9,000,000 and 10,000,000 on chromosome 1, using the lymphoblastoid cell line GM12878 (identifier: ENCSR890UQO) made publicly available by the ENCODE consortium.
The zipped fastq files are freely accessible on our GCP’s storage space. Using this dataset, CloudASM assessed 456 SNPs and found 13 ASM regions. All the data generated by CloudASM for this dataset is stored here. The final table of all variants and their assessment for ASM can be downloaded here.
To test the pipeline, you will have to change all GCP parameters except the variable `INPUT_B`. As you will see when running the pipeline, CloudASM creates the following folders for each sample:
`split_fastq`
: fastq files unzipped and split into 120,000-row chards

`trimmed_fastq`
: trimmed pairs of fastq chards from `split_fastq`

`aligned_per_chard`
: BAM files created by aligning the trimmed fastq chards from `trimmed_fastq`

`bam_per_chard_and_chr`
: the BAM files from `aligned_per_chard`, split by chromosome

`bam_per_chr`
: the BAM files from `bam_per_chard_and_chr`, merged per chromosome

`net_methyl`
: context files created by calling net methylation on the BAM files located in `bam_per_chr`

`recal_bam_per_chr`
: recalibrated BAM files from `bam_per_chr`

`variants_per_chr`
: VCF files for each chromosome, created from the BAM files in `recal_bam_per_chr`

`asm`
: tables of CpG-level and region-level ASM

`bedgraph`
: bedgraph files of coverage and methylation percentage across the whole sample

`sam`
: SAM files created from the BAM files in `recal_bam_per_chr`

`snps_for_cpg`
: database of SNPs used to remove CpG sites potentially overlapping with a variant
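Each of these folders sits under your output bucket, so you can inspect intermediate results at any step with gsutil (bucket and sample names below are placeholders):

```bash
# Browse a sample's output folders and fetch the ASM tables locally.
# Bucket and sample names are placeholders.
gsutil ls gs://my-output-bucket/gm12878/
gsutil -m cp -r gs://my-output-bucket/gm12878/asm/ ./asm_results/
```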
When naming your samples, do not use dashes. Only use underscores.
Prepare a sample info file using this model. Download the file as a TSV file into the `run_files` directory located where you cloned the CloudASM repository (`run_files` is automatically created by the `master` script). The sample info file looks like the table below. The first column is the name of the sample, the 2nd column is the path of the zipped fastq file, the 3rd column is the lane of the zipped fastq file, and the 4th column is its read number. The 5th column is created automatically from columns 1, 3, and 4.
sample | bucket_url | lane_id | read_id | file_new_name |
---|---|---|---|---|
gm12878 | gs://encode-wgbs/gm12878/ENCFF113KRQ.fastq.gz | L01 | R2 | gm12878_L01.R2.fastq |
gm12878 | gs://encode-wgbs/gm12878/ENCFF585BXF.fastq.gz | L02 | R1 | gm12878_L02.R1.fastq |
gm12878 | gs://encode-wgbs/gm12878/ENCFF798RSS.fastq.gz | L01 | R1 | gm12878_L01.R1.fastq |
gm12878 | gs://encode-wgbs/gm12878/ENCFF851HAT.fastq.gz | L02 | R2 | gm12878_L02.R2.fastq |
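A quick sanity check you can run on the finished TSV before launching a run (a sketch; `samples.tsv` is a placeholder filename):

```bash
# Verify every row has 5 tab-separated columns and no dashes in sample names.
awk -F'\t' 'NF != 5 { print "bad column count on line " NR }
            $1 ~ /-/ { print "dash in sample name on line " NR }' samples.tsv
```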