This repository contains a pipeline for the analysis of the Transcription Factor ChIP-seq (TF ChIP-seq) data.
Motivation
The aim was to create a standard TF ChIP-seq pipeline that included quality controls, data analysis and visualisation. The pipeline has been developed using Snakemake, a tool to create reproducible and scalable data analyses.
It was primed by Bruce for its Bachelor's project, and then improved by Federico to become a tool than could be applied to any set of ChIP-seq experiments.
Requirements
- Snakemake with Python >= 3.6 (
os
,sys
,re
,pysam
anditertools
modules) - SRA Toolkit
- FastQC
- Trim Galore! and its dependency Cutadapt
- Bowtie2
- MACS2
This illustrative case employs data from Franco HL, et al., 2015. For the sake of time, it focuses only on FoxA1 Vehicle ChIP-seq and its control samples without considering any of the treatments.
The Sequence Read Archives (SRA) files can be retrieved from GEO: GSE59530.
Since it is a bit tricky to provide remote files as input in Snakemake, the original data is downloaded externally using curl
and the SraAccList.txt
file present in the main folder.
# Create the output folder
mkdir -p data
# Cycle through the identifiers in 'SraAccList.txt' and download the corresponding file
for sra in `awk '{print $1}' SraAccList.txt`; do
curl "ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/${sra:0:6}/$sra/$sra.sra" --output data/$sra
done
Once the download is finished, open the configuration file ('config.yaml') in the main folder and set the correct paths. If they do not exists, some of the paths (i.e., results and bowtie2) will be generated during the first run.
Finally, you can type the following command to execute the whole pipeline:
snakemake
In case you would like to perform only the alignment to the genome (no peak calling), you can run:
snakemake map
[Optional] Directed acyclic graph (DAG)
The DAG, which shows the computational steps that each sample is going through, can be generated using:
snakemake dag
or you can generate it for the whole workflow with:
snakemake dag_complete
Figure 1: Schematic of the pipeline workflow.
- Quality control on the raw data
- Performed with FastQC
- Trimming of the adapter sequence
- Performed with Trim Galore!
- Alignment to the genome
- Performed with Bowtie2
- Calculation of the non-redundant fraction (NRF) metric
- Generation of bigWig for visualisation (NOT YET!)
- Performed using the ENCODE binaries
- Peak calling
- Performed using Macs2
- Visualisation of the results (Bruce forgot to commit this one...)
FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.
Trim Galore! is a wrapper script to automate quality and adapter trimming as well as quality control, with some added functionality to remove biased methylation positions for RRBS sequence files (for directional, non-directional (or paired-end) sequencing).
Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences.
This custom script calculates Non-Redundant Fraction (NRF = Unique start positions of uniquely mappable reads / Uniquely mappable reads).
Strand cross-correlation is computed as the Pearson correlation between the positive and the negative strand profiles at different strand shift distances, k.
MACS captures the influence of genome complexity to evaluate the significance of enriched ChIP regions, and MACS improves the spatial resolution of binding sites through combining the information of both sequencing tag position and orientation. MACS can be easily used for ChIP-Seq data alone, or with control sample with the increase of specificity.
NB: Presently, the algorithm performs the peak calling using a control (must have 'Control' in the sample name) and treatment (everything else).
Calculate the Irreproducible Discovery Rate (IDR).
- Cluster support (will be done very soon!)
- BigWig coverage files
- Matching function for conditions/controls
- Paired-end reads support