full-chipseq

A snakemake pipeline to process single end human ChIP-seq data on a linux machine.

This pipeline will allow you to do the following:

Quality control of your samples
Alignment and peak calling
Generate a hub to visualize your samples on the UCSC genome browser
Fonctionnal annotation of your peaks
Motif Finding TODO
Super enhancer TODO

This pipeline is based on the pyflow-ChIP-seq from the talented Ming Tang. I repackaged most of the steps, added some, deleted some, created new conda environments and packaged all QC outputs in multiQC.

This pipeline should be easy to install with snakemake and conda present on the system. At the moment it support running on a sge cluster.

Installation

Prerequisite

Anaconda installed
Snakemake installed
Please install mamba using "conda install -c conda-forge mamba" as it is way faster to install the snakemake environment. Mamba is used in the install_dependencies.sh script

Steps

On a linux terminal, please run the following commands :

git clone https://gitlab.univ-nantes.fr/foucal-a/full-chipseq.git
cd full-chipseq
./install_dependencies

Those commands will clone the git directory, move into the folder and install a few things :

UCSC executables like bedTbigBed
Conda environments
Genomic data (for homo sapiens only at the moment, with hg19)

Preparing to launch the pipeline

To launch the pipeline you will need :

One json file that details where the samples can be found and what are the marks. An example can be found in "samples_from_meta.json" This file can be created manually or you can use a meta file to be used with the sample2json script to create your json file this way : python sample2json.py --meta meta.txt --fastq_dir directory

where directory contains all the fastq.gz needed

One config files detailling all the different options and data needed to launch the pipe. One example is provided as config_example.yaml in the config directory

Launching the pipeline

snakemake --snakefile snakefile.smk --use-conda --cluster "sleep 1s && qsub -q max-7d.q" --jobs 50 --jobscript job.sh --configfile config.yaml

Inspecting the output

Delving into the pipeline : details of the rules

Alignement and filtering steps

rule merge_fastqs: Merging of fastq files together into unzipped fastq. To save disk space, the fastq output is temporary.
rule align: Alignement using bowtie2, with bowtie index to be provided in the parameters. Then samblaster is used to mark duplicates, followed by removing of patches and alternate loci. To save disk space, the bam output is temporary. This steps has been added to have an unfiltered bam file to compute some QC metrics.
rule filter_alignment: Alignment filtering removing blacklist regions, reads below the mapping quality threshold specified in the parameter file and duplicate removal, followed by sorting. Blacklist regions are downloaded with the install_dependencies scripts from their github repo. Thread on why it's still best practise to remove blacklist regions : https://www.biostars.org/p/184537/ and https://www.biostars.org/p/361297/ . Filtering is done with -F 1804 to get uniquely mapped reads, PCR duplicate removed and QC passed.
rule create_bam_json: simple rule to create a json file integrating the filtered bam. When starting the pipeline with the bam option = True and this json as a sample_json it avoids having to realign the fastqs.
rule index_bam: Indexing filtered bams with samtools index
rule flagstat_bam: Flagstat is used to get the read numbers for each samples
rule down_sample: filtered bams are downsampled to the number of specified reads in the parameters files. You can put an arbitrary large number to avoid downsampling (i.e. 200000000). The rationale for downsampling is from the ENCODE Roadmap paper. The blog from Ming Tang also highlights some methods and excerpt from the paper. This is probably not usefull unless you have replicates since you prolly won't reach the recommanded 30 millions reads. The pipeline does not support merging replicates at the moment.

QC steps

The metrics are described in more details in the QC part of the documentation

rule encode_complexity: This will compute the NRF, PCB1 and PBC2 using the beds transformed from the raw bams with bedtools bamtobed. It then uses an awk command from the ENCODE_DCC ATAC SEQ pipeline to compute the metrics.
rule fastqc: fastqc tools on the merged fastqc files.
rule phantom_peak_qual: phantompeakqualtools is used to compute its metrics and the fragment length for MACS2 peak calling.
rule computeMatrix_QC: This comes from the deepTools suite of tools. A necessary step to compute plotProfile and plotHeatmap. The mode used here is reference point, using the Transcription Start Site (TSS) from genes downloaded from the ensemble annotation. Youc an use anything you want here, just put in the the GENOME_TSS part of the config file.
rule plotHeatmap: plotHeatmap from deepTools, centered around the GENOME_TSS points, for each mark or TF. The profile on top of the heatmap has a scale problem at the moment.
rule plotProfile: deepTools plotProfile. You'll find everything together in the multiQC report.
rule plotFingerPrint: deepTools plotFingerPrint. This gives you all the metrics for the chromosome 1. There is one plot per sample with all its marks or TF, but you'll find everything together in the multiQC report.
rule all_multiBigwigSummary and all_plotCorrelation: Using the multibigwigsummary of deeptools for chr1 to compute the correlation between of read counts between all samples.
rule get_FRiP_for_multiqc: Fraction of reads in peak using the featureCounts module of subRead. This will use the appropriate peaks (narrow or broad) and the filtered bams
rule get_broad_peak_counts_for_multiqc: A simple extraction of broad peak counts with a python script to create custome data for MultiQC.
rule get_narrow_peak_counts_for_multiqc: simple extraction of narrow peak counts with a python script to create custome data for MultiQC.

Peak calling

Details of the overall peak calling strategy is in the Peak calling strategies

rule call_narrow_peaks_macs2: MACS2 narrow peak calling p-value from the config file macs2_pvalue.
rule call_broad_peaks_macs2: MACS2 broad peak calling with p-value from the config file macs2_pvalue and the broad peak cutoff from the config file macs2_pvalue_broad_cutoff.

Visualization

Those steps will ultimately create a HUB that can be used on the UCSC genome browser

rule get_bigwigs_using_inputs: Sample vs INput bigWig file using bamCompare. This will create a bigwig signal file in log2 fold change of sample vs input. RPKM is used as a scaling method. binSize is 10 and smoothLentgh is 30. SES scaling might be more appropriate.
rule get_bigwigs: Single sample bigWig signal file. RPKM is used as a scaling method. binSize is 10 and smoothLentgh is 30.
rule get_bigbeds: Uses MACS2 peak file to create bigbeds file with UCSC scripts bedToBigBed.
rule get_UCSC_hub: Create a HUB with all peaks, sample vs input bigwig. Uses a python package from Ryan Dale.

MultiQC

rule multiQC_config: Copying the multiQC config file in the config output folder.
rule multiQC: Using phantompeakqualtools, deepTools, featureCOunts, custome peaks counting, ENCODE, fastqc and BOWTIE logs to create the multiQC report. If BAM im=nput is true some rules wont be executed : FASTQC BOWTIE and ENCODE.

Annotation

rule homer_annotate: Annotate the peak files with HOMER. It uses the annotation file in the config file genome_gtf and genome_fasta.

ChromHMM

This will only be laucnhed if the ChromHMM flag in the config fil is set to True

rule bam2bed: Convert filtered bam into bed files using bedtools bamtobed.
rule make_table: Create the cellmarkfiletable.txt necessary for ChromHMM using some small python commands and the dictionaries of samples.
rule chromHMM_binarize: Binarize steps of ChromHMM. This will use the binsize in the config file, a fixed 32G of memory for Java and the list of chromosomes without patches and alternate loci.
rule chromHMM_learn: LearnModel steps of ChromHMM. This will use the binsize in the config file, a fixed 32G of memory for Java and the list of chromosomes without patches and alternate loci.

Peak calling strategies

The tool of choice here is MACS. Parameters are controlled by the p-value and for broad calls, the broad cutoff threshold. MACS2 ouputs will give you excel ouputs that will allow to select on the q value if needed. For the moment, there is only individual calls

QC

A slew of variables are produced by the pipeline to help you assess the quality of your samples. They are summarized in the multiQC report.

ENCODE metrics

They are summarized here The ENCODE consortium published guidelines in the following paper. This is a must read for anyone processing ChIP-Seq data. Here I quickly recapitulate ENCODE metrics:

Non-Redundant Fraction (NRF) :

$NRF=\frac{\text{Nunique}}{\text{Ntotal}}$

where:
- Nunique: Number of distinct uniquely mapping reads (i.e. after removing duplicates)
- Ntotal: Total number of reads
PCR Bottlenecking Coefficient 1 (PBC1):

$PBC1=\frac{M1}{MDISTINCT}$

where:
- M1: the number of genomic locations where exactly one read maps uniquely
- MDISTINCT: the number of distinct genomic locations to which some read maps uniquely
PCR Bottlenecking Coefficient 1 (PBC2):

$PBC2=\frac{M1}{M2}$

where:
- M1: number of genomic locations where only one read maps uniquely
- M2: number of genomic locations where two reads map uniquely

Those are all computed by an awk command on bam file transformed into a bed file by the UCSC utility bamtobed

phantompeakqualtools metrics

From ENCODE and the phantompeakqualtool software

Normalized Strand Cross-correlation coefficient (NSC): A measure of enrichment derived without dependence on prior determination of enriched regions. Forward and reverse strand read coverage signal tracks are computed (number of unique mapping read starts at each base in the genome on the + and - strand counted separately). The forward and reverse tracks are shifted towards and away from each other by incremental distances and for each shift, the Pearson correlation coefficient is computed. In this way, a cross-correlation profile is computed, representing the correlation between forward and reverse strand coverage at different shifts. The highest cross-correlation value is obtained at a strand shift equal to the predominant fragment length in the dataset as a result of clustering/enrichment of relative fixed-size fragments around the binding sites of the target factor or feature. The NSC is the ratio of the maximal cross-correlation value (which occurs at strand shift equal to fragment length) divided by the background cross-correlation (minimum cross-correlation value over all possible strand shifts). Higher values indicate more enrichment, values less than 1.1 are relatively low NSC scores, and the minimum possible value is 1 (no enrichment). This score is sensitive to technical effects; for example, high-quality antibodies such as H3K4me3 and CTCF score well for all cell types and ENCODE production groups, and variation in enrichment in particular IPs is detected as stochastic variation. This score is also sensitive to biological effects; narrow marks score higher than broad marks (H3K4me3 vs H3K36me3, H3K27me3) for all cell types and ENCODE production groups, and features present in some individual cells, but not others, in a population are expected to have lower scores.
Relative Strand Cross-correlation coefficient (RSC): A measure of enrichment derived without dependence on prior determination of enriched regions. Forward and reverse strand read coverage signal tracks are computed (number of unique mapping read starts at each base in the genome on the + and - strand counted separately). The forward and reverse tracks are shifted towards and away from each other by incremental distances and for each shift, the Pearson correlation coefficient is computed. In this way, a cross-correlation profile is computed representing the correlation values between forward and reverse strand coverage at different shifts. The highest cross-correlation value is obtained at a strand shift equal to the predominant fragment length in the dataset as a result of clustering/enrichment of relative fixed-size fragments around the binding sites of the target factor. For short-read datasets (< 100 bp reads) and large genomes with a significant number of non-uniquely mappable positions (e.g., human and mouse), a cross-correlation phantom-peak is also observed at a strand-shift equal to the read length. This read-length peak is an effect of the variable and dispersed mappability of positions across the genome. For a significantly enriched dataset, the fragment length cross-correlation peak (representing clustering of fragments around target sites) should be larger than the mappability-based read-length peak. The RSC is the ratio of the fragment-length cross-correlation value minus the background cross-correlation value, divided by the phantom-peak cross-correlation value minus the background cross-correlation value. The minimum possible value is 0 (no signal), highly enriched experiments have values greater than 1, and values much less than 1 may indicate low quality.

deepTools

deepTools is a suite of tools to process biological data and is especially useful for Chip-Seq. Numerous plots and metrics are computed to give you an idea on the quality of your data.

Transcription Start Site (TSS) Enrichment plots : Give you an indication of enrichment around TSS for your marks or TF. Produced by deeptools using TSS start sites that are specified in the config file. You can provide your own data from the table browsing from ucsc or use the data from the pipeline. The Profile plot is included in the multiQC report.
DeepTools fingerprint metrics : This helps you evaluate the focal enrichment of your mark or TF. It's explained in more details in plotFingerprint.

Metric using peaks called by MACS

Fraction of reads in peaks (FRiP) : Computed using subCount Feature Count on the filtered bams and the appropriate MACS peaks
Counts of broadpeaks from MACS2
Counts of narrow peaks from MACS1

Adrien-Evo/full-atac