Institut Curie - Nextflow ChIP-seq analysis pipeline
The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with containers, making installation trivial and results highly reproducible. The current workflow was originally initiated from the nf-core ChIP-seq pipeline, and was then updated and adapted with additional steps and options.
This pipeline can analyse ChIP-seq data for the detection of transcription factor binding sites and histone modifications (active or repressive marks). It can run with or without input controls, and with or without spike-in data.
- Run quality control of raw sequencing reads (fastqc)
- Align reads on reference genome (BWA/Bowtie2/STAR)
  - If spike-ins are used, mapping on the spike genome is run and ambiguous reads are removed from both BAM files (pysam)
- Sort aligned reads (SAMTools)
- Mark duplicates (Picard)
- Library complexity analysis (Preseq)
- Filtering aligned BAM files (SAMTools & BAMTools)
  - reads mapped to blacklisted regions
  - reads marked as duplicates
  - reads that aren't marked as primary alignments
  - reads that are unmapped
  - reads mapped with a low mapping quality (multiple hits, secondary alignments, etc.)
- Computing Normalized and Relative Strand Cross-correlation (NSC/RSC) (phantompeakqualtools)
- Diverse alignment QCs and bigWig file creation (deepTools)
  - If spike-ins are used, a scaling factor is computed and additional bigWig files are generated (DESeq2)
- Peak calling for sharp and broad peaks (MACS2) and very broad peaks (epic2)
- Feature counting for every sample at gene and transcription start site loci (featureCounts)
- Calculation of Irreproducible Discovery Rate in case of multiple replicates (IDR)
- Peak annotation and QC (HOMER)
- Results summary (MultiQC)
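The flag- and quality-based filters in the list above correspond to standard SAM flag bits and the MAPQ field. The following is a minimal sketch in plain Python of that decision logic, not the pipeline's own implementation (which relies on SAMTools and BAMTools). The function name `keep_read` and the example reads are hypothetical, the default threshold of 10 mirrors the `--mapq` default, and blacklist filtering is left out because it needs genomic coordinates.

```python
# SAM flag bits as defined by the SAM specification.
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100  # "not primary alignment"
FLAG_DUPLICATE = 0x400


def keep_read(flag: int, mapq: int, min_mapq: int = 10, keep_dups: bool = False) -> bool:
    """Return True if a read passes the flag- and MAPQ-based filters.

    Hypothetical helper for illustration only; blacklisted regions are not handled.
    """
    if flag & FLAG_UNMAPPED:
        return False
    if flag & FLAG_SECONDARY:
        return False
    if (flag & FLAG_DUPLICATE) and not keep_dups:
        return False
    return mapq >= min_mapq


# Made-up (flag, mapq) pairs: a clean read, an unmapped read, a duplicate,
# a secondary alignment, and a reverse-strand read with low mapping quality.
reads = [(0, 42), (0x4, 0), (0x400, 42), (0x100, 42), (16, 3)]
kept = [read for read in reads if keep_read(*read)]
print(kept)
```

Passing `keep_dups=True` mimics the effect described for `--keepDups`: duplicates stay marked but are not discarded.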
N E X T F L O W ~ version 20.01.0
======================================================================
Chip-seq v1.0.3
======================================================================
Usage:
nextflow run main.nf --reads '*_R{1,2}.fastq.gz' -profile conda --genomeAnnotationPath '/data/annotations/pipelines' --genome 'hg19'
nextflow run main.nf --samplePlan 'sample_plan.csv' --design 'design.csv' -profile conda --genomeAnnotationPath '/data/annotations/pipelines' --genome 'hg19'
Mandatory arguments:
--reads [file] Path to input data (must be surrounded with quotes)
--samplePlan [file] Path to sample plan file if '--reads' is not specified
--genome [str] Name of genome reference. See the `--genomeAnnotationPath` to define the annotations path
-profile [str] Configuration profile to use. Can use multiple (comma separated)
Inputs:
--design [file] Path to design file for downstream analysis
--singleEnd [bool] Specifies that the input is single end reads. Default: false
--fragmentSize [int] Estimated fragment length used to extend single-end reads. Default: 200
--spike [str] Name of the genome used for spike-in analysis. Default: false
--genomeAnnotationPath [dir] Path to genome annotation folder
Annotation: If not specified in the configuration file or you wish to overwrite any of the references given by the --genome field
--fasta [file] Path to Fasta reference
--spikeFasta [file] Path to Fasta reference for spike-in
--geneBed [file] BED annotation file with gene coordinate.
--gtf [file] GTF annotation file. Used in HOMER peak annotation
--effGenomeSize [int] Effective Genome size
Alignment: If you want to modify default options or wish to overwrite any of the indexes given by the --genome field
--aligner [str] Alignment tool to use ['bwa-mem', 'star', 'bowtie2']. Default: 'bwa-mem'
--saveAlignedIntermediates [bool] Save all intermediates mapping files. Default: false
--starIndex [dir] Index for STAR aligner
--spikeStarIndex [dir] Spike-in Index for STAR aligner
--bwaIndex [file] Index for Bwa-mem aligner
--spikeBwaIndex [file] Spike-in Index for Bwa-mem aligner
--bowtie2Index [file] Index for Bowtie2 aligner
--spikeBowtie2Index [file] Spike-in Index for Bowtie2 aligner
Filtering:
--mapq [int] Minimum mapping quality to consider. Default: 10
--keepDups [bool] Do not remove duplicates after marking. Default: false
--blacklist [file] Path to black list regions (.bed). See the genome.config for details.
--spikePercentFilter [float] Minimum percent of reads aligned to spike-in genome. Default: 0.2
Analysis:
--noReadExtension [bool] Do not extend reads to fragment length. Default: false
--tssSize [int] Distance (upstream/downstream) to transcription start point to consider. Default: 2000
Skip options: All are false by default
--skipFastqc [bool] Skips fastQC
--skipPreseq [bool] Skips preseq QC
--skipPPQT [bool] Skips phantompeakqualtools QC
--skipDeepTools [bool] Skips deeptools QC
--skipPeakcalling [bool] Skips peak calling
--skipPeakanno [bool] Skips peak annotation
--skipIDR [bool] Skips IDR QC
--skipFeatCounts [bool] Skips feature count
--skipMultiQC [bool] Skips MultiQC step
Other options:
--outDir [dir] The output directory where the results will be saved
-w/--work-dir [dir] The temporary directory where intermediate data will be saved
-name [str] Name for the pipeline run. If not specified, Nextflow will automatically generate a random mnemonic.
=======================================================
Available Profiles
-profile test Run the test dataset
-profile conda Build a new conda environment before running the pipeline. Use `--condaCacheDir` to define the conda cache path
-profile multiconda Build a new conda environment per process before running the pipeline. Use `--condaCacheDir` to define the conda cache path
-profile path Use the installation path defined for all tools. Use `--globalPath` to define the installation path
-profile multipath Use the installation paths defined for each tool. Use `--globalPath` to define the installation path
-profile docker Use the Docker images for each process
-profile singularity Use the Singularity images for each process. Use `--singularityPath` to define the installation path
-profile cluster Run the workflow on the cluster, instead of locally
The pipeline can be run on any infrastructure from a list of input files or from a sample plan, as follows.
See the conf/test.conf to set your test dataset.
nextflow run main.nf -profile test,conda
nextflow run main.nf --samplePlan MY_SAMPLE_PLAN --design MY_DESIGN --genome 'hg19' --genomeAnnotationPath ANNOTATION_PATH --outDir MY_OUTPUT_DIR
By default (without any profile), Nextflow will execute the pipeline locally, expecting that all tools are available from your PATH variable.
In addition, we set up a few profiles that should allow you (i) to use containers instead of a local installation, and (ii) to run the pipeline on a cluster instead of on a local architecture. The description of each profile is available in the help message (see above).
Here are a few examples of how to set the profile option.
## Run the pipeline locally, using a global environment where all tools are installed (built by conda for instance)
-profile path --globalPath INSTALLATION_PATH
## Run the pipeline on the cluster, using the Singularity containers
-profile cluster,singularity --singularityPath SINGULARITY_PATH
## Run the pipeline on the cluster, building a new conda environment
-profile cluster,conda --condaCacheDir CONDA_CACHE
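When runs are launched from a script, the profile options above can be combined with the sample plan command programmatically. The sketch below assembles such a command line as an argument list. `build_command` is a hypothetical helper, not part of the pipeline, and the file names and the `results` output directory are placeholders; only options listed in the help message are used.

```python
import shlex


def build_command(sample_plan, design, genome, annotation_path, out_dir,
                  profiles=(), extra=()):
    """Assemble a `nextflow run main.nf` command as a list of arguments.

    Hypothetical helper for illustration; it only concatenates options
    documented in the pipeline help message.
    """
    cmd = [
        "nextflow", "run", "main.nf",
        "--samplePlan", sample_plan,
        "--design", design,
        "--genome", genome,
        "--genomeAnnotationPath", annotation_path,
        "--outDir", out_dir,
    ]
    if profiles:
        # Several profiles are passed as one comma-separated value.
        cmd += ["-profile", ",".join(profiles)]
    cmd += list(extra)
    return cmd


cmd = build_command(
    "sample_plan.csv", "design.csv", "hg19",
    "/data/annotations/pipelines", "results",
    profiles=["cluster", "singularity"],
    extra=["--singularityPath", "SINGULARITY_PATH"],
)
# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine where Nextflow is installed; keeping it as a list avoids shell quoting problems with paths.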
A sample plan is a csv file (comma separated) that lists all samples with their biological IDs. The sample plan is expected to be created as below:
SAMPLE_ID | SAMPLE_NAME | FASTQ_R1 [Path to R1.fastq file] | FASTQ_R2 [For paired end, path to Read 2 fastq]
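As an illustration of this layout, the sketch below parses a small sample plan with Python's `csv` module and performs basic sanity checks. It assumes the file has no header row and that paired-end samples carry both FASTQ columns; the sample IDs, names and paths are made up, and `read_sample_plan` is a hypothetical helper rather than the check the pipeline itself performs.

```python
import csv
import io

# Made-up sample plan content: SAMPLE_ID, SAMPLE_NAME, FASTQ_R1, FASTQ_R2.
SAMPLE_PLAN = """\
S1,H3K4me3_rep1,/data/S1_R1.fastq.gz,/data/S1_R2.fastq.gz
S2,H3K4me3_rep2,/data/S2_R1.fastq.gz,/data/S2_R2.fastq.gz
S3,input,/data/S3_R1.fastq.gz,/data/S3_R2.fastq.gz
"""


def read_sample_plan(handle, single_end=False):
    """Return {sample_id: {"name": ..., "fastq": [...]}} from a sample plan.

    Hypothetical helper; assumes a header-less, comma-separated file.
    """
    needed_fastq = 1 if single_end else 2
    samples = {}
    for line_no, row in enumerate(csv.reader(handle), start=1):
        if not row:
            continue  # skip blank lines
        row = [field.strip() for field in row]
        fastq = [path for path in row[2:4] if path]
        if len(row) < 3 or len(fastq) < needed_fastq:
            raise ValueError(f"line {line_no}: expected ID, name and {needed_fastq} FASTQ path(s)")
        sample_id = row[0]
        if sample_id in samples:
            raise ValueError(f"line {line_no}: duplicated SAMPLE_ID {sample_id!r}")
        samples[sample_id] = {"name": row[1], "fastq": fastq}
    return samples


samples = read_sample_plan(io.StringIO(SAMPLE_PLAN))
for sample_id, info in samples.items():
    print(sample_id, info["name"], len(info["fastq"]))
```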
A design control is a csv file that lists all experimental samples, their IDs, the associated input control (or IgG), the replicate number and the expected peak type. The design control is expected to be created as below:
SAMPLE_ID | CONTROL_ID | SAMPLE_NAME | GROUP | PEAK_TYPE
Both files will be checked by the pipeline and have to be rigorously defined in order to make the pipeline work.
Note that the control is optional, but using one is highly recommended.
If the design file is not specified, the pipeline will run until the alignment, QCs and track generation. The peak calling and the annotation will be skipped.
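Because both files must be consistent with each other, it can help to cross-check them before launching a run. The sketch below verifies that every `SAMPLE_ID` and every non-empty `CONTROL_ID` of a design file refers to an ID known from the sample plan. It assumes the design file starts with a header row using the column names shown above; the example rows, the `GROUP` and `PEAK_TYPE` values and the `check_design` helper are hypothetical, and the checks done by the pipeline itself may differ.

```python
import csv
import io

DESIGN_COLUMNS = ["SAMPLE_ID", "CONTROL_ID", "SAMPLE_NAME", "GROUP", "PEAK_TYPE"]

# Made-up design content; S4 is deliberately absent from the sample plan IDs.
DESIGN = """\
SAMPLE_ID,CONTROL_ID,SAMPLE_NAME,GROUP,PEAK_TYPE
S1,S3,H3K4me3,1,sharp
S2,S3,H3K4me3,2,sharp
S4,,H3K27me3,1,broad
"""


def check_design(handle, known_ids):
    """Return a list of error messages; an empty list means the design is consistent.

    Hypothetical helper; assumes a header row matching DESIGN_COLUMNS.
    """
    reader = csv.DictReader(handle)
    if reader.fieldnames != DESIGN_COLUMNS:
        return [f"unexpected header: {reader.fieldnames}"]
    errors = []
    for line_no, row in enumerate(reader, start=2):
        sample_id = row["SAMPLE_ID"]
        control_id = row["CONTROL_ID"]
        if sample_id not in known_ids:
            errors.append(f"line {line_no}: SAMPLE_ID {sample_id!r} is not in the sample plan")
        # The control is optional, so an empty CONTROL_ID is accepted.
        if control_id and control_id not in known_ids:
            errors.append(f"line {line_no}: CONTROL_ID {control_id!r} is not in the sample plan")
    return errors


errors = check_design(io.StringIO(DESIGN), known_ids={"S1", "S2", "S3"})
for message in errors:
    print(message)
```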
- Installation
- Reference genomes
- Running the pipeline
- Output and how to interpret the results
- Troubleshooting
This pipeline has been written by the bioinformatics platform of the Institut Curie (Valentin Laroche, Nicolas Servant)
For any question, bug or suggestion, please use the issues system or contact the bioinformatics core facility.