An analysis pipeline for single-end ChIP-seq data with the following major steps:
- 3' adapter and quality trimming with cutadapt
- alignment with bowtie2
- summaries of quality statistics from FastQC
- summaries of library processing statistics
- fragment size estimation and peak calling with MACS2
- generation of coverage tracks representing raw data, fragment midpoints, and estimated fragment protection
- library size and spike-in normalization of coverage
- genome-wide scatterplots and correlations
- data visualization (heatmaps and metagenes, with the option to separate data into clusters of similar signal)
Required software:
- Unix-like operating system (tested on CentOS 7.2.1511)
- Git
- conda
- build-annotations pipeline
Required files:
- Unpaired FASTQ files of ChIP-seq libraries. FASTQ files should be demultiplexed, with 5' inline barcodes trimmed. To demultiplex unpaired FASTQ files with 5' inline barcodes, use the separate 'demultiplex-single-end' pipeline. This pipeline has only been tested using Illumina sequencing data.
- FASTA files:
  - the 'experimental' genome
  - if any samples have spike-ins: the spike-in genome
- BED6 format annotation files:
  - optional: annotations for data visualization (i.e. heatmaps and metagenes)
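Because the annotation files must be in BED6 format (six tab-separated columns: chromosome, start, end, name, score, strand), it can help to sanity-check a file before pointing the pipeline at it. The following is a minimal sketch, not part of the pipeline: `check_bed6` is a hypothetical helper name, and the checks (six fields, integer coordinates with start not exceeding end, strand of `+`, `-`, or `.`) are an assumption about what is worth validating rather than the pipeline's own requirements.

```shell
#!/usr/bin/env sh
# check_bed6 (hypothetical helper): exit 0 if every record in the file
# given as $1 looks like a BED6 line, otherwise report problems and exit 1.
check_bed6() {
    awk -F'\t' '
        # skip blank lines and comment lines
        /^#/ || /^$/ { next }
        # BED6 records have exactly six tab-separated fields
        NF != 6 { printf "line %d: expected 6 fields, found %d\n", NR, NF; bad = 1; next }
        # start and end must be non-negative integers, and start must not exceed end
        ($2 !~ /^[0-9]+$/) || ($3 !~ /^[0-9]+$/) || (($2 + 0) > ($3 + 0)) { printf "line %d: invalid coordinates %s-%s\n", NR, $2, $3; bad = 1; next }
        # strand must be +, -, or .
        $6 !~ /^[-+.]$/ { printf "line %d: invalid strand %s\n", NR, $6; bad = 1 }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

For example, `check_bed6 my_annotation.bed && echo "looks like BED6"` (where `my_annotation.bed` is a placeholder file name) prints the message only if no problems were found.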
Instructions:
0. If you need to demultiplex and trim your FASTQ files, use the separate 'demultiplex-single-end' pipeline to do so.
1. If you haven't already done so, clone the separate 'build-annotations' pipeline, make a copy of the config_template.yaml file called config.yaml, and edit config.yaml as needed so that it points to the experimental genome FASTA file.
# clone the repository
git clone https://github.com/winston-lab/build-annotations.git
# move into the build-annotations pipeline directory
cd build-annotations
# make a copy of the configuration template file
cp config_template.yaml config.yaml
# edit the configuration file
vim config.yaml # or use your favorite editor
2. Clone this repository.
git clone https://github.com/winston-lab/chip-seq.git
3. Create and activate the snakemake_default virtual environment for the pipeline using conda. The virtual environment creation can take a while. If you've already created the snakemake_default environment from another one of my pipelines, this is the same environment, so you can skip creating the environment and just activate it.
# navigate into the pipeline directory
cd chip-seq
# create the snakemake_default environment
conda env create -v -f envs/snakemake_default.yaml
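# activate the environment
# (with conda >= 4.4, 'conda activate snakemake_default' is the preferred form)
source activate snakemake_default
# to deactivate the environment
# source deactivate # or 'conda deactivate' with conda >= 4.4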
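Since the snakemake_default environment may already exist from another pipeline, the creation step can be made conditional. The sketch below shows one way to do that, under the assumption that the environment name appears in the first column of `conda env list` output. `env_listed` and `ensure_env` are hypothetical helper names, not part of the pipeline.

```shell
#!/usr/bin/env sh
# env_listed (hypothetical helper): reads `conda env list` output on stdin
# and succeeds if the name given as $1 appears in the first column.
env_listed() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit (found ? 0 : 1) }'
}

# ensure_env (hypothetical helper): create snakemake_default only if it is
# missing. Run this from the chip-seq directory so the YAML path resolves.
ensure_env() {
    if conda env list | env_listed snakemake_default; then
        echo "snakemake_default already exists; skipping creation"
    else
        conda env create -v -f envs/snakemake_default.yaml
    fi
}
```

Calling `ensure_env` and then activating the environment as shown above gives the same end state as step 3, without re-creating an environment that is already present.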
4. Make a copy of the configuration file template config_template.yaml called config.yaml, and edit config.yaml to suit your needs.
# make a copy of the configuration template file
cp config_template.yaml config.yaml
# edit the configuration file
vim config.yaml # or use your favorite editor
5. With the snakemake_default environment activated, do a dry run of the pipeline to see what files will be created.
snakemake -p --use-conda --dryrun
6. If running the pipeline on a local machine, you can run the pipeline using the above command, omitting the --dryrun flag. You can also use N cores by specifying the --cores N flag. The first time the pipeline is run, conda will create separate virtual environments for some of the jobs to operate in. Running the pipeline on a local machine can take a long time, especially for many samples, so it's recommended to use an HPC cluster if possible. On the HMS O2 cluster, which uses the SLURM job scheduler, entering sbatch slurm_submit.sh will submit the pipeline as a single job which spawns individual subjobs as necessary. This can be adapted to other job schedulers and clusters by modifying three files: slurm_submit.sh, which submits the pipeline to the cluster; slurm_status.sh, which handles detection of job status on the cluster; and cluster.yaml, which specifies the resource requests for each type of job.
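The dry-run and local-run invocations in steps 5 and 6 differ only in their final flag. As a rough illustration, the sketch below assembles both command lines from the flags named above. `snakemake_cmd` is a hypothetical helper that only prints the command, and the default of one core is an arbitrary choice for the example.

```shell
#!/usr/bin/env sh
# snakemake_cmd (hypothetical helper): print, but do not run, the snakemake
# command line for a dry run or for a local run on a given number of cores.
# Usage: snakemake_cmd dryrun
#        snakemake_cmd run [cores]
snakemake_cmd() {
    case "$1" in
        dryrun) echo "snakemake -p --use-conda --dryrun" ;;
        run)    echo "snakemake -p --use-conda --cores ${2:-1}" ;;
        *)      echo "usage: snakemake_cmd dryrun|run [cores]" >&2; return 1 ;;
    esac
}
```

For instance, `snakemake_cmd run 4` prints `snakemake -p --use-conda --cores 4`, which can then be entered with the snakemake_default environment activated.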