ChIP-seq analysis pipeline

Primary language: R. License: GNU General Public License v3.0 (GPL-3.0).

description

An analysis pipeline for single-end ChIP-seq data with the following major steps:

  • 3' adapter and quality trimming with cutadapt
  • alignment with bowtie2
  • summaries of quality statistics from FastQC
  • summaries of library processing statistics
  • fragment size estimation and peak calling with MACS2
  • generation of coverage tracks representing raw data, fragment midpoints, and estimated fragment protection
  • library size and spike-in normalization of coverage
  • genome-wide scatterplots and correlations
  • data visualization (heatmaps and metagenes, with the option to separate data into clusters of similar signal)

requirements

required software

required files

  • Unpaired FASTQ files of ChIP-seq libraries. FASTQ files should be demultiplexed, with 5' inline barcodes trimmed. A separate pipeline for demultiplexing unpaired FASTQ files with 5' inline barcodes, 'demultiplex-single-end', is available (see step 0 of the instructions). This pipeline has only been tested using Illumina sequencing data.

  • FASTA files:

    • the 'experimental' genome
    • if any samples have spike-ins:
      • the spike-in genome
  • BED6 format annotation files:

    • optional: annotations for data visualization (i.e. heatmaps and metagenes)
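
Before pointing the pipeline at an annotation file, it can help to confirm the file really is BED6, i.e. six tab-separated columns: chromosome, start, end, name, score, strand. The following is a minimal sketch of such a check using awk. The example coordinates and feature names are made up for illustration, and the check is not part of the pipeline itself.

```shell
# write a two-line example annotation (hypothetical coordinates and names)
bed=$(mktemp)
printf 'chrI\t1000\t2000\tgeneA\t0\t+\n' > "$bed"
printf 'chrII\t500\t1500\tgeneB\t0\t-\n' >> "$bed"

# count lines that do not look like BED6:
# six tab-separated fields, integer start not greater than integer end,
# and a strand that is one of "+", "-", or "."
bad=$(awk -F'\t' '
    NF != 6 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2+0 > $3+0 || $6 !~ /^[+.-]$/ { n++ }
    END { print n+0 }
' "$bed")

echo "non-BED6 lines: ${bad}"
rm -f "$bed"
```

To check a real annotation, replace the temporary file with the path to your own BED file. A count of 0 means every line passed.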

instructions

0. If you need to demultiplex and trim your FASTQ files, use the separate 'demultiplex-single-end' pipeline to do so.

1. If you haven't already done so, clone the separate 'build-annotations' pipeline, make a copy of the config_template.yaml file called config.yaml, and edit config.yaml as needed so that it points to the experimental genome FASTA file.

# clone the repository
git clone https://github.com/winston-lab/build-annotations.git

# move into the build-annotations pipeline directory
cd build-annotations

# make a copy of the configuration template file
cp config_template.yaml config.yaml

# edit the configuration file
vim config.yaml         # or use your favorite editor

2. Clone this repository.

git clone https://github.com/winston-lab/chip-seq.git

3. Create and activate the snakemake_default virtual environment for the pipeline using conda. The virtual environment creation can take a while. If you've already created the snakemake_default environment from another one of my pipelines, this is the same environment, so you can skip creating the environment and just activate it.

# navigate into the pipeline directory
cd chip-seq

# create the snakemake_default environment
conda env create -v -f envs/snakemake_default.yaml

# activate the environment
# (with conda 4.4 or later, 'conda activate snakemake_default' also works)
source activate snakemake_default

# to deactivate the environment
# source deactivate
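
One way to confirm that the environment is active before moving on is to inspect the CONDA_DEFAULT_ENV variable, which conda sets to the name of the active environment. This is a small optional sketch, not part of the pipeline. The variable name active_env is just an illustration.

```shell
# CONDA_DEFAULT_ENV is set by conda; fall back to "none" if it is unset
active_env="${CONDA_DEFAULT_ENV:-none}"

if [ "$active_env" = "snakemake_default" ]; then
    echo "snakemake_default is active"
else
    echo "active environment is '${active_env}', expected snakemake_default" >&2
fi
```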

4. Make a copy of the configuration file template config_template.yaml called config.yaml, and edit config.yaml to suit your needs.

# make a copy of the configuration template file
cp config_template.yaml config.yaml

# edit the configuration file
vim config.yaml    # or use your favorite editor

5. With the snakemake_default environment activated, do a dry run of the pipeline to see what files will be created.

snakemake -p --use-conda --dryrun

6. If running the pipeline on a local machine, you can run the pipeline using the above command, omitting the --dryrun flag. You can also use N cores by specifying the --cores N flag. The first time the pipeline is run, conda will create separate virtual environments for some of the jobs to operate in.

Running the pipeline on a local machine can take a long time, especially for many samples, so it's recommended to use an HPC cluster if possible. On the HMS O2 cluster, which uses the SLURM job scheduler, entering sbatch slurm_submit.sh will submit the pipeline as a single job which spawns individual subjobs as necessary. This can be adapted to other job schedulers and clusters by editing the following files:

  • slurm_submit.sh, which submits the pipeline to the cluster
  • slurm_status.sh, which handles detection of job status on the cluster
  • cluster.yaml, which specifies the resource requests for each type of job
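
For the local run described in step 6, the dry run and the real run can be chained so that the real run only starts if the dry run succeeds. This is a sketch under a few assumptions: the value of CORES is arbitrary, and the guard around snakemake is an addition for convenience rather than something the pipeline requires.

```shell
# number of cores to use (assumed value; adjust for your machine)
CORES=4

if ! command -v snakemake >/dev/null 2>&1; then
    # snakemake is provided by the snakemake_default environment
    echo "snakemake not found; activate snakemake_default first" >&2
elif snakemake -p --use-conda --dryrun; then
    # same command as the dry run, minus --dryrun, plus --cores N
    snakemake -p --use-conda --cores "$CORES"
else
    echo "dry run failed; fix the reported problem before running" >&2
fi
```

Run this from inside the chip-seq directory, after config.yaml has been edited.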