ChIP-Flow

Nextflow pipeline for ChIP-seq analysis.


Nextflow Implementation of the Dowell Lab ChIP-seq Pipeline

For internal Dowell Lab use.

Usage

Download and Installation

Clone this repository in your home directory:

$ git clone https://github.com/Dowell-Lab/ChIP-Flow.git

Install Nextflow:

$ module load curl/7.49.1 (or set path to curl executable if installed locally)
$ curl -s https://get.nextflow.io | bash
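
The installer drops a nextflow launcher script into the directory you run it from, so assuming you ran the command from your home directory as above, you can verify the install with:

    $ ~/nextflow -version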

Slurm-Specific Usage Requirements

Primary Run Settings

If you are using Linux, this installs the nextflow executable to your home directory. To run Nextflow, you will therefore need to add your home directory to your PATH. Prepending it as shown below preserves the rest of your PATH, so you can still access other paths (e.g. those added when you load modules) on your cluster without conflict:

$ export PATH=~:$PATH
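
If you want this to persist across login sessions, you can append the same line to your shell startup file (assuming bash; adjust for your shell):

    $ echo 'export PATH=~:$PATH' >> ~/.bashrc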

First and foremost, edit conf/slurm_grch38.config to ensure the proper paths and email address are set (look for all mentions of COMPLETE_*). The variable names should be self-explanatory.
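
As a rough illustration only (the setting names below are hypothetical; the real ones are whatever conf/slurm_grch38.config flags with COMPLETE_*), the edits amount to filling in values such as:

    // Hypothetical sketch of the kind of values conf/slurm_grch38.config expects;
    // the actual parameter names in the shipped config may differ.
    params {
        email = "john.doe@themailplace.com"   // workflow report email
        // reference genome / index paths flagged with COMPLETE_* go here as well
    }

    process {
        executor = 'slurm'                    // submit tasks through Slurm
    }

Once the configuration is filled in, launch the pipeline: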

    $ nextflow run main.nf  -profile slurm_grch38 --workdir '</nextflow/work/temp/>' --outdir '</my/project/>' --email <john.doe@themailplace.com> --sras '</dir/to/sras/*>'
    

Directory paths for sras/fastqs must be enclosed in quotes. Note the name of the configuration file specified by '-profile': it is generally a good idea to keep separate configuration files for different reference genomes and different organisms. The pipeline runs paired-end by default; to run single-end data, you must add the --singleEnd argument.
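
For example, a single-end run on gzipped fastq files might look like this (all paths are placeholders):

    $ nextflow run main.nf -profile slurm_grch38 --workdir '</nextflow/work/temp/>' --outdir '</my/project/>' --email <john.doe@themailplace.com> --singleEnd --fastqs '</project/*.fastq.gz>'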

If anything goes wrong, you don't need to restart the pipeline from scratch. Instead, resume from the last successfully completed step:

$ nextflow run main.nf  -profile slurm_grch38 -resume

To see a full list of options and pipeline version, enter:

$ nextflow run main.nf -profile fiji --help

Parallel-fastq-dump Installation

As of version 0.4, we have implemented a wrapper for fastq-dump for multi-threading in place of fasterq-dump due to memory leak issues. This, however, requires parallel-fastq-dump to be installed in your user home. You can do so by running:

$ pip3 install parallel-fastq-dump --user

This will also check for the sra-tools requirement, so if you do not want sra-tools installed to your user directory, that dependency must already be available on your path (e.g. module load sra/2.9.2).

Multi-threaded fastq-dump is optional; by default the pipeline runs fastq-dump on a single core. To run multi-threading on 8 cores, specify --threadfqdump as a nextflow run argument.
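
For example (all paths are placeholders):

    $ nextflow run main.nf -profile slurm_grch38 --threadfqdump --workdir '</nextflow/work/temp/>' --outdir '</my/project/>' --email <john.doe@themailplace.com> --sras '</dir/to/sras/*>'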

Software Requirements

Python3, RSeQC, preseq, Picard Tools, BEDTools, Samtools, HISAT2, BBMap Suite, MultiQC, SRA Tools, IGV Tools
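
On an HPC cluster these are typically provided as environment modules. The module names below are hypothetical (names and available versions vary by system, so check yours with module avail):

    $ module load python samtools hisat2 bedtools preseq picard bbmap sra igvtools

RSeQC and MultiQC are Python packages and can also be installed per-user with pip3 (pip3 install --user RSeQC multiqc) if your cluster does not provide them as modules.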

Running Nextflow Using an sbatch script

The best way to run Nextflow is with an sbatch script that wraps the same command specified above. Short of that, it is advisable to at least execute the workflow in a screen session, so you can log out of the cluster and still check progress and any errors in standard output. Nextflow keeps logs of every task in any case, should you lose access to the console. The memory requirements do not exceed 8GB, so you do not need to request more RAM than this. Note that SRAs must be downloaded prior to running the pipeline.
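
A minimal sbatch script sketch (job name, walltime, log path, and data paths are placeholders; adjust the resources and any module setup to your cluster):

    #!/bin/bash
    #SBATCH --job-name=chip-flow          # placeholder job name
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --mem=8gb                     # 8GB is sufficient for the Nextflow run itself
    #SBATCH --time=24:00:00               # placeholder walltime; adjust to your dataset
    #SBATCH --output=chip-flow_%j.out     # placeholder log path

    # Make the nextflow launcher in your home directory visible
    export PATH=~:$PATH

    nextflow run main.nf -profile slurm_grch38 \
        --workdir '</nextflow/work/temp/>' \
        --outdir '</my/project/>' \
        --email <john.doe@themailplace.com> \
        --sras '</dir/to/sras/*>'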

Arguments

Required Arguments

| Argument | Usage | Description |
|----------|-------|-------------|
| -profile | <base,fiji> | Configuration profile to use. |
| --fastqs | </project/*_{1,2}*.fastq.gz> | Directory pattern for fastq files (gzipped). |
| --sras | </project/*.sra> | Directory pattern for sra files. |
| --workdir | </project/tmp/> | Nextflow working directory where all intermediate files are saved. |
| --email | <EMAIL> | Where to send the workflow report email. |

Save Options

| Argument | Usage | Description |
|----------|-------|-------------|
| --outdir | </project/> | Specifies where to save the output from the nextflow run. |
| --savefq | | Compresses and saves raw fastq reads. |
| --saveTrim | | Compresses and saves trimmed fastq reads. |
| --saveAll | | Compresses and saves all fastq reads. |
| --skipBAM | | Skips saving BAM files (CRAM files are saved by default). |
| --savebw | | Saves normalized BigWig files for the UCSC genome browser. |
| --savebg | | Saves the concatenated pos/neg bedGraph file. |
| --savedup | | Saves deduplicated/duplicate-marked BAM files (uses Picard; cannot be used with --skippicard). |

Input File Options

| Argument | Usage | Description |
|----------|-------|-------------|
| --singleEnd | | Specifies that the input files are not paired reads (default is paired-end). |

Performance Options

| Argument | Usage | Description |
|----------|-------|-------------|
| --threadfqdump | | Runs multi-threaded fastq-dump for sra processing. |

QC Options

| Argument | Usage | Description |
|----------|-------|-------------|
| --skipMultiQC | | Skip running MultiQC. |
| --skipRSeQC | | Skip running RSeQC. |
| --skippreseq | | Skip running preseq. |
| --skipFastQC | | Skip running FastQC. |
| --skippileup | | Skip running pileup. |
| --skipAllQC | | Skip running all QC (does not include mapstats). |
| --noTrim | | Skip trimming and only run mapping. |
| --dedup | | Remove sequencing duplicates from BAM files (uses Picard; cannot be used with --skippicard). |
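
Putting several of these options together, a run that uses multi-threaded fastq-dump, saves BigWigs, skips BAM output, and removes duplicates might look like this (all paths are placeholders):

    $ nextflow run main.nf -profile slurm_grch38 --workdir '</nextflow/work/temp/>' --outdir '</my/project/>' --email <john.doe@themailplace.com> --sras '</dir/to/sras/*>' --threadfqdump --savebw --skipBAM --dedup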