This pipeline is designed for automated end-to-end quality control and processing of ATAC-seq or DNase-seq data. The pipeline can be run on compute clusters with job submission engines or stand alone machines. It inherently makes uses of parallelized/distributed computing. Pipeline installation is also easy as most dependencies are automatically installed. The pipeline can be run end-to-end i.e. starting from raw FASTQ files all the way to peak calling and signal track generation; or can be started from intermediate stages as well (e.g. alignment files). The pipeline supports single-end or paired-end ATAC-seq or DNase-seq data (with or without replicates). The pipeline produces formatted HTML reports that include quality control measures specifically designed for ATAC-seq and DNase-seq data, analysis of reproducibility, stringent and relaxed thresholding of peaks, fold-enrichment and pvalue signal tracks. The pipeline also supports detailed error reporting and easy resuming of runs. The pipeline has been tested on human, mouse and yeast ATAC-seq data and human and mouse DNase-seq data.
The ATAC-seq pipeline specification is also the official pipeline specification of the Encyclopedia of DNA Elements (ENCODE) consortium. The ATAC-seq pipeline protocol definition is here. Some parts of the ATAC-seq pipeline were developed in collaboration with Jason Buenrostro, Alicia Schep and Will Greenleaf at Stanford.
- Flexibility: Support for
docker
,singularity
andConda
. - Portability: Support for many cloud platforms (Google/DNAnexus) and cluster engines (SLURM/SGE/PBS).
- User-friendly HTML report: tabulated quality metrics including alignment/peak statistics and FRiP along with many useful plots (IDR/cross-correlation measures).
- ATAqC: Annotation-based analysis including TSS enrichment and comparison to Roadmap DNase.
- Genomes: Pre-built database for GRCh38, hg19, mm10, mm9 and additional support for custom genomes.
This pipeline supports many cloud platforms and cluster engines. It also supports docker
, singularity
and Conda
to resolve complicated software dependencies for the pipeline. A tutorial-based instruction for each platform will be helpful to understand how to run pipelines. There are special instructions for two major Stanford HPC servers (SCG4 and Sherlock).
- Cloud platforms
- Web interface
- CLI (command line interface)
- Stanford HPC servers (CLI)
- Cluster engines (CLI)
- Local computers (CLI)