A snakemake pipeline for the analysis of ATAC-seq data
Snakemake pipeline made for reproducible analysis of paired-end Illumina ATAC-seq data. This pipeline is similar on many point to the Snakemake_ChIP-seq pipeline, it includes steps of mapping to the mitochondrial and chloroplast genome of tomato as contamination by these organelles occurs often in ATAC-seq. Post-processing also includes the tool AlignementSieve, for the adjustement of the binding site.
For now I chose to include the fasta files for the mitochondrial and chloroplast genome directly in the repository, in the future I would like the pipeline to be able to fetch them.
The bowtie2 alignment is modified to output as well the unmapped reads using --un-conc-gz
flag. The unmapped reads and total reads are then aligned to the reference, mitochondrial and chloroplast genome. The output of the mapping is then analysed with samtools flagstat
and hopefully nicely ploted thanks to MultiQc.
-
Snakefile containing the targeted output and the rules to generate them from the input files.
-
config/ , folder containing the configuration files making the Snakefile adaptable to any input files, genome and parameter for the rules. Adapt the config file and its reference in the Snakefile.
-
Fastq/, folder containing subsetted paired-end fastq files used to test locally the pipeline. Generated using Seqtk:
seqtk sample -s100 read1.fq 5000 > sub1.fqseqtk sample -s100 read2.fq 5000 > sub2.fq
. RAW fastq or fastq.gz files should be placed here before running the pipeline. -
envs/, folder containing the environment needed for the Snakefile to run. To use Snakemake, it is required to create and activate an environment containing snakemake (here : envs/global_env.yaml )
-
units.tsv, is a tab separated value files containing information about the experiment name, the condition of the experiment (control or treatment) and the path to the fastq files relative to the Snakefile. Change this file according to your samples.
First, you need to create an environment for the use of Snakemake with Conda package manager.
- Create a virtual environment named "atacseq" from the
global_env.yaml
file with the following command:conda env create --name atacseq --file ~/envs/global_env.yaml
- Then, activate this virtual environment with
source activate atacseq
The Snakefile will then take care of installing and loading the packages and softwares required by each step of the pipeline.
The ~/configs/config_tomato_sub.yaml
file specifies the sample list, the genomic reference fasta file to use, the directories to use, etc. This file is then used to build parameters in the main Snakefile
.
The Snakemake pipeline/workflow management system reads a master file (often called Snakefile
) to list the steps to be executed and defining their order.
It has many rich features. Read more here.
Samples are listed in the units.tsv
file and will be used by the Snakefile automatically. Change the name, the conditions accordingly.
Use the command snakemake -np
to perform a dry run that prints out the rules and commands.
Simply type Snakemake --use-conda
and provide the number of cores with --cores 10
for ten cores for instance.
For cluster execution, please refer to the Snakemake reference.
Please pay attention to --use-conda
, it is required for the installation and loading of the dependencies used by the rules of the pipeline.
To run the pipeline, from the folder containing the Snakefile run the
The main output are for now fastqc, bed and bigwig files. Optionals outputs of the pipelines are bamCompare, bedgraph and bed files for broad peaks calling. With the implementation of deeptools rules, the pipeline produces as well those outputs:
-
plotFingerprint contains interesting figures that answer the question: "Did my ChIP work???" . Explanation of the plot and the options available can be found here
-
PLOTCORRELATION folder contain pdf files displaying the correlation between the samples tested in the ATAC experiment, many options in the plotcorrelation rules can be changed via the configuration file. More information about this plot can be found here
-
HEATMAP folder contain pdf files displaying the content of the matrix produced by the
computeMatrix
rule under the form of a heatmap. Many option for thecomputeMatrix
and theplotHeatmap
rules can be changed in the configuration file. More information about this figure can be found here. -
plotProfile folder contain pdf files displaying profile plot for scores over sets of genomic region, again the genomic region are define in the matrix made previously. Again there are many options to change the plot and more information can be found here
Most of the parameters for the analysis are inspired by the original ATAC-seq paper
For peak-calling and deeptools outputs, the read start sites was adjusted to represent the center of the transposon binding event.
Previous descriptions of the Tn5 transposase show that the transposon binds as a dimer and inserts two adaptors separated by 9 bp. Therefore, all reads aligning to the + strand were offset by +4 bp, and all reads aligning to the – strand were offset −5 bp, this is performed by the rule alignmentsieve
.