Bulk RNA-seq analysis pipeline with Snakemake for Broad Institute UGER cluster
Snakemake is a Pythonic workflow description language that is easy to configure for many different environments. Since version 4.1, Snakemake has supported 'profiles', which make it easy to share configuration presets for a particular environment. This repository contains a Snakemake bulk RNA-seq analysis pipeline that runs on the Broad's UGER cluster. The pipeline uses the tools shown in the graphic below.
Installation
Setting up the folder structure
The pipeline expects the following folder organization. Use it as a model when setting up your working space so that the pipeline runs successfully.
Setting up the Snakemake profile for the Broad Institute UGER cluster
Please follow the instructions on the Broad Institute GitHub page to set up the Snakemake profile.
Preparing a conda environment
The recommended way to run the analysis in this repository is to set up a conda environment, in which the versions of the tools used can be controlled. If you are working from a Windows machine, follow the installation instructions below on the Broad cluster.
use Anaconda3
# Create new conda environment with the environment.yml file provided in this repository
dos2unix environment.yml
conda env create -f environment.yml
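The `dos2unix` step converts Windows (CRLF) line endings to Unix (LF) line endings, which matters if `environment.yml` has passed through a Windows machine. As a minimal sketch, the following checks whether a file still contains carriage returns before you create the environment. The helper name `has_crlf` is hypothetical and not part of this repository.

```shell
# Hypothetical helper (not part of this repository):
# exits 0 if the given file contains at least one carriage return (CR).
has_crlf() {
    grep -q "$(printf '\r')" "$1"
}

# Warn only if the file is present and still has Windows line endings.
if [ -f environment.yml ] && has_crlf environment.yml; then
    echo "environment.yml has Windows line endings; run dos2unix first"
fi
```

If the message is printed, run `dos2unix environment.yml` as shown above and check again.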
Installing RStudio and the DESeq2 package
DESeq2 is an R/Bioconductor package used for differential expression analysis. It supports more than two experimental groups and can account for a second experimental factor. It takes a table of raw counts as input.
- To install RStudio and R, please follow the instructions [here](https://uvastatlab.github.io/phdplus/installR.html).
- Open RStudio and install DESeq2 using the instructions below:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("DESeq2")
Using the pipeline
We're ready to go! To start the analysis:
- Update the config file parameters/config.yaml to ensure it has the right paths and sample names.
- Connect to the Broad login host
ssh login
# tmux keeps your session (and anything running in it) alive even if you disconnect.
# To disconnect from the session, press CTRL+b, release both keys
# and then press d. On the original login shell, type tmux a to reconnect
# to your session, tmux ls to list all sessions, and tmux a -t [number] to
# connect to session [number].
tmux
cd scripts/
use Anaconda3
source activate ../tools/snakemake
# The command below runs all the steps in the pipeline
# up until the alignment by STAR.
snakemake --profile broad-uger --cluster-config cluster.json
- Once the Snakemake jobs have completed successfully, open RStudio and load the script /scripts/r/de.analysis.R
- Set the working directory to /scripts/r/ using the R command setwd("/scripts/r/")
- Select all lines in the script and click on the 'Run' button at the top corner of your source window in RStudio.