ENCODE rna-seq-pipeline @ Computational Biology (TKI)

Content

Overview
How To

Overview

The following is based on the ENCODE rna-seq-pipeline and is adjusted to include all the parts necessary for the analysis. This is the analysis DataLad structure. The architecture is as follows:

                         ANALYSIS DATASET
+-----------------------------------------+
|+---------------------------------------+|
||                                       ||
||                                       ||
||                                  DATA ||
|+---------------------------------------+|
|+---------------------------------------+|
||+----------------+                     ||
|||encode pipeline |                     ||
||+----------------+            PIPELINE ||
|+---------------------------------------+|
|+---------------------------------------+|
||                                       ||
||                                       ||
||                               INDICES ||
|+---------------------------------------+|
|+---------------------------------------+|
||                                       ||
||                                       ||
||                      ANALYSIS RESULTS ||
|+---------------------------------------+|
|+---------------------------------------+|
||                                       ||
||                                       ||
||               SINGULARITY [.sif file] ||
|+---------------------------------------+|
+-----------------------------------------+

Dependencies

In order for this pipeline to run smoothly, we need to set up the following pieces of software. These and all other dependencies only need to be installed once.

1. Miniconda

Miniconda is at the heart of the whole workflow, as we use it to create the pipeline environment. Feel free to use a newer version of Miniconda; the one below is simply the version I installed.

wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.9.2-Linux-x86_64.sh
bash Miniconda3-py39_4.9.2-Linux-x86_64.sh

Throughout the installation, conda asks for three things:

Accept the license terms
Confirm the location of the conda install (default is fine)
Run conda init after install to initialise conda. Here, please answer yes.

Also disable automatic activation of the conda base environment:

conda config --set auto_activate_base false

The following step is only necessary if you are working within zsh rather than bash:

We need to copy the conda config from ~/.bash_profile (it might be ~/.bashrc in some cases) into ~/.zshrc:

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/home/sebastian/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/home/sebastian/miniconda3/etc/profile.d/conda.sh" ]; then
        . "/home/sebastian/miniconda3/etc/profile.d/conda.sh"
    else
        export PATH="/home/sebastian/miniconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<
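
Instead of copying the block by hand, it can also be extracted from the bash startup file. The following is only a minimal sketch: it assumes the marker lines written by conda init are unchanged, and extract_conda_block is a hypothetical helper name, not part of conda. Recent conda versions should also accept `conda init zsh`, which writes the block into ~/.zshrc directly.

```shell
#!/usr/bin/env bash
# Print the conda-managed block (marker lines included) found in a startup file.
# extract_conda_block is a hypothetical helper, not part of conda.
extract_conda_block() {
    local rc_file="$1"
    # Print everything between the two marker lines written by 'conda init'.
    sed -n '/^# >>> conda initialize >>>/,/^# <<< conda initialize <<</p' "$rc_file"
}

# Example usage (use ~/.bashrc if that is where the block lives):
# extract_conda_block ~/.bash_profile >> ~/.zshrc
```

Open a new zsh session afterwards so that the copied block is picked up.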

2. Create a conda environment

We need to create a conda environment to run the pipeline. The following three pieces need to be installed:

DataLad
caper
croo

Make sure you have activated the bioconda channel:

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

Now we can create our conda environment:

conda create -n ENCODErnaSeq
conda activate ENCODErnaSeq

conda install -c conda-forge datalad
conda install -c bioconda caper
conda install -c bioconda croo
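
To confirm that the three tools ended up in the environment, a small check like the one below can help. This is a sketch under assumptions: check_tools is a hypothetical helper, and it only tests that the commands are on the PATH, not that they work.

```shell
#!/usr/bin/env bash
# Report which of the given tools cannot be found on the PATH.
# check_tools is a hypothetical helper; it prints one line per missing tool
# and returns non-zero if at least one tool is missing.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" > /dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example usage, after 'conda activate ENCODErnaSeq':
# check_tools datalad caper croo && echo "environment looks complete"
```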

Data

This DataLad data set is the one piece that varies between analyses. It contains the fastq files for the specific analysis. If the fastq directory is not a DataLad data set already, please run the following:

datalad create --force [fastq_directory]
cd [fastq_directory]
datalad save -m "Add data"
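
Whether the fastq directory already is a DataLad data set can be checked before running the commands above. The sketch below assumes that a data set root contains a .datalad/config file; is_datalad_dataset and the directory name my_fastq_dir are hypothetical.

```shell
#!/usr/bin/env bash
# Return success if the given directory looks like the root of a DataLad data set.
# Assumption: a data set root contains a .datalad/config file.
# is_datalad_dataset is a hypothetical helper, not a DataLad command.
is_datalad_dataset() {
    [ -f "$1/.datalad/config" ]
}

# Example usage with a hypothetical directory name:
# if ! is_datalad_dataset my_fastq_dir; then
#     datalad create --force my_fastq_dir
#     (cd my_fastq_dir && datalad save -m "Add data")
# fi
```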

Now we can “install” it via:

datalad clone [location/of/the/data]

Pipeline

This is a DataLad container that contains the ENCODE rna-seq-pipeline and all scripts necessary to run it. Importantly, the script createINPUTjson.sh needs to be run with the project-specific parameters. The call below is the paired-end example; the single-end variant is described in the How To section under "Single end reads":

./createINPUTjson.sh -r [Read Identifier: This can be READ or R ] \
                     -f [Path to fastq files (this is relative path)] \
                     -e [File ending]

Indices

This is a DataLad data set that contains the indices necessary for the RNA seq pipeline. You can get them via datalad get INDICES.

Analysis Results

This is where the output of the pipeline will finally reside.

Singularity [.sif file]

The pipeline requires a Singularity container to run reproducibly, and we do not always want to build the image from scratch when it is not already available. I therefore put the .sif file in the data architecture. Unfortunately, this does not work with containers_add, as the datalad call does not include the container call itself. Hence, the container won't be called from the DataLad .datalad folder.

How To

0. ssh setup for access to server

As some of the pipeline dependencies are located on a remote server, the server needs to be accessible by the analyst. Hence, an ssh key needs to have been created beforehand. Further, the following setup needs to be added to the file ~/.ssh/config:

Host rnaseq
     Hostname 146.118.64.152
     User [YOUR USERNAME]
     IdentityFile [LOCATION OF YOUR PRIVATE KEY (can be ~/.ssh/keyname)]
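
Before fetching anything, it can be worth checking that the host alias really is defined. The following is a rough sketch: has_ssh_host is a hypothetical helper, the match is case-sensitive, and the alias is interpreted as a regular expression, so it is only meant for simple names such as rnaseq.

```shell
#!/usr/bin/env bash
# Return success if an ssh config file declares the given Host alias.
# has_ssh_host is a hypothetical helper, not part of OpenSSH.
has_ssh_host() {
    local config_file="$1" host_alias="$2"
    # Match 'Host <alias>' lines, also when several aliases share one line;
    # 'Hostname ...' lines do not match because whitespace must follow 'Host'.
    grep -Eq "^[[:space:]]*Host[[:space:]]+(.*[[:space:]])?${host_alias}([[:space:]]|\$)" "$config_file"
}

# Example usage:
# has_ssh_host ~/.ssh/config rnaseq && ssh rnaseq 'echo connected'
```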

In order to get the pipeline running, we need to first assemble the individual pieces:

1. Get the pieces

# Activate the conda environment
conda activate ENCODErnaSeq

# Get the folder contents from their (remote) locations
datalad get INDICES
datalad get PIPELINE
datalad get SINGULARITY

# Add the fastq data
datalad clone [DATA/LOCATION]

# Rename the folder to DATA
mv [DATA/FOLDER/NAME] DATA

# Get the data
datalad get DATA
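
Once these commands have finished, a quick check can show whether any of the pieces is still missing. This is a sketch only: missing_pieces is a hypothetical helper, the folder names are taken from the architecture in the Overview, and an empty folder is treated as "not retrieved yet". It does not verify that every file inside a folder has been fetched.

```shell
#!/usr/bin/env bash
# Print each expected top-level folder that is absent or empty below the
# given root (defaults to the current directory).
# missing_pieces is a hypothetical helper, not a DataLad command.
missing_pieces() {
    local root="${1:-.}" dir
    for dir in DATA PIPELINE INDICES SINGULARITY; do
        if [ ! -d "$root/$dir" ] || [ -z "$(ls -A "$root/$dir" 2> /dev/null)" ]; then
            echo "$dir"
        fi
    done
}

# Example usage from the root of the analysis data set:
# missing_pieces .
```

No output means that all four folders exist and contain something.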

2. Set up the folder

To set up the DataLad data set, we need to retrieve all the data. For this to work, check that the conda environment containing datalad is activated and then run:

bash setup.sh

3. Create the input json file for the workflow

3.1 Paired end reads

We need to create an input file for the rna-seq-pipeline, which we can do with the script mentioned above:

bash PIPELINE/scripts/createINPUTjson.sh -r [Read Identifier: This can be READ or R ] \
                                         -f [Path to fastq files (this is relative path) ] \
                                         -e [File ending]

3.2 Single end reads

This is the same, but we use the single end read script:

bash PIPELINE/scripts/createINPUTjson_singleEND.sh -r [Read Identifier: This can be READ or R ] \
                                                   -f [Path to fastq files (this is relative path) ] \
                                                   -e [File ending]
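
To make the placeholders more concrete, the sketch below prints the full call for either read layout without executing it. build_input_json_cmd is a hypothetical helper, and the values R, DATA and fastq.gz in the usage line are assumptions for illustration; in particular, whether the script expects the file ending with or without a leading dot is not stated here and should be checked in the script itself.

```shell
#!/usr/bin/env bash
# Print the createINPUTjson call for a given read layout without running it.
# build_input_json_cmd is a hypothetical helper; the script paths and the
# -r/-f/-e flags are the ones shown in the commands above.
build_input_json_cmd() {
    local layout="$1" read_id="$2" fastq_dir="$3" ending="$4" script
    case "$layout" in
        paired) script="PIPELINE/scripts/createINPUTjson.sh" ;;
        single) script="PIPELINE/scripts/createINPUTjson_singleEND.sh" ;;
        *) echo "unknown layout: $layout" >&2; return 1 ;;
    esac
    printf 'bash %s -r %s -f %s -e %s\n' "$script" "$read_id" "$fastq_dir" "$ending"
}

# Example usage with assumed values:
# build_input_json_cmd paired R DATA fastq.gz
```

Printing the command first makes it easy to review the parameters before running it.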

4. Run the pipeline

Make sure the dataset is up to date and everything is saved by running:

datalad save -m "Check dataset"
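
Whether anything is still unsaved can also be checked explicitly. The sketch below assumes that plain `git status --porcelain` is a good enough indicator for the analysis data set; dataset_is_clean is a hypothetical helper, and `datalad status` gives a more complete picture.

```shell
#!/usr/bin/env bash
# Return success if the repository at the given path has no unsaved changes.
# Assumption: 'git status --porcelain' prints nothing for a clean working tree.
# dataset_is_clean is a hypothetical helper, not a DataLad command.
dataset_is_clean() {
    local root="${1:-.}"
    # Return 2 if the path is not inside a git repository at all.
    git -C "$root" rev-parse --git-dir > /dev/null 2>&1 || return 2
    [ -z "$(git -C "$root" status --porcelain 2> /dev/null)" ]
}

# Example usage from the root of the analysis data set:
# dataset_is_clean . || datalad save -m "Check dataset"
```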

4.1 Local, without slurm

Now we have all the pieces together and can run the pipeline on a local machine without a slurm backend using the following command:

datalad run -m "Run rna seq pipeline" \
               "bash PIPELINE/scripts/rnaSeq_local.sh"

sebrauschert/encode_pipeline_tki