The follwing is based on the ENCODE rna-seq-pipeline and is adjusted to include all the parts necessary for the analysis. This is the analysis DataLad structure. The Archictecture is as follows:
ANALYSIS DATASET +-----------------------------------------+ |+---------------------------------------+| || || || || || DATA || |+---------------------------------------+| |+---------------------------------------+| ||+----------------+ || |||encode pipeline | || ||+----------------+ PIPELINE || |+---------------------------------------+| |+---------------------------------------+| || || || || || INDICES || |+---------------------------------------+| |+---------------------------------------+| || || || || || ANALYSIS RESULTS || |+---------------------------------------+| |+---------------------------------------+| || || || || || SINGULARITY [.sif file] || |+---------------------------------------+| +-----------------------------------------+
In order for this pipeline to run smoothly, we need to set up the following pieces of software. These and all other dependencies only need to be installed once.
Miniconda is at the heart of the whole workflow, as we use it to create the pipeline environment. Feel free to use an updated version of miniconda. This is just the version I installed.
wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.9.2-Linux-x86_64.sh
bash Miniconda3-py39_4.9.2-Linux-x86_64.sh
Throughout the installation, conda asks for three things:
- Agree to the terms of reference
- Confirm the location of the conda install (default is fine)
- Run
conda init
after install to initialise conda. Here, please answereyes
.
Also set the conda base autoload to deactivated
conda config --set auto_activate_base false
The following step is only necessary if you are working wihtin the zshell rahter than bash:
We need to copy the conda config from ~/.bash_profile (it might be ~/.bashrc in some cases) into ~/.zshrc
# >>> conda initialize >>> # !! Contents within this block are managed by 'conda init' !! __conda_setup="$('/home/sebastian/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)" if [ $? -eq 0 ]; then eval "$__conda_setup" else if [ -f "/home/sebastian/miniconda3/etc/profile.d/conda.sh" ]; then . "/home/sebastian/miniconda3/etc/profile.d/conda.sh" else export PATH="/home/sebastian/miniconda3/bin:$PATH" fi fi unset __conda_setup # <<< conda initialize <<<
We need to create a conda environment to run the pipeline. The following three pieces need to be installed:
DataLad
caper
croo
Make sure you have activated the bioconda
channel:
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
Now we can create our conda environment:
conda create -n ENCODErnaSeq
conda activate ENCODErnaSeq
conda install -c conda-forge datalad
conda install -c bioconda caper
conda install -c bioconda croo
This Datalad
data set is the one piece that will vary for the analysis. It contains the fastq
files for the specific analysis.
In case this is not a DataLad
data set already, please run the following
datalad create --force [fastq_directory] cd [fastq_directory] datalad save -m "Add data"
Now we can “install” it via:
datalad clone [location/of/the/data]
This is a DataLad
container that contains the encode rna seq pipeline and all necessary scripts to run it. Importantly, for the paired end read as an example, the script createINPUTjson.sh
needs to be run with the project specific parameters (single end read information available under Run the pipeline):
./createINPUTjson.sh -r [Read Identifier: This can be READ or R ] \
-f [Path to fastq files (this is relative path)] \
-e [File ending] \
This is a DataLad
data set that contains the indices necessary for the RNA seq pipeline.
You can get them via datalad get INDICES
.
This is where the output of the pipeline will finally reside.
As the pipeline requires a singularity container to reproducibly run, and we do not always want to create the image from scratch, if we do now have it installed, I put the .sif file in the data architecture. This unfortunately does not work with containers_add, as the datalad call does not include the container call itself. Hence the container won’t be called from the DataLad .datalad folder.
As some of the pipeline dependencies are located on a remote server, the server needs to be accessible by the analyst. Hence, a key would need to have been created. Further, the following setup needs to be added to the file ~~/.ssh/config~:
Host rnaseq Hostname 146.118.64.152 User [YOUR USERNAME] IdentityFile [LOCATION OF YOUr PRIVATE KEY (can be ~/.ssh/keyname)
In order to get the pipeline running, we need to first assemble the individual pieces:
# Activate the conda environment
conda activate ENCODErnaSeq
# Get the folder contents from their (remote) locations
datalad get INDICES
datalad get PIPELINE
datalad get SINGULARITY
# Add the fastq data
datalad clone [DATA/LOCATION]
# Rename the folder to DATA
mv [DATA/FOLDER/NAME] DATA
# Get the data
datalad get DATA
To set up the datalad data set, we need to retrieve all the data. To make sure this works, make sure datalad is activated and then run:
bash setup.sh
We need to create a input file for the rna-seq-pipeline, which we can do with the above mentioned script:
bash PIPELINE/scripts/createINPUTjson.sh -r [Read Identifier: This can be READ or R ] \
-f [Path to fastq files (this is relative path) ] \
-e [File ending] \
This is the same, but we use the single end read script:
bash PIPELINE/scripts/createINPUTjson_singleEND.sh -r [Read Identifier: This can be READ or R ] \
-f [Path to fastq files (this is relative path) ] \
-e [File ending] \
Make sure the dataset is up to date and everything is saved by running
datalad save -m "Check dataset"
Now we have all the missing pieces together and can run the pipeline with the following command on a local machien without slurm backend:
datalad run -m "Run rna seq pipeline" \
"bash PIPELINE/scripts/rnaSeq_local.sh"