iRODS-Downloader

Installation

To download the latest binary release, use the following one-liner:

curl -s https://api.github.com/repos/seanlaidlaw/iRODS-Downloader/releases/latest | grep browser_download_url | cut -d '"' -f 4 | wget -qi -

Dependencies

Server side Dependencies

Because this is essentially a pipeline that calls external tools, several dependencies must be installed on the server it runs on.

iRODS - the data management system used to store the raw sequencing data
LSF - the job scheduler used on the Sanger cluster, which the wrapper uses to submit jobs and check job completion status

Optional Dependencies

Go - the language the pipeline is written in; required only when compiling from source, not when using the release binaries
viper - the Go library used for config management

Usage

Authentication

Before the script can be used, iRODS authentication is required. Failure to authenticate will result in the error: failed with error -993000 PAM_AUTH_PASSWORD_FAILED. To authenticate, run iinit just before running the script, e.g.

$ iinit
Enter your current PAM password:
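
As a minimal sketch, a wrapper can check for a valid session before starting a download. It assumes that the iRODS icommand ils exits with a non-zero status when there is no valid session; ensure_irods_auth is a hypothetical helper name and is not part of irods_downloader.

```shell
# Hypothetical helper: prompt for an iRODS login only when one is needed.
# Assumption: `ils` fails when there is no valid iRODS session.
ensure_irods_auth() {
    if ils >/dev/null 2>&1; then
        echo "iRODS session is valid"
    else
        echo "No valid iRODS session, running iinit"
        iinit
    fi
}
```

It could then be chained in front of a download, for example: ensure_irods_auth && ./irods_downloader -r 1234 -l 1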

Command line arguments

There are two required arguments, -r for specifying the run and -l for specifying the lane. Both must be provided for the script to run properly.

$ ./irods_downloader -r 1234 -l 1

Each run downloads and processes the samples in the working directory, so create a dedicated directory before running. If multiple lanes need to be downloaded, the command has to be run once per lane, each time in a separate directory. If errors occur, rerunning the command in the same directory will attempt to pick up where the downloader left off, thanks to the checkpoint JSON files that irods_downloader writes as it goes.
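
The one-directory-per-lane workflow can be sketched as a small loop. The directory naming scheme (run<RUN>_lane<LANE>), the download_lanes function, and the IRODS_DOWNLOADER variable are illustrative assumptions; only the -r and -l flags come from the tool itself.

```shell
# Path to the binary; an assumption, adjust to wherever it was downloaded.
# An absolute path is used because each lane runs inside its own directory.
IRODS_DOWNLOADER="${IRODS_DOWNLOADER:-$PWD/irods_downloader}"

# Hypothetical helper: download_lanes RUN LANE [LANE...]
download_lanes() {
    run="$1"
    shift
    for lane in "$@"; do
        dir="run${run}_lane${lane}"
        mkdir -p "$dir"
        # The subshell keeps the cd from leaking into the next iteration.
        # Rerunning in the same directory resumes from the checkpoint files.
        ( cd "$dir" && "$IRODS_DOWNLOADER" -r "$run" -l "$lane" ) ||
            echo "lane $lane failed, rerun to resume" >&2
    done
}
```

For example, download_lanes 1234 1 2 3 would create three directories and run the downloader once in each.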

Configuration

irods_downloader looks for a configuration file named irods_downloader_config.yaml, which tells it where to find the program dependencies and which library_types to class as RNA vs DNA. The working directory is searched first (allowing project-specific configs), then $HOME/.config/; if neither location contains a config, the default values are used.
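
The lookup order can be pictured with the following sketch. It is not the tool's actual code (which relies on the viper library); find_config is a hypothetical shell function that only mirrors the documented search order.

```shell
# Hypothetical helper mirroring the documented search order:
# 1. working directory, 2. $HOME/.config/, 3. built-in defaults.
find_config() {
    name="irods_downloader_config.yaml"
    for dir in "$PWD" "$HOME/.config"; do
        if [ -f "$dir/$name" ]; then
            echo "$dir/$name"
            return 0
        fi
    done
    echo "built-in defaults"
}
```

Running find_config from a project directory shows which file would take precedence there.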

Example YAML configuration file. This is a valid config file with the default values:

bwa_align_libraries: ["GnT Picoplex"]
attribute_with_sample_name: "sample_supplier_name"
samtools_exec: "/software/CASM/modules/installs/samtools/samtools-1.11/bin/samtools"
star_exec: "/nfs/users/nfs_r/rr11/Tools/STAR-2.5.2a/bin/Linux_x86_64_static/STAR"
star_genome_dir: "/lustre/scratch119/casm/team78pipelines/reference/human/GRCh37d5_ERCC92/star/75/"
bwa_exec: "/software/CASM/modules/installs/bwa/bwa-0.7.17/bin/bwa"
bwa_genome_ref: "/lustre/scratch119/casm/team78pipelines/reference/human/GRCH37d5/genome.fa"
featurecounts_exec: "/nfs/users/nfs_s/sl31/Tools/subread-2.0.1-Linux-x86_64/bin/featureCounts"
genome_annot: "/lustre/scratch124/casm/team78pipelines/canpipe/live/ref/Homo_sapiens/GRCh37d5_ERCC92/cgpRna/e75/ensembl.gtf"

If the default memory usage is not appropriate, the config also accepts optional settings for each tool's RAM usage:

bwa_ram: "50000"
star_ram: "50000"
featurecounts_ram: "20000"

Outputs

A_iRODS_CRAM_Downloads

The downloaded CRAM and imeta files are stored here.

B_Fastq_Extraction

The gz-compressed FASTQ files, extracted from the CRAMs in A_iRODS_CRAM_Downloads, are stored here.

C_Split_by_Library_Type

Symlinks to the split FASTQs are stored here, in a separate folder for each library_type. The symlinks are named by the sample name obtained from imeta rather than by iRODS filename.

D_realignments

The realigned BAM files are written here, following the same library_type folder structure as C_Split_by_Library_Type. The BAMs are sorted before being written to disk and are indexed in step 7 of the analysis.

E_Counts_matrix_RNA

If any BAMs have a library_type classed as RNA, a counts matrix is computed for those BAMs and stored here.
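
As a quick way to see how far a run has progressed, the output directories listed above can be checked from the working directory. This is only a sketch; check_outputs is a hypothetical helper, and a missing E_Counts_matrix_RNA is expected when the run contains no RNA libraries.

```shell
# Hypothetical helper: report which output directories exist so far.
check_outputs() {
    for dir in A_iRODS_CRAM_Downloads B_Fastq_Extraction \
               C_Split_by_Library_Type D_realignments E_Counts_matrix_RNA; do
        if [ -d "$dir" ]; then
            echo "present: $dir"
        else
            echo "missing: $dir"
        fi
    done
}
```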