iRODS-Downloader

Wrapper program for hassle-free downloading and realigning of sequencing data from iRODS on a cluster

Installation

To download the latest binary release, use the following one-liner:

curl -s https://api.github.com/repos/seanlaidlaw/iRODS-Downloader/releases/latest | grep browser_download_url | cut -d '"' -f 4 | wget -qi -
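For readers who want to see what the one-liner extracts, here is a minimal Go sketch of the same step: it reads the `browser_download_url` of each asset from a "latest release" response. The `release` type, the `downloadURLs` helper, and the sample payload are illustrative assumptions and not part of irods_downloader. A real response body would come from the API URL used in the curl command above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// release mirrors only the fields of the GitHub release JSON used here.
type release struct {
	Assets []struct {
		BrowserDownloadURL string `json:"browser_download_url"`
	} `json:"assets"`
}

// downloadURLs returns the download URL of every asset in a release payload.
func downloadURLs(body []byte) ([]string, error) {
	var r release
	if err := json.Unmarshal(body, &r); err != nil {
		return nil, err
	}
	urls := make([]string, 0, len(r.Assets))
	for _, a := range r.Assets {
		urls = append(urls, a.BrowserDownloadURL)
	}
	return urls, nil
}

func main() {
	// Hypothetical, heavily trimmed payload standing in for the API response.
	sample := []byte(`{"assets":[{"browser_download_url":"https://example.com/irods_downloader"}]}`)

	urls, err := downloadURLs(sample)
	if err != nil {
		fmt.Fprintln(os.Stderr, "could not parse release JSON:", err)
		os.Exit(1)
	}
	for _, u := range urls {
		fmt.Println(u)
	}
}
```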

Dependencies

Server side Dependencies

As this is essentially a pipeline that calls other tools, several dependencies must be installed on the server it runs on.

  • iRODS - the data management system used to store the raw sequencing data
  • LSF - the job scheduler used on the Sanger cluster, which the wrapper uses to submit jobs and check job completion status

Optional Dependencies

  • Go - the language the pipeline is written in; required for compiling from source, but not when using the binaries from the release

  • viper - Go library for config management

Usage

Authentication

Before the script can be used, iRODS authentication is required. Failure to authenticate will result in the error: failed with error -993000 PAM_AUTH_PASSWORD_FAILED. To authenticate, run iinit just before running the script, e.g.

$ iinit
Enter your current PAM password:

Command line arguments

There are two required arguments: -r specifies the run and -l specifies the lane. Both must be provided for the script to run properly.

$ ./irods_downloader -r 1234 -l 1

Each run downloads and processes the samples in the working directory, so make sure to create a dedicated directory before running. If multiple lanes need to be downloaded, the command has to be run once per lane, each time in a separate directory. If errors occur, rerunning the command in the same directory will attempt to pick up where the downloader left off, thanks to the checkpoint JSON files that irods_downloader produces as it goes.
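As an illustration of the one-directory-per-lane rule, the following Go sketch plans one invocation per lane. The `laneJob` type, the `planLanes` helper, and the `run<run>_lane<lane>` directory naming are assumptions made for this example; irods_downloader itself does not prescribe directory names.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strconv"
)

// laneJob describes one invocation of irods_downloader in its own directory.
type laneJob struct {
	Dir  string   // working directory for this lane (hypothetical naming scheme)
	Args []string // arguments to pass to irods_downloader
}

// planLanes returns one job per lane, each in a separate directory under base.
func planLanes(base, run string, lanes []int) []laneJob {
	jobs := make([]laneJob, 0, len(lanes))
	for _, lane := range lanes {
		l := strconv.Itoa(lane)
		jobs = append(jobs, laneJob{
			Dir:  filepath.Join(base, "run"+run+"_lane"+l),
			Args: []string{"-r", run, "-l", l},
		})
	}
	return jobs
}

func main() {
	// Print the plan only; nothing is created or executed here.
	for _, j := range planLanes("downloads", "1234", []int{1, 2}) {
		fmt.Println(j.Dir, j.Args)
	}
}
```

Each planned job could then be started with os/exec after creating `j.Dir`, by setting the command's `Dir` field so that the downloader runs inside that directory. Rerunning a failed lane in the same directory keeps its checkpoint files in place.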

Configuration

irods_downloader looks for a configuration file named irods_downloader_config.yaml to find out where the program dependencies are located and which library_types it should class as RNA vs DNA. A config file with this filename is looked for first in the working directory (thus allowing for project-specific configs), then in $HOME/.config/. If neither location contains a config, the default values are used.
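The search order can be sketched in Go as below. This is an illustration using only the standard library, not the project's actual code, which relies on viper. The `findConfig` and `fileExists` helpers are hypothetical names introduced for this example.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

const configName = "irods_downloader_config.yaml"

// findConfig returns the first existing config path, checking the working
// directory before $HOME/.config. ok is false when neither exists, in which
// case a caller would fall back to the built-in defaults.
func findConfig(workDir, home string, exists func(string) bool) (path string, ok bool) {
	candidates := []string{
		filepath.Join(workDir, configName),
		filepath.Join(home, ".config", configName),
	}
	for _, c := range candidates {
		if exists(c) {
			return c, true
		}
	}
	return "", false
}

// fileExists reports whether path exists and is not a directory.
func fileExists(path string) bool {
	info, err := os.Stat(path)
	return err == nil && !info.IsDir()
}

func main() {
	wd, err := os.Getwd()
	if err != nil {
		wd = "."
	}
	home, err := os.UserHomeDir()
	if err != nil {
		home = ""
	}
	if p, ok := findConfig(wd, home, fileExists); ok {
		fmt.Println("using config:", p)
	} else {
		fmt.Println("no config found, using defaults")
	}
}
```

Passing the existence check in as a function keeps the lookup order easy to test without touching the filesystem.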

The following example is a valid YAML config file containing the default values:

bwa_align_libraries: ["GnT Picoplex"]
attribute_with_sample_name: "sample_supplier_name"
samtools_exec: "/software/CASM/modules/installs/samtools/samtools-1.11/bin/samtools"
star_exec: "/nfs/users/nfs_r/rr11/Tools/STAR-2.5.2a/bin/Linux_x86_64_static/STAR"
star_genome_dir: "/lustre/scratch119/casm/team78pipelines/reference/human/GRCh37d5_ERCC92/star/75/"
bwa_exec: "/software/CASM/modules/installs/bwa/bwa-0.7.17/bin/bwa"
bwa_genome_ref: "/lustre/scratch119/casm/team78pipelines/reference/human/GRCH37d5/genome.fa"
featurecounts_exec: "/nfs/users/nfs_s/sl31/Tools/subread-2.0.1-Linux-x86_64/bin/featureCounts"
genome_annot: "/lustre/scratch124/casm/team78pipelines/canpipe/live/ref/Homo_sapiens/GRCh37d5_ERCC92/cgpRna/e75/ensembl.gtf"
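To show how bwa_align_libraries might drive the RNA vs DNA decision, here is a small Go sketch. It assumes that library types listed in bwa_align_libraries are aligned with bwa and that every other library type is treated as RNA and aligned with STAR. That fallback rule, the `alignerFor` helper, and the "Example RNA library" name are assumptions for illustration, not confirmed behaviour of irods_downloader.

```go
package main

import "fmt"

// alignerFor returns "bwa" when libraryType appears in bwaLibraries and
// "star" otherwise (assumed fallback for RNA libraries).
func alignerFor(libraryType string, bwaLibraries []string) string {
	for _, l := range bwaLibraries {
		if l == libraryType {
			return "bwa"
		}
	}
	return "star"
}

func main() {
	// Default value of bwa_align_libraries from the example config.
	bwaLibraries := []string{"GnT Picoplex"}

	// "Example RNA library" is a hypothetical library_type used for illustration.
	for _, lib := range []string{"GnT Picoplex", "Example RNA library"} {
		fmt.Printf("%s: %s\n", lib, alignerFor(lib, bwaLibraries))
	}
}
```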

If the default memory usage is not appropriate, the config also accepts optional settings for each tool's RAM usage:

bwa_ram: "50000"
star_ram: "50000"
featurecounts_ram: "20000"

Outputs

  • A_iRODS_CRAM_Downloads

The downloaded CRAM and imeta files are stored here.

  • B_Fastq_Extraction

This is the location of the gz-compressed fastq files extracted from the CRAMs in A_iRODS_CRAM_Downloads.

  • C_Split_by_Library_Type

Symlinks to the split fastqs are stored here, in separate folders for each library_type. They are no longer named by iRODS filename but by the sample name obtained from imeta.

  • D_realignments

The realigned BAM files are output here, following the same library_type-separated folder structure as before. The realigned BAMs are sorted before being written to disk, and are indexed in step 7 of analysis.

  • E_Counts_matrix_RNA

If any BAMs have a library_type specified as RNA, the counts matrix for those BAMs is computed and stored here.
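The layout described for C_Split_by_Library_Type (one folder per library_type, links named by sample name) can be sketched in Go as follows. The `linkPath` and `linkFastq` helpers, the `.fastq.gz` suffix, the source filename, and the sample name are all assumptions for this example. The sketch works in a temporary directory and does not reproduce the tool's actual naming rules.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
)

// linkPath builds <outRoot>/<libraryType>/<sampleName><suffix>.
func linkPath(outRoot, libraryType, sampleName, suffix string) string {
	return filepath.Join(outRoot, libraryType, sampleName+suffix)
}

// linkFastq creates the parent folder of dst if needed, then symlinks dst to src.
func linkFastq(src, dst string) error {
	if err := os.MkdirAll(filepath.Dir(dst), 0o755); err != nil {
		return err
	}
	return os.Symlink(src, dst)
}

// demo creates an empty stand-in fastq in a temporary directory and links it.
func demo() error {
	tmp, err := os.MkdirTemp("", "split_demo")
	if err != nil {
		return err
	}
	defer os.RemoveAll(tmp)

	// Hypothetical extracted fastq, named after its source file.
	src := filepath.Join(tmp, "1234_1#1.fastq.gz")
	if err := os.WriteFile(src, nil, 0o644); err != nil {
		return err
	}

	// "sampleA" stands in for the sample name read from imeta.
	dst := linkPath(filepath.Join(tmp, "C_Split_by_Library_Type"), "GnT Picoplex", "sampleA", ".fastq.gz")
	if err := linkFastq(src, dst); err != nil {
		return err
	}
	fmt.Println(filepath.Base(dst), "->", filepath.Base(src))
	return nil
}

func main() {
	if err := demo(); err != nil {
		log.Fatal(err)
	}
}
```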