sra_download

Software to help automate the bulk downloading of SRA files from NCBI to a SLURM cluster using the SRA Toolkit.




STEP 1

Create a conda environment called 'sra_env' with Python 3 and install the SRA Toolkit:

	conda create --name sra_env python=3
	conda activate sra_env
	conda install -c bioconda sra-tools



STEP 2

Place all scripts in the same directory and run the setup:

	python sra_pipe.py -setup



STEP 3

Move the accession list file to the accessions directory:

	~/sra_download/accessions/SraAccList.txt

Example list format:

	SRR0000001
	SRR0000002
	SRR0000003
	...

N.B. at the time of writing, such lists could easily be obtained via https://www.ncbi.nlm.nih.gov/sra/ by searching for samples and then clicking "Send to" -> "File" -> "Accession list" -> "Create File" in the drop-down menu.
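As a rough sanity check before submitting, the list can be read and validated in Python. The helper below is purely illustrative (it is not part of sra_pipe.py); the accepted prefixes reflect the INSDC run-accession convention (SRR for NCBI, ERR for EBI, DRR for DDBJ).

```python
import re

# INSDC run accessions: SRR/ERR/DRR followed by digits.
# This pattern is an illustrative check only.
ACCESSION_RE = re.compile(r"^[SED]RR\d+$")

def read_accessions(path):
    """Return the valid accessions in a list file, skipping blank lines."""
    valid = []
    with open(path) as handle:
        for line in handle:
            acc = line.strip()
            if not acc:
                continue
            if ACCESSION_RE.match(acc):
                valid.append(acc)
            else:
                print(f"Skipping malformed accession: {acc!r}")
    return valid
```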



STEP 4

Run stage 1 to download .sra files via the SRA Toolkit's prefetch command:

	python sra_pipe.py -stage 1
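Under the hood, each stage-1 task boils down to one prefetch call per accession. A minimal sketch of that command (the output-directory layout here is an assumption for illustration, not read from sra_pipe.py; -O is prefetch's real output-directory flag):

```python
def prefetch_command(accession, output_dir):
    """Build the argument list for one prefetch job.

    prefetch fetches the .sra archive for a run accession;
    -O sets the directory it is saved under.
    """
    return ["prefetch", accession, "-O", output_dir]

# The list form can be passed directly to subprocess.run(...)
print(prefetch_command("SRR0000001", "sra_download/sra"))
```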

N.B. downloads sometimes fail with a "Cannot KStreamRead" error. Submitting with the optional "-buddy" flag also submits sra_buddy.py, which regularly monitors tasks and automatically re-submits those that fail in this way.
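The failure sra_buddy.py watches for appears in the job log, so the core check reduces to scanning log text for the error string. The sketch below illustrates that idea only; the function names and log format are assumptions, not sra_buddy.py's actual implementation:

```python
# Transient download failure reported by prefetch (see README note above).
ERROR_MARKER = "Cannot KStreamRead"

def needs_resubmission(log_text):
    """True if a job log contains the transient KStreamRead failure."""
    return ERROR_MARKER in log_text

def failed_accessions(logs):
    """Given {accession: log text}, return accessions to re-submit."""
    return [acc for acc, text in logs.items() if needs_resubmission(text)]
```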



STEP 5

Run stage 2 to convert .sra files to fastq files via the SRA Toolkit's fasterq-dump command with its default settings (equivalent to --split-3 & --skip-technical). A .fastq extension is also added to any extension-less output files, and all fastq files are then compressed via gzip.

	python sra_pipe.py -stage 2
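Per accession, stage 2 amounts to a dump-then-compress sequence. A hedged sketch of the equivalent shell commands for one run (the helper and output paths are illustrative; the file names assume fasterq-dump's default --split-3 naming, where paired runs yield _1/_2 files):

```python
def stage2_commands(accession, out_dir):
    """Shell commands to convert one .sra into compressed fastq files.

    fasterq-dump's defaults split paired reads (--split-3) and skip
    technical reads; gzip then compresses every resulting fastq.
    """
    return [
        f"fasterq-dump {accession} -O {out_dir}",
        f"gzip {out_dir}/{accession}*.fastq",
    ]
```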



CHECKING TASK PROGRESS / ACCOUNTING DATA

While tasks are being processed, the progress of output files and the SLURM accounting data can be monitored via the appropriate flags:

	python sra_pipe.py -stage 1 -progress
	python sra_pipe.py -stage 1 -accounting
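A -progress style check reduces to comparing expected against produced files. A minimal sketch, assuming a completed run is marked by a compressed fastq named after its accession (this is an assumption for illustration, not sra_pipe.py's actual logic):

```python
import glob
import os

def progress(accessions, out_dir):
    """Return (done, total) by checking for compressed fastq output.

    An accession counts as done if any file matching
    <accession>*.fastq.gz exists in out_dir.
    """
    done = sum(
        1 for acc in accessions
        if glob.glob(os.path.join(out_dir, f"{acc}*.fastq.gz"))
    )
    return done, len(accessions)
```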



ADDITIONAL SETTINGS

Various other settings can also be changed from their defaults:

FLAG		DEFAULT

-overwrite	overwrite existing files (off by default)
-pipe		submit the pipeline itself so new tasks can be submitted from the hpcc when possible; useful when there are many accessions (off by default)
-u <name>	current user (auto-detected)
-p </path/to/>	home directory (auto-detected)
-d <name>	sra_download (working directory)
-a <name>	SraAccList.txt (accession list file)
-e <name>	sra_env (conda environment name)
-l <number>	100 (parallel task limit)

-pa <name>	defq (partition; specific to the UoN hpcc augusta)
-no <number>	1 (node)
-nt <number>	1 (ntask)
-me <number[k|m|g|t]>	4g (memory)
-wt <HH:MM:SS>	24:00:00 (walltime)
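The SLURM-facing defaults above map directly onto sbatch directives. A sketch of the job-script header those values would produce (the function name is illustrative; the directive names are standard sbatch options):

```python
def sbatch_header(partition="defq", nodes=1, ntasks=1,
                  memory="4g", walltime="24:00:00"):
    """Render SBATCH directive lines from the pipeline defaults."""
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks={ntasks}",
        f"#SBATCH --mem={memory}",
        f"#SBATCH --time={walltime}",
    ])
```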