SRA Pipeline

This repository contains code for running an analysis pipeline in AWS Batch.

What the pipeline does

Given a set of SRA accession numbers, AWS Batch will start an array job where each child will process a single accession number, doing the following:

  • Download the file(s) associated with the accession number from SRA, using the prefetch tool with the Aspera Connect transport.
  • Start a bash pipe which runs the following steps, once for each of three viral genomes (see the sketch after this list):
    • Extract the downloaded .sra file to FASTQ format using fastq-dump. The .sra file is highly compressed, and this step can expand it to more than 20 times its size, which is one reason we stream the data through a pipe: it avoids the need for a large amount of scratch space.
    • Pipe the FASTQ data through bowtie2 to search for the virus.
    • Pipe the output of bowtie2 through gzip to compress it prior to the next step.
    • Stream the compressed output to an S3 bucket. The resulting file will have an S3 URL like this: s3://<bucket-name>/pipeline-results2/<SRA-accession-number>/<virus>/<SRA-accession-number>.sam.gz.
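
To make the steps concrete, here is a minimal sketch of the per-accession work as a single streaming pipeline. It is illustrative only: the tool flags, index name, and variable values below are assumptions, not the repository's actual script.

# Hypothetical values for illustration; substitute real ones.
ACCESSION=SRR0000000        # an SRA accession number (placeholder)
VIRUS=example-virus         # one of the three viral genomes / bowtie2 index basename (assumed)
BUCKET=my-results-bucket    # destination S3 bucket (placeholder)

# Step 1: download the .sra file for the accession (done once per accession).
prefetch "$ACCESSION"

# Steps 2-4: extract to FASTQ on stdout, align against the viral bowtie2 index,
# compress, and stream the result straight to S3 with no large intermediate files.
fastq-dump --stdout "$ACCESSION" \
  | bowtie2 -x "$VIRUS" -U - --no-unal \
  | gzip -c \
  | aws s3 cp - "s3://$BUCKET/pipeline-results2/$ACCESSION/$VIRUS/$ACCESSION.sam.gz"

Because each stage reads from the previous stage's stdout, only the compressed SAM output ever needs to leave the container, which is what keeps the scratch-space requirements small.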

Prerequisites/Requirements

  • These tools must all be run on the Fred Hutch internal network.
  • Obtain your S3 credentials using the awscreds script. You only need to do this once.
  • Request access to AWS Batch.
  • Clone this repository to a location under your home directory, and then change directories into the repository (you only need to do this once, although you may need to run git pull periodically to keep your cloned repository up to date):
git clone https://github.com/FredHutch/sra-pipeline.git
cd sra-pipeline

sra_pipeline utility

A script called sra_pipeline is available to simplify the following tasks:

  • Display accession numbers that have already been processed.
  • Display accession numbers which are currently being processed.
  • Submit some number of new accession numbers to the pipeline, choosing them randomly, picking the smallest available data sets, or providing a file containing accession numbers.

Running the utility with --help gives usage information:

$ ./sra_pipeline --help
usage: sra_pipeline.py [-h] [-c] [-i] [-s N] [-r N] [-f FILE]

optional arguments:
  -h, --help            show this help message and exit
  -c, --completed       show completed accession numbers
  -i, --in-progress     show accession numbers that are in progress
  -s N, --submit-small N
                        submit N jobs of ascending size
  -r N, --submit-random N
                        submit N randomly chosen jobs
  -f FILE, --submit-file FILE
                        submit accession numbers contained in FILE

Additional monitoring of jobs

You can get more detail about running jobs by using
the Batch Dashboard and/or the AWS command-line client for Batch.
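
For example, the AWS CLI can list jobs in a queue and describe individual jobs. The queue name and job ID below are placeholders, not values from this repository:

aws batch list-jobs --job-queue <your-job-queue> --job-status RUNNING
aws batch describe-jobs --jobs <job-id>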

See Using AWS Batch at Fred Hutch for more information.