/AWS-iGenomes-build

Building script for AWS-iGenomes

Primary language: Nextflow. License: MIT.

Build AWS-iGenomes

MIT License | Install with bioconda | Docker Container available

Common reference genomes hosted on AWS S3


Download script & command builder: https://ewels.github.io/AWS-iGenomes/


Introduction

In NGS bioinformatics, a typical analysis run involves aligning raw DNA sequencing reads against a known reference genome. A different reference is needed for every species, and many species have several references to choose from. Each tool then builds its own indices against these references. As a result, one analysis run typically requires a number of different files. For example: the raw underlying DNA sequence, annotation (GTF files) and index files for use with the chosen alignment tool.

These files are quite large and take time to generate. Downloading and building them for each AWS run often takes a significant share of the total run time and resources, which is very wasteful. The iGenomes initiative aims to collect and standardise a number of common species, references and tool indices. To help with this, we have created an AWS S3 bucket containing the illumina iGenomes references, with additional indices for a few extra tools on top of this base dataset.

This data is hosted in an S3 bucket (~5TB) and, crucially, is uncompressed (unlike the .tar.gz files held on the illumina iGenomes FTP servers). AWS runs can pull just the required files to their local file storage before running. This is faster, cheaper and more reproducible than downloading and rebuilding the references for every run.
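As a rough illustration of pulling only the required files, the sketch below builds `aws s3 sync` commands for individual parts of one reference. It is not the project's own download script (see the command builder linked above for that). The bucket name `example-igenomes-bucket` is a placeholder, and the `Species/Source/Build/...` directory layout and the component paths are assumptions made for this example, so check them against the real bucket before use.

```python
from dataclasses import dataclass
from typing import List

# Placeholder bucket name, NOT the real one (assumption for illustration).
BUCKET = "s3://example-igenomes-bucket"


@dataclass(frozen=True)
class Reference:
    """One reference genome, identified by species, source and build.

    The Species/Source/Build layout is an assumption for this sketch.
    """
    species: str
    source: str
    build: str

    def prefix(self) -> str:
        # Root S3 prefix under which all files for this reference live.
        return f"{BUCKET}/{self.species}/{self.source}/{self.build}"


def sync_command(ref: Reference, component: str, dest: str) -> List[str]:
    """Build an `aws s3 sync` command for one component of a reference.

    `component` is a sub-directory such as a FASTA or annotation folder,
    so only that part of the reference is copied, not the whole genome.
    """
    src = f"{ref.prefix()}/{component}/"
    target = f"{dest.rstrip('/')}/{component}/"
    return ["aws", "s3", "sync", src, target]


if __name__ == "__main__":
    ref = Reference("Homo_sapiens", "Ensembl", "GRCh37")
    # Hypothetical component paths: genome FASTA and gene annotation only.
    for part in ("Sequence/WholeGenomeFasta", "Annotation/Genes"):
        # Print the command rather than running it.
        print(" ".join(sync_command(ref, part, "./references")))
```

To actually fetch the files, each command list could be passed to `subprocess.run(cmd, check=True)` on a machine with the AWS CLI installed and configured. Syncing per component keeps the transfer limited to what a given pipeline needs, for example skipping indices for aligners that are not used.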

Credits

The iGenomes resource was created by illumina. All credit for the collection and standardisation of this data should go to them!

This S3 resource was set up and documented by Phil Ewels (@ewels). The additional references not found in the base iGenomes resource were created with the help of Wesley Schaal (@wschaal) - a system administrator at UPPMAX (Uppsala Multidisciplinary Center for Advanced Computational Science).

The resource was initially developed for use at the National Genomics Infrastructure at SciLifeLab in Stockholm, Sweden.

