This repository falls under the NIH STRIDES Initiative. STRIDES aims to harness the power of the cloud to accelerate biomedical discoveries. To learn more, visit https://cloud.nih.gov.
The sheer quantity of resources available for learning AWS can quickly become overwhelming. NIH Cloud Lab's goal is to make the cloud easy and accessible for you, so that you can spend less time on administrative tasks and more time on your research.
Use this repository to learn how to use AWS by exploring the linked resources and walking through the tutorials. If you are a beginner, we suggest you begin with this jumpstart section. If you already have foundational knowledge of AWS and the cloud, feel free to skip ahead to the tutorials section for in-depth examples of how to run specific workflows such as genomic variant calling and medical image analysis.
- Getting Started
- Overview
- Command Line Tools
- Amazon Marketplace
- Ingest and Store Data
- Virtual Machines in EC2
- Disk Images
- SageMaker Notebooks
- Creating a Conda Environment
- Managing Containers and Code Repositories
- Clusters
- Billing and Benchmarking
- Cost Optimization
- Getting Support
- Additional Training
You can learn a lot about what is possible on AWS from the AWS Getting Started Tutorials page, and we recommend you go there and explore some of the tutorials on offer. Nonetheless, it can be hard to know where to start if you are new to the cloud. To help, we thought through some of the most common tasks you will encounter doing cloud-enabled research and gathered tutorials and guides specific to those topics. We hope the following materials are helpful as you explore cloud-based research. For an alternative perspective, you can also check out Lynn Langit's AWS for Bioinformatics repo.
There are three primary ways you can run analyses on AWS: virtual machines, Jupyter Notebook instances, and serverless services. We give a brief overview of each here and go into more detail in the sections below. Virtual machines are like desktop computers, except that you access them through the cloud console and you pick the operating system and the specs, such as CPU and memory. In AWS, these virtual machines are called Elastic Compute Cloud, or EC2 for short. Jupyter Notebook instances are virtual machines preconfigured with JupyterLab. On AWS these run through SageMaker, which is also AWS's ML/AI platform; you decide what kind of virtual machine you want to 'spin up' and then run Jupyter notebooks on it. Finally, serverless services let you run something (an analysis, an app, a website) without managing your own servers (VMs). There are still servers running somewhere, you just don't have to manage them; all you have to do is call a command that runs your analysis in the background and then retrieve the outputs, usually from a storage bucket.
Most tasks in AWS can be done without the command line, but the command line tools will generally make your life easier in the long run. Command line interface (CLI) tools are those that you use directly in a terminal/shell as opposed to clicking within a graphical user interface (GUI). The primary tool you will need is the AWS CLI, which will allow you to interact with instances or S3 buckets (see below) from your local terminal. Instructions for the CLI can be found here. If you are unable to install locally, you can use all the CLI commands from within EC2 and SageMaker instances, or from the Cloud Shell.
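Once the CLI is installed, a quick sanity check like the sketch below confirms it is working; output will of course depend on what is in your account.

```bash
# Confirm the AWS CLI is installed and on your PATH
aws --version

# List the S3 buckets in your account (a quick test that the CLI can reach AWS)
aws s3 ls

# Built-in help is available for every command group
aws ec2 help
```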
To configure the CLI, you will need to authenticate using access keys, which are unique strings that tell AWS that you are allowed to interact with the account. Within Cloud Lab, you will need to use Short Term Access Keys. If you are an NIH user, the instructions for accessing these are found here. Short Term Access keys differ from Long Term Access keys in that they only work for a short period of time. Once your time limit expires, you have to request new keys and then authenticate again.
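One common way to use short-term credentials is to export them as environment variables in your shell session, as in the minimal sketch below (the key values are placeholders; paste in the credentials you were actually issued).

```bash
# Standard environment variables the AWS CLI reads for credentials.
# The values shown are placeholders, not real keys.
export AWS_ACCESS_KEY_ID="ASIAXXXXXXXXEXAMPLE"
export AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI/K7MDENG/EXAMPLEKEY"
export AWS_SESSION_TOKEN="IQoJb3JpZ2luX2VjEXAMPLETOKEN"

# Verify that the credentials are active and see which identity you are using
aws sts get-caller-identity
```

When the keys expire, repeat the process with a fresh set of credentials.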
If you are running bioinformatic workflows, you can leverage the serverless functionality of AWS using Amazon Omics, a service for genomic-data-specific storage, serverless workflow execution (using WDL or Nextflow), and variant queries using Amazon Athena. For those who want to use other workflow managers, you can instead try the AWS Genomics CLI, which is a wrapper around genomics workflow managers and AWS Batch (a serverless computing cluster). See our docs on how to set up the Genomics CLI for Cloud Lab. Supported workflow engines include Toil, Cromwell, miniwdl, Nextflow, and Snakemake.
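If you go the Genomics CLI route, a typical session looks roughly like the sketch below. The context and workflow names are placeholders from a hypothetical `agc-project.yaml`; check `agc --help` and the Genomics CLI documentation for the exact flags your version supports.

```bash
# One-time setup: activate the Genomics CLI core infrastructure in your account
agc account activate

# Deploy a compute context defined in your project's agc-project.yaml
agc context deploy --context myContext

# Submit a workflow from the same project file, then check on it
agc workflow run myWorkflow --context myContext
agc workflow status --context myContext
```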
The AWS Marketplace is a platform, similar to Amazon.com, where you can search for and launch pre-configured solutions such as machine images. Examples of images you might launch include those with enhanced security (see the EC2 section) or ones optimized for various tasks like machine learning, platform-specific genomics, or accelerated genomics.
Data can be stored in two places on the cloud: in a cloud storage bucket, which on AWS is Amazon Simple Storage Service (S3), or on an instance, which usually uses Elastic Block Store. Block storage has a finite size (e.g., 200 GB) and is attached to your virtual machine. S3 is object storage, meaning you can put any type of object in S3, and it is scalable, so there is no upper limit on total storage size. There is, however, a 5 TB limit on individual objects that you upload, so if you need to upload a larger file, you would need to break it into smaller pieces.
In general, you want to keep your compute and storage separate: aim to store data in S3, copy only the data you need to a particular instance to run an analysis, then copy the results back to S3. In addition, the data on an instance is only available while the instance exists, whereas the data in S3 is always available and serves as a longer-term storage solution. Here is a great tutorial on how to use S3; it is worth going through to learn how it all works.
We also want to give you a few other tips that may be helpful when it comes to moving and storing data. If your end goal is to move data to an S3 bucket, you can do that in the UI by clicking the `Upload` button, or from the CLI by typing `aws s3 cp <FILE> <s3://BUCKET>`. If you want to move a whole folder, add the `--recursive` flag: `aws s3 cp <DIR> <s3://BUCKET> --recursive`. The same applies whether you are moving data from your local directory or from an EC2 instance. Likewise, you can move data from S3 back to your local machine or your EC2 instance with `aws s3 cp <s3://BUCKET/FILE> <DESTINATION/PATH>`. Finally, you can move data onto an instance using `scp`, an SSH tool for copying local data to a remote server; just make sure the instance is running. You can use a command like `scp -i 'key.pem' <FILE> ec2-user@ec2-NAME.REGION.compute.amazonaws.com:~/PATH`. Once the data is on the VM, it is a good idea to use `aws s3 cp` to move it to S3. If you are trying to move data from the Sequence Read Archive (SRA) to an instance, or to S3, you can use the SRA Toolkit; follow our SRA Toolkit tutorial for best practices.
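Put together, a typical round trip looks something like the sketch below; the bucket, file, and host names are placeholders you would replace with your own.

```bash
# Copy a single file from your local machine (or an EC2 instance) to a bucket
aws s3 cp results.vcf.gz s3://my-example-bucket/results/

# Copy a whole directory
aws s3 cp ./fastq/ s3://my-example-bucket/fastq/ --recursive

# Pull data back down from the bucket
aws s3 cp s3://my-example-bucket/fastq/sample1_R1.fastq.gz ./

# Or push a local file straight to a running EC2 instance over SSH
scp -i key.pem results.vcf.gz ec2-user@ec2-12-34-56-78.us-east-1.compute.amazonaws.com:~/data/
```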
There is some strategy to managing storage costs as well. Once you have spun up a VM, you have already paid for its block storage, since you pay for the size of the disk, whereas S3 storage is charged based on how much data you put in your buckets. This is worth thinking about when copying results files back to S3: if they are not files you will need later, leave them on the VM's block storage and save your S3 spend for more important data. Just make sure you delete the VM when you are finished with it. If the data is important, either create a disk image as a backup, or copy it to S3, or both!
Virtual machines (VMs) on AWS are called Amazon Elastic Compute Cloud (EC2) instances. They are virtual computers that you access via SSH and that start as (nearly) completely blank slates. You have complete control over the VM configuration, beginning with the operating system: you can choose a variety of Linux flavors, as well as macOS and Windows. VMs are organized into machine families with different functions, such as General Purpose, Compute Optimized, and Accelerated Computing. You can also select machines with graphics processing units (GPUs), which run very quickly for some use cases but can cost more than most CPU machines. Billing occurs on a per-second basis, and larger and faster machine types cost more per second. This is why it is important to stop or delete machines when not in use to minimize costs, and to consider always using an auto-shutdown script.
Many great resources exist on how to spin up, connect to, and work on a VM in AWS. The first place to direct you is the tutorial created by the NIH Common Data Fund, which expects that you will launch an instance and work with it interactively. Here is an example developed by AWS that gives a good step-by-step on how to launch and access an instance using Amazon Lightsail. Lightsail is a simplified version of the full AWS console and may provide an interface you prefer for using EC2; note that resources you spin up in Lightsail will not be available in EC2. A lot of the tutorials in this repo also include instructions on spinning up EC2 instances. If you have issues with SSH, consider connecting via the Session Manager. We also wrote a guide with screenshots if that helps.
If you want to launch a Windows VM, check out this tutorial.
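If you prefer the CLI to the console, a minimal sketch of the same lifecycle looks like this. The AMI ID, key pair name, and instance ID are placeholders; look up a current AMI for your region in the EC2 console or with `aws ec2 describe-images`.

```bash
# Launch a small Amazon Linux instance
aws ec2 run-instances \
    --image-id ami-0123456789abcdef0 \
    --instance-type t3.medium \
    --key-name my-key-pair \
    --count 1

# Stop the instance when you are done working (you keep the disk, but stop paying for compute)
aws ec2 stop-instances --instance-ids i-0123456789abcdef0

# Terminate it entirely when you no longer need it at all
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0
```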
From a security perspective, we recommend that you use Center for Internet Security (CIS) Hardened VMs. These have security controls that meet the CIS benchmark for enhanced cloud security. To use these VMs, go to the AWS Marketplace > Discover Products, then search for `CIS Hardened` and choose the OS that meets your needs. Click `Continue to Subscribe` in the top right, then `Continue to Configuration` and set your configuration parameters. Finally, click `Continue to Launch`. Here you decide how to launch the Marketplace solution; we recommend `Launch from EC2`, although you are welcome to experiment with the other options. Now click `Launch` and walk through the usual EC2 launch parameters. Click `Launch` again, and you can then view the status of your VM on the EC2 Instances page.
If you need to scale your VM up or down (see Cost Optimization below), you can always change the machine type by clicking on the instance ID, then going to Actions > Instance Settings > Change instance type. Note that the VM has to be stopped before you can change the instance type.
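The same change can be made from the CLI, roughly as sketched below; the instance ID and target type are placeholders.

```bash
# Stop the instance first; the type cannot be changed while it is running
aws ec2 stop-instances --instance-ids i-0123456789abcdef0

# Change the instance type
aws ec2 modify-instance-attribute \
    --instance-id i-0123456789abcdef0 \
    --instance-type "{\"Value\": \"m5.2xlarge\"}"

# Start it back up with the new type
aws ec2 start-instances --instance-ids i-0123456789abcdef0
```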
Part of the power of virtual machines is that they offer a blank slate for you to configure as desired. However, sometimes you want to reuse data or installed programs on your next VM instead of reinventing the wheel. One solution is disk (or machine) images: you copy your existing virtual disk to an Amazon Machine Image (AMI), which can serve as a backup or can be used to launch a new instance with the programs and data from a previous instance.
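Creating an image from a running instance can be done from the console or with a single CLI call, roughly as below; the instance ID, image name, and description are placeholders.

```bash
# Create an image (AMI) from an existing instance.
# --no-reboot leaves the instance running, at the cost of filesystem consistency.
aws ec2 create-image \
    --instance-id i-0123456789abcdef0 \
    --name "rnaseq-analysis-backup" \
    --description "Tools and reference data for an RNA-seq workflow" \
    --no-reboot
```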
SageMaker is the AWS ML/AI development platform as well as its hosted/managed Jupyter notebook platform. Notebooks are ideal for certain problems, particularly tutorials, because you can mix code with instructions. They are also great for exploring your data or workflow one portion at a time, since the code is broken up into small chunks that you can run one by one, which lends itself very well to most ML/AI problems. You can also open a terminal within JupyterLab, so you can switch between a traditional terminal and the notebook interface. The notebook we are going to test here is inside this repo, but we are going to launch a SageMaker instance and then copy the notebook into AWS programmatically.
Follow our SageMaker Notebook guide to learn how to spin up an instance and walk through an example notebook focused on genome-wide association studies.
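Notebook instances can also be created from the CLI, roughly as sketched below. The instance name and IAM role ARN are placeholders; the role must allow SageMaker to act on your behalf.

```bash
# Create a small notebook instance
aws sagemaker create-notebook-instance \
    --notebook-instance-name my-gwas-notebook \
    --instance-type ml.t3.medium \
    --role-arn arn:aws:iam::123456789012:role/MySageMakerRole

# Check when it is ready, then stop it when you are done to avoid idle charges
aws sagemaker describe-notebook-instance --notebook-instance-name my-gwas-notebook
aws sagemaker stop-notebook-instance --notebook-instance-name my-gwas-notebook
```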
Virtual environments allow you to manage package versions without having package conflicts. For example, if you needed Python 3 for one analysis, but Python 2.7 for another, you could create separate environments to use the two versions of Python. One of the most popular package managers used for creating virtual environments is the conda package manager.
Mamba is a re-implementation of conda written in C++ that runs much faster than legacy conda. Follow our guide to create a conda environment using Mamba in an EC2 or SageMaker instance.
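Once Mamba is installed, creating and reproducing an environment looks roughly like the sketch below; the environment name and package choices are just illustrative.

```bash
# Create an environment with a pinned Python and a couple of bioinformatics tools
mamba create -n myenv -c conda-forge -c bioconda python=3.10 samtools bwa

# Activate it (depending on how your shell was initialized, `conda activate myenv` may be needed instead)
mamba activate myenv

# Later, export the environment so it can be recreated elsewhere
mamba env export > environment.yml
mamba env create -f environment.yml
```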
You can host containers within Amazon Elastic Container Registry. We outline how to build a container, push to Elastic Container Registry, and pull to a compute environment in our docs.
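The typical build-and-push cycle looks roughly like this; the repository name, account ID, and region are placeholders for your own values.

```bash
# Create a repository in Elastic Container Registry
aws ecr create-repository --repository-name my-analysis

# Authenticate Docker to your registry (substitute your account ID and region)
aws ecr get-login-password --region us-east-1 | \
    docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

# Build, tag, and push the image
docker build -t my-analysis .
docker tag my-analysis:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-analysis:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-analysis:latest
```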
Further, you can manage your git repositories within your AWS account using AWS CodeCommit. Here we outline how to create a repository, authenticate to it, then push and pull files using standard git commands.
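A minimal sketch of that flow is below; the repository name and region are placeholders, and the authentication setup (covered in the linked docs) must be in place before the clone will succeed.

```bash
# Create the repository
aws codecommit create-repository --repository-name my-project

# Clone it over HTTPS (substitute your region), then use ordinary git commands
git clone https://git-codecommit.us-east-1.amazonaws.com/v1/repos/my-project
cd my-project
git add analysis.sh
git commit -m "Add analysis script"
git push
```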
One great thing about the cloud is its ability to scale with demand. When you submit a job to a traditional cluster, you specify up front how many CPUs and how much memory to give your job, and you may over- or under-utilize these resources. With managed resources such as serverless services and clusters, you can leverage a feature called autoscaling, where the compute resources scale up or down with demand. This is more efficient and keeps costs down when demand is low, but prevents latency when demand is high (think of workshop participants all submitting jobs at the same time to a cluster). For most users of Cloud Lab, the best way to leverage scaling is to use AWS Batch, but in some cases, maybe for a whole lab group or a large project, it may make sense to spin up a Kubernetes cluster. Note that if you spin up resources in Batch, you will need to deactivate the compute environment (in Batch) and delete the autoscaling groups (in EC2) to avoid further charges.
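Once a Batch compute environment, job queue, and job definition exist, submitting work is a single call, roughly as below; the names here are placeholders for resources you would have set up already.

```bash
# Submit a job to an existing queue using an existing job definition
aws batch submit-job \
    --job-name sample1-alignment \
    --job-queue my-batch-queue \
    --job-definition my-alignment-jobdef

# Check on running jobs in that queue
aws batch list-jobs --job-queue my-batch-queue --job-status RUNNING
```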
Many Cloud Lab users are interested in understanding how to estimate the price of a large scale project using a reduced sample size. Generally, you should be able to benchmark with a few representative samples to get an idea of time and cost required for a larger scale project. Follow our Cost Management Guide to see how to tag specific resources for workflow benchmarking.
The best way to estimate costs up front is the AWS Pricing Calculator here, a tool that forecasts costs based on the products and usage you specify. Then you can run some benchmarks and double-check that everything behaves as you expect. For example, if you know that your analysis on your on-premises cluster takes 4 hours to run for a single sample with 12 CPUs, and that each sample needs about 30 GB of storage, then you can extrapolate how much the full project may cost using the calculator (e.g., EC2 + S3).
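As a back-of-the-envelope example of that extrapolation, the sketch below scales a single-sample benchmark to 100 samples. The per-hour and per-GB prices are illustrative placeholders, not current list prices; confirm rates in the Pricing Calculator before budgeting.

```bash
# Back-of-the-envelope extrapolation for 100 samples (prices are placeholders):
#   compute: 4 hr/sample on a 16-vCPU instance at ~$0.68/hr  -> 100 * 4 * 0.68   ≈ $272
#   storage: 30 GB/sample in S3 at ~$0.023/GB-month          -> 100 * 30 * 0.023 ≈ $69/month
echo "estimated compute: ~\$272, estimated S3 storage: ~\$69/month"
```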
Follow our Cost Management Guide for details on how to monitor costs, set up budget alerts, and cost-benchmark specific analyses using resource tagging. In addition, here are a few tips to help you stay on budget.
- Configure auto-shutdown on your EC2 instances following this guide. This will prevent you from accidentally leaving instances running.
- Make sure you shut down other resources after you use them, and periodically 'clean up' your account. This can include S3 buckets, virtual machines/notebooks, Batch environments, and CloudFormation stacks. For Batch environments, you will also need to go to EC2 and delete the autoscaling groups (far bottom left option on the EC2 page).
- Ensure that you are using all the compute resources you have provisioned. If you spin up a VM with 16 CPUs, you can check whether they are all being utilized using CloudWatch (see the CLI sketch after this list). If you are only using 8 of them, for example, change your machine size to fit the analysis. You can also view our CPU optimization guide here.
- Explore using Spot Instances or Reserved Instances for running workflows.
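The CloudWatch check mentioned above can be done from the console or with a CLI call like the sketch below; the instance ID and time window are placeholders.

```bash
# Average CPU utilization for one instance over one day, in 1-hour buckets
aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
    --start-time 2024-01-01T00:00:00Z \
    --end-time 2024-01-02T00:00:00Z \
    --period 3600 \
    --statistics Average
```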
As part of your participation in Cloud Lab, you will be added to the Cloud Lab Teams channel, where you can chat with other Cloud Lab users and get support from the Cloud Lab team. NIH Intramural users can submit a support ticket to Service Now. For issues related to the cloud environment, feel free to request AWS Enterprise Support. For issues related to scientific use cases, such as how best to run an RNA-seq pipeline in AWS, email us at CloudLab@nih.gov.
If you have a question about Quota Limits, visit our documentation on how to request a limit increase.
This repo only scratches the surface of what can be done in the cloud. If you are interested in additional cloud training opportunities, please visit the STRIDES Training page. For more information on the STRIDES Initiative at the NIH, visit our website or contact the NIH STRIDES team at STRIDES@nih.gov.