Tutorial for the Sevenbridges Cancer Genomics Cloud

CancerGenomicsCloud

Sevenbridges cancer genomics cloud tutorial

There is a complete tutorial on using R for the Sevenbridges Cancer Genomics Cloud, by Tengfei Yin.

I have also written two additional READMEs to support the practical and theoretical course I give at IARC in February 2017:

1. R API to analyse TCGA data on the Cancer Genomics Cloud

1.1 Introduction

Sevenbridges maintains a GitHub repository containing an API client, CWL schema, meta schema and SDK helper in R, here.

Tutorials from Tengfei Yin for multiple tasks can be found on GitHub, including interesting ones for TCGA data analysis:

A good general approach for using the R API to analyse TCGA data is the following:

  1. Create your docker image or use an existing one.
  2. Choose the machine you want (the default is m4.2xlarge: 8 CPUs, 32 GB, 40 cents/hour). You pay for at least 1 hour.
  3. Create a tool or a workflow (directly in R, or by importing a CWL file, which can be written in either JSON or YAML).
  4. Add specific data to your project (use queries so that the selection stays reproducible).
  5. Run your analysis by looping over your files.
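Because at least one hour is billed, a rough cost estimate per task can help when choosing a machine in step 2. The following is a minimal sketch in base R, not part of the CGC API: the function name `estimate_task_cost` and its arguments are hypothetical, the default rate is the 40 cents/hour quoted above, and it assumes that time beyond the first hour is billed proportionally.

```r
# Rough cost estimate (in cents) for tasks running on one instance each.
# Assumptions: `rate_cents_per_hour` defaults to the m4.2xlarge price quoted
# above, and time beyond the minimum is billed proportionally (not rounded up).
estimate_task_cost <- function(run_hours, rate_cents_per_hour = 40, min_hours = 1) {
  stopifnot(is.numeric(run_hours), all(run_hours >= 0))
  # Any task shorter than the minimum is billed as if it ran `min_hours`.
  billed_hours <- pmax(run_hours, min_hours)
  billed_hours * rate_cents_per_hour
}

# Three tasks lasting 15 minutes, 1 hour and 2.5 hours.
costs <- estimate_task_cost(c(0.25, 1, 2.5))
print(costs)
```

With these assumptions, the 15-minute task costs as much as the 1-hour task, which is why very short jobs are comparatively expensive.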

1.2 Examples of tools

Platypus / bgzip Workflow

1.3 Data queries

1.4 Example of TCGA analysis: germline calling on lung samples

The steps are the following:

  • use the GUI to add lung BAM files to your project
  • use this R script to:
      1. load the platypus and bgzip JSON tools
      2. connect them into a workflow
      3. add the workflow to your project (the two previous steps can be skipped if your app is already present in the project)
      4. loop over the BAM files to run the variant calling on each sample
      5. download each VCF file locally
      6. transfer each VCF from the local computer to the IARC HPC
      7. delete the VCF files on the CGC (first check that every VCF file was downloaded correctly)
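The check recommended before deleting files on the CGC can be done with a small local test. The following is a minimal sketch in base R, independent of the R script mentioned above: the helper name `vcf_looks_valid` is hypothetical, and it only verifies that each file exists, is non-empty and starts with the standard `##fileformat=VCF` header line. It assumes the bgzip output can be read as an ordinary gzip stream, and it does not detect files truncated later on.

```r
# Hypothetical helper: returns TRUE for each path that exists, is non-empty,
# and whose first line is a VCF header. This is a sanity check only; it does
# not prove that the whole file was transferred.
vcf_looks_valid <- function(paths) {
  vapply(paths, function(path) {
    if (!file.exists(path) || file.size(path) == 0) return(FALSE)
    # gzfile() reads gzip-compressed files and plain text files alike.
    con <- gzfile(path, open = "rt")
    on.exit(close(con), add = TRUE)
    first_line <- readLines(con, n = 1, warn = FALSE)
    length(first_line) == 1 && startsWith(first_line, "##fileformat=VCF")
  }, logical(1), USE.NAMES = FALSE)
}

# Example with two temporary files: one with a valid header, one empty.
good <- tempfile(fileext = ".vcf")
writeLines("##fileformat=VCFv4.2", good)
bad <- tempfile(fileext = ".vcf")
invisible(file.create(bad))
print(vcf_looks_valid(c(good, bad)))
```

Only files for which the helper returns `TRUE` should be considered for deletion on the CGC; comparing file sizes or checksums with the remote copy would be a stronger check.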

1.5 Task monitoring

The R API can be used to analyse several task features, such as:

  • task execution time (queue + run)
  • task price (computing + storage)

This script is an example of task analysis, which produces this sort of picture.
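The two features listed above are simple sums, which can be sketched in base R as follows. This is not the script referred to above: the table, its column names and all of its values are made up for illustration, and in practice they would be filled from the task details returned by the API.

```r
# Hypothetical task table; column names and values are invented for illustration.
tasks <- data.frame(
  task          = c("sample_A", "sample_B", "sample_C"),
  queue_min     = c(5, 12, 3),         # minutes spent waiting in the queue
  run_min       = c(95, 140, 60),      # minutes spent running
  compute_price = c(0.80, 1.20, 0.40), # computing part of the price
  storage_price = c(0.05, 0.08, 0.02)  # storage part of the price
)

# Execution time = queue + run; price = computing + storage.
tasks$total_min   <- tasks$queue_min + tasks$run_min
tasks$total_price <- tasks$compute_price + tasks$storage_price

# Per-task summary and overall totals.
print(tasks[, c("task", "total_min", "total_price")])
totals <- colSums(tasks[, c("total_min", "total_price")])
print(totals)
```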

1.6 Issues (the API is currently in development)

2. Amazon Web Services (AWS) utils

2.1 Spot Instances

The CGC uses two types of Amazon EC2 pricing for instances: On-Demand and Spot. On-Demand instances are purchased at a fixed rate, while the price of Spot Instances varies according to supply and demand.

  • the CGC strategy is to bid the On-Demand instance price for Spot Instances
  • AWS EC2 will terminate your Spot Instance if bid price < market price
  • in this case, the task will continue on an On-Demand instance
  • if a Spot Instance is terminated before 1 hour of running, it is not charged
  • Spot Instances are not recommended for time-critical jobs
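The rules in this list can be expressed as a small decision function. The following is a minimal sketch in base R that only restates the bullets above; the function name `spot_outcome`, its arguments and the market prices in the example are hypothetical, and it does not query AWS or the CGC.

```r
# Sketch of the Spot rules listed above (names are hypothetical).
# The bid equals the On-Demand price; the instance is terminated
# when the bid falls below the market price.
spot_outcome <- function(on_demand_price, market_price, hours_run) {
  bid <- on_demand_price
  terminated <- bid < market_price
  list(
    terminated = terminated,
    # A Spot Instance terminated before 1 hour of running is not charged.
    spot_charged = !(terminated && hours_run < 1),
    # After a termination, the task continues on an On-Demand instance.
    continues_on = if (terminated) "On-Demand" else "Spot"
  )
}

# Market price rises above the bid after 30 minutes (illustrative values).
str(spot_outcome(on_demand_price = 0.40, market_price = 0.55, hours_run = 0.5))
```

The fallback to On-Demand keeps the task alive but restarts the pricing at the fixed rate, which is consistent with the advice not to rely on Spot Instances for time-critical jobs.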