# Introduction

The jbrowse-adam project implements a sample integration between JBrowse and ADAM file formats.

## Preliminary preparations:

To run jbrowse-adam you need some genomic data files. Sample files are attached to the project, but they are physically stored in Git LFS (https://git-lfs.github.com/). Please install this extension before cloning the project so that the files are downloaded properly. The file local.conf is already configured to use these genomic data files in local mode.
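
For example, a typical first-time setup might look like this (a minimal sketch, assuming Git LFS is installed through your system's package manager):

    # set up the Git LFS hooks once per machine
    git lfs install

    # clone the project; the LFS-tracked sample files are downloaded during checkout
    git clone --recursive https://github.com/FusionWorks/jbrowse-adam.git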

Alternatively, you can convert the full dataset (tutorial_files.zip at ftp://gsapubftp-anonymous@ftp.broadinstitute.org) to the ADAM format; how to do this is described below.

## Launch application

###To run jbrowse-adam in "local-mode":

Before starting, install the latest versions of Java, Scala, and SBT if they are not already installed.
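
You can check what is already installed, for example:

    java -version
    scala -version
    sbt about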

In the jbrowse-adam folder, do one of the following:

  • type sbt "run local", or
  • launch sbt and type re-start local, or
  • set config.path = "local" in application.conf and type sbt run.
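
For reference, the local-mode switch in application.conf is a single key; a minimal sketch (the other settings in the real file are omitted):

    # application.conf (sketch)
    config.path = "local"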

###To run jbrowse-adam in "cluster-mode":

By default nothing needs to be changed in application.conf, since config.path = "cluster" is already set.

However, the paths in cluster.conf need to be corrected; see the EMR example below.
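
As a rough illustration, a corrected entry might look like the following sketch (only the filePath key and the s3n:// scheme come from this guide; the bucket path shown and the surrounding structure of the real cluster.conf are assumptions):

    # cluster.conf (sketch): point every filePath at your S3 bucket
    filePath = "s3n://your-bucket/genomic-data/file_data.vcf.adam"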

### Example of launching on an Amazon EMR cluster

#### Cluster creation

We tested the project with the following cluster settings (when creating the cluster, switch to Advanced options):

Software and Steps:

  • Vendor: Amazon
  • Release: emr-4.3.0
  • Check Hadoop 2.7.1 and Spark 1.6.0; uncheck the others

Hardware:

  • Master: 1x m3.xlarge
  • Core: 2x m3.xlarge
  • Task: 10x m3.xlarge

We also tested on r3.xlarge EC2 instances and saw a performance boost when processing really big data.

General cluster Settings:

  • Uncheck termination protection

Security:

  • EC2 Key Pair - this must be created by an EC2 admin and specified here. The key pair is needed for SSH access.
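
    If you prefer the AWS CLI to the console, a key pair can be created and protected like this (a sketch; the key name is a placeholder):

    # create the key pair and restrict access to the saved private key
    aws ec2 create-key-pair --key-name you-key-pair --query 'KeyMaterial' --output text > ~/you-key-pair.pem
    chmod 600 ~/you-key-pair.pem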

Press the Create Cluster button.

Now wait for the cluster to start (about 7 minutes).

#### Access the cluster via SSH and web browser

  • In Cluster details, find Master public DNS and press SSH

  • Copy the string for accessing the cluster from the console:

    ssh -i ~/you-key-pair.pem hadoop@ec2-XX-XX-XXX-XXX.us-west-1.compute.amazonaws.com

    This assumes that the key pair file you-key-pair.pem is in the user's home directory, e.g. /home/user, and has access rights 600 (chmod 600 ~/you-key-pair.pem).

  • SSH into the EMR master instance with the command above.

  • To watch the cluster working in a browser, press Enable web connection and follow the instructions; details are below.

#### Preparing data and code base

  • Upload the genomic data to an S3 bucket. To reduce delays, the S3 bucket should be located in the same region as the EMR cluster.
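    For example, with the AWS CLI (a sketch; the bucket and file names are placeholders):
    aws s3 cp /path/to/genetic/file_data.bam s3://your-bucket/genomic-data/file_data.bam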
  • Install git and sbt on the cluster:
    sudo yum install git
    curl https://bintray.com/sbt/rpm/rpm | sudo tee /etc/yum.repos.d/bintray-sbt-rpm.repo
    sudo yum install sbt
  • Clone the code with:

    git clone --recursive https://github.com/FusionWorks/jbrowse-adam.git

  • cd jbrowse-adam

  • Edit the paths to the genomic data:

    nano src/main/resources/cluster.conf

    Change all filePath entries to your own paths in the S3 bucket (s3n://...)

    Ctrl+O - save changes, Ctrl+X - exit

  • Assemble the code with: sbt assembly

    While the project is assembling, you can drink some tea; it is a long process.

  • Launch jbrowse-adam with the command:

    spark-submit \
    --master yarn-client \
    --num-executors 50 \
    --executor-memory 8g \
    --packages org.bdgenomics.adam:adam-core:0.16.0 \
    --class md.fusionworks.adam.jbrowse.Boot target/scala-2.10/jbrowse-adam-assembly-0.1.jar

This command works for extremely big genomic files (35+ GB). You may decrease these options or remove them entirely (to use the default values): --num-executors, --executor-memory, --driver-memory.

#### See results in the browser:

Assume that we have the master public DNS ec2-XX-XXX-XXX-XXX.us-west-1.compute.amazonaws.com. It appears in Cluster details.

When the web connection is enabled, we can access some interesting addresses:

  • JBrowse: http://ec2-XX-XXX-XXX-XXX.us-west-1.compute.amazonaws.com:8080
  • Spark jobs: http://ec2-XX-XXX-XXX-XXX.us-west-1.compute.amazonaws.com:4040
  • Alternatively, we can view the Spark jobs with CSS styles via Cluster details -> Resource Manager -> Application master.

#### Terminate the cluster job:

  • Ctrl+C

### Convert genomic data to ADAM format (local example):

    cd jbrowse-adam
    sbt console

Then, in the Scala console:

    import md.fusionworks.adam.jbrowse.tools._
    AdamConverter.vcfToADAM("file:///path/to/genetic/file_data.vcf", "file:///path/to/genetic/file_data.vcf.adam")

Available operations:

  • fastaToADAM
  • vcfToADAM
  • bam_samToADAM
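
Assuming the other converters take the same (input path, output path) arguments as vcfToADAM, their usage from the sbt console would look like this sketch (the paths are placeholders):

    import md.fusionworks.adam.jbrowse.tools._

    // FASTA reference (signature assumed to mirror vcfToADAM)
    AdamConverter.fastaToADAM("file:///path/to/genetic/file_data.fasta", "file:///path/to/genetic/file_data.fasta.adam")

    // BAM/SAM alignments (signature assumed to mirror vcfToADAM)
    AdamConverter.bam_samToADAM("file:///path/to/genetic/file_data.bam", "file:///path/to/genetic/file_data.bam.adam")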

If you get out-of-memory errors, give the JVM more memory. For example:

    sbt console -J-XX:-UseGCOverheadLimit -J-Xms1024M -J-Xmx2048M -J-XX:+PrintFlagsFinal

### Convert genomic data to ADAM format (EMR/S3 example):

    cd jbrowse-adam

    spark-submit \
    --master yarn-client \
    --num-executors 50 \
    --conf spark.executor.memory=8g \
    --driver-memory=8g \
    --packages org.bdgenomics.adam:adam-core:0.16.0 \
    --class md.fusionworks.adam.jbrowse.tools.ConvertToAdam \
    target/scala-2.10/jbrowse-adam-assembly-0.1.jar \
    s3n://path/to/legacy/genetic/file_data.bam \
    s3n://path/to/new/adam/genetic/file_data.bam.adam

This example works for extremely big files (35+ GB). You may decrease these options or remove them entirely (to use the default values): --num-executors, --conf spark.executor.memory, --driver-memory.