# eggo

Eggo is two things:

- A CLI for easily provisioning fully-functioning Hadoop clusters (CDH) using Cloudera Director
- A set of Parquet-formatted public 'omics data sets in S3 for easily performing integrative genomics on the Hadoop stack (including Spark and Impala)

Eggo includes all of the scripts for processing the data, including the necessary DDL statements to register the data sets with the Hive Metastore and make them accessible to Hive/Impala. At the moment, Eggo is geared specifically towards scaling up variant stores and related functionality (e.g., population genomics, clinical genomics).

The pre-converted data sets are hosted in a publicly available S3 bucket:

```
s3://bdg-eggo
```

See the `datasets/` directory for a list of available data sets (with metadata conforming to the DataPackage spec).
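Once the data and a cluster are in place, the Parquet files can be queried directly with Spark. The snippet below is a minimal sketch, assuming a Spark 1.x installation with s3n credentials configured in Hadoop; the dataset path under `s3n://bdg-eggo/` is a placeholder, not an actual path.

```python
# Minimal sketch: querying an eggo Parquet data set from S3 with PySpark.
# Assumes Spark 1.x and s3n credentials configured in Hadoop; the dataset
# path below is a placeholder -- see datasets/ for the real locations.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="eggo-example")
sqlContext = SQLContext(sc)

genotypes = sqlContext.parquetFile("s3n://bdg-eggo/<dataset-path>")
genotypes.registerTempTable("genotypes")
print(sqlContext.sql("SELECT COUNT(*) FROM genotypes").collect())
```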
## Getting started

```
pip install git+https://github.com/bigdatagenomics/eggo.git
```

Eggo makes use of Fabric, Boto, and Click.
## `eggo` command -- provisioning clusters

Simply run `eggo` at the command line. The `eggo` tool expects the following four environment variables:

- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `EC2_KEY_PAIR` -- the name of the EC2-registered key pair to use for instance authentication
- `EC2_PRIVATE_KEY_FILE` -- the local path to the corresponding private key
```
$ eggo -h
Usage: eggo [OPTIONS] COMMAND [ARGS]...

  eggo -- provisions Hadoop clusters in AWS using Cloudera Director

Options:
  -h, --help  Show this message and exit.

Commands:
  describe   Describe the EC2 instances in the cluster
  login      Login to gateway node of cluster
  provision  Provision a new cluster on AWS
  setup      DOES NOTHING AT THE MOMENT
  teardown   Tear down a cluster and stack on AWS
```
### eggo provision

```
$ eggo provision -h
Usage: eggo provision [OPTIONS]

  Provision a new cluster on AWS

Options:
  --region TEXT                  AWS Region  [default: us-east-1]
  --stack-name TEXT              Stack name for CloudFormation and cluster
                                 name  [default: bdg-eggo]
  --availability-zone TEXT       AWS Availability Zone  [default: us-east-1b]
  --cf-template-path TEXT        Path to AWS Cloudformation Template
                                 [default: /usr/local/lib/python2.7/site-packages/eggo-0.1.0.dev0-py2.7.egg/eggo/cluster/cloudformation.template]
  --launcher-ami TEXT            The AMI to use for the launcher node
                                 [default: ami-00a11e68]
  --launcher-instance-type TEXT  The instance type to use for the launcher
                                 node  [default: m3.medium]
  --director-conf-path TEXT      Path to Director conf for AWS cloud
                                 [default: /usr/local/lib/python2.7/site-packages/eggo-0.1.0.dev0-py2.7.egg/eggo/cluster/aws.conf]
  --cluster-ami TEXT             The AMI to use for the worker nodes
                                 [default: ami-00a11e68]
  -n, --num-workers INTEGER      The total number of worker nodes to provision
                                 [default: 3]
  -h, --help                     Show this message and exit.
```
## Eggo data sets

### datasets/

*(OLD -- marked for deletion.)*
### registry/

The `registry/` directory contains the metadata for the data sets we ingest and convert to ADAM/Parquet. Each data set is stored as a JSON file loosely based on the Data Protocols spec.
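Since the registry entries are plain JSON, they are easy to inspect programmatically. The snippet below is only a small sketch; it assumes no particular schema, so consult the actual files under `registry/` for the real field names.

```python
# Sketch: registry entries are plain JSON files, so the standard library can
# load them. The metadata schema is defined by the files in registry/ (loosely
# following the Data Protocols spec); no particular fields are assumed here.
import json

with open('registry/1kg.json') as f:  # the 1000 Genomes entry used elsewhere in this README
    dataset = json.load(f)

print(sorted(dataset.keys()))  # list whatever metadata fields the entry defines
```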
### Environment

If using AWS, ensure the following variables are set locally:

```
export SPARK_HOME= # local path to Spark
export EC2_KEY_PAIR= # EC2 name of the registered key pair
export EC2_PRIVATE_KEY_FILE= # local path to associated private key (.pem file)
export AWS_ACCESS_KEY_ID= # AWS credentials
export AWS_SECRET_ACCESS_KEY= # AWS credentials
```

The following variables must be set remotely, which can be done by sourcing `eggo-ec2-variables.sh`:

```
export AWS_ACCESS_KEY_ID= # AWS credentials
export AWS_SECRET_ACCESS_KEY= # AWS credentials
export SPARK_HOME= # remote path to Spark
export ADAM_HOME= # remote path to ADAM
export SPARK_MASTER= # Spark master host name
```
### Setting up a cluster

Set `EGGO_EXP=TRUE` to have the setup commands use the `experiment` branch of eggo.

```
cd path/to/eggo

# provision a cluster on EC2 with 5 slave (worker) nodes
fab provision:5,r3.2xlarge

# configure proper environment on the instances
fab setup_master
fab setup_slaves

# (Cloudera infra-only)
./tag-my-instances.py

# get an interactive shell on the master node
fab login

# destroy the cluster
fab teardown
```
There is experimental support for using Cloudera Director to provision a cluster. This is useful for running a cluster with more services, including YARN, the Hive metastore, and Impala; however, it takes longer (>30 min) to bring up a cluster than the Spark EC2 scripts.

```
# provision a cluster on EC2 with 5 worker nodes
fab provision_director

# run a proxy to access Cloudera Manager via http://localhost:7180
# type 'exit' to quit process
fab cm_web_proxy

# log in to the gateway node
fab login_director

# destroy the cluster
fab teardown_director
```
### Converting data sets

The `toast` command will build the Luigi DAG for downloading the necessary data to S3 and running the ADAM command to transform it to Parquet.

```
# toast the 1000 Genomes data set
fab toast:registry/1kg.json
```
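To make the shape of that DAG concrete, here is a minimal, self-contained sketch of a two-task Luigi pipeline in the same spirit (download, then convert). The task names, paths, and the stubbed-out conversion step are hypothetical and are not eggo's actual implementation.

```python
# Hypothetical sketch of a toast-like Luigi DAG: a download task feeding a
# conversion task. In eggo the conversion step shells out to ADAM to produce
# Parquet; here it is stubbed so the example runs on its own.
import luigi


class DownloadRawData(luigi.Task):
    """Fetch the raw source files named in a registry JSON file."""
    config = luigi.Parameter()  # path to a registry JSON file

    def output(self):
        return luigi.LocalTarget('/tmp/eggo-sketch/raw.txt')

    def run(self):
        with self.output().open('w') as f:
            f.write('raw data placeholder\n')


class ConvertToParquet(luigi.Task):
    """Transform the downloaded data (stub for the ADAM conversion step)."""
    config = luigi.Parameter()

    def requires(self):
        return DownloadRawData(config=self.config)

    def output(self):
        return luigi.LocalTarget('/tmp/eggo-sketch/converted.txt')

    def run(self):
        with self.input().open('r') as src, self.output().open('w') as dst:
            dst.write(src.read())


if __name__ == '__main__':
    # e.g. python sketch.py ConvertToParquet --config registry/1kg.json --local-scheduler
    luigi.run()
```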
### Configuration

Environment variables that should be set are the ones checked by `verify_env` (see below).

Example `spark-ec2` invocations for launching, logging into, and destroying the `eggo` cluster:

```
ec2/spark-ec2 -k laserson-cloudera -i ~/.ssh/laserson-cloudera.pem -s 3 -t m3.large -z us-east-1a --delete-groups --copy-aws-credentials launch eggo
ec2/spark-ec2 -k laserson-cloudera -i ~/.ssh/laserson-cloudera.pem login eggo
ec2/spark-ec2 -k laserson-cloudera -i ~/.ssh/laserson-cloudera.pem destroy eggo
```

Query the EC2 instance metadata service for an instance's public hostname:

```
curl http://169.254.169.254/latest/meta-data/public-hostname
```

The required environment variables are verified with Fabric's `require`:

```python
from fabric.api import require

def verify_env():
    require('SPARK_HOME')
    require('EC2_KEY_PAIR')
    require('EC2_PRIVATE_KEY_FILE')
    require('AWS_ACCESS_KEY_ID')
    require('AWS_SECRET_ACCESS_KEY')
```
TODO: have two CLI commands: `eggo` for users and `toaster` for maintainers.
## Testing

You can run Eggo from a local machine, which is helpful while developing Eggo itself. Ensure that Hadoop, Spark, and ADAM are all installed. Set up the environment with:
```
export AWS_DEFAULT_REGION=us-east-1
export EPHEMERAL_MOUNT=/tmp
export ADAM_HOME=~/workspace/adam
export HADOOP_HOME=~/sw/hadoop-2.5.1/
export SPARK_HOME=~/sw/spark-1.3.0-bin-hadoop2.4/
export SPARK_MASTER_URI=local
export STREAMING_JAR=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.5.1.jar
export PATH=$PATH:$HADOOP_HOME/bin
```
By default, datasets will be stored on S3, and you will need to set `fs.s3n.awsAccessKeyId` and `fs.s3n.awsSecretAccessKey` in Hadoop's `core-site.xml` file.

To store datasets locally, set the `EGGO_BASE_URL` environment variable to a Hadoop path:

```
export EGGO_BASE_URL=file:///tmp/bdg-eggo
```
Generate a test dataset with

```
bin/toaster.py --local-scheduler VCF2ADAMTask --ToastConfig-config test/registry/test-genotypes.json
```

or

```
bin/toaster.py --local-scheduler BAM2ADAMTask --ToastConfig-config test/registry/test-alignments.json
```

You can delete the test datasets with

```
bin/toaster.py --local-scheduler DeleteDatasetTask --ToastConfig-config test/registry/test-genotypes.json
bin/toaster.py --local-scheduler DeleteDatasetTask --ToastConfig-config test/registry/test-alignments.json
```
## NEW config-file-based organization

Concepts:

- dfs: the target "distributed" filesystem that will contain the final ETL'd data
- workers: the machines on which the ETL is executed
- worker_env: the environment assumed to be available on the worker machines, including env variables and paths to write data
- client: the local machine from which the CLI commands are issued
- client_env: the environment assumed on the local machine

The only environment variable that MUST be set on the local client machine is `EGGO_CONFIG`. This config file will be deployed across all relevant worker machines. Other local client env vars that will be respected include `SPARK_HOME`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `EC2_KEY_PAIR`, and `EC2_PRIVATE_KEY_FILE`. Everything else is derived from the `EGGO_CONFIG` file.

One of the workers is designated as the master, which is where the computations are executed. This node needs additional configuration.
```
eggo provision
eggo deploy_config
eggo setup_master
eggo setup_slaves
eggo delete_all:config=$EGGO_HOME/test/registry/test-genotypes.json
eggo toast:config=$EGGO_HOME/test/registry/test-genotypes.json
eggo teardown
```
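As a rough illustration of what the `eggo deploy_config` step implies, the client could push the local `EGGO_CONFIG` file to every worker with Fabric, which eggo already depends on. This is a sketch under assumptions, not eggo's actual code; the task name and the `worker_hosts` list are hypothetical.

```python
# Illustrative sketch only: pushing the local EGGO_CONFIG file to each worker
# with Fabric 1.x. Task names, host lists, and remote paths are hypothetical;
# this is not eggo's actual API.
import os
from fabric.api import execute, put, task


@task
def deploy_config():
    # copy the local config to the same path on the remote machine
    config_path = os.environ['EGGO_CONFIG']
    put(local_path=config_path, remote_path=config_path)


def deploy_config_to_workers(worker_hosts):
    # worker_hosts: e.g. a list of hostnames for the provisioned cluster
    execute(deploy_config, hosts=worker_hosts)
```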