/hail-on-AWS-spot-instances

An option to spin up cost-effective EMR clusters on AWS with Hail and Jupyter Notebook installed

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

Hail on Amazon EMR: cloudformation tool with spot instances

This cloudformation tool (macOS and Linux compatible) creates an EMR 5.23.0 cluster with Spark 2.4.0 using spot instances, a cost-effective way to deploy clusters in which you set a maximum bid price. Once your cluster is up and running, it will have the latest Hail 0.2 version and Jupyter Lab installed. A sample file in the notebook folder is pre-loaded in Jupyter Lab for you to use as a starting point.

IMPORTANT: Software requirements

This tool requires the following programs to be installed on your computer beforehand (see details in the section Before getting started):

  • Python 3, pip and some additional Python libraries
  • Amazon's Command Line Interface (CLI) utility

To install the required software, open a terminal and execute the following:

For macOS

# Installs Homebrew (the older Ruby-based installer is deprecated)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Installs python3
brew install python3
# Upgrades pip
pip3 install --upgrade pip
# Installs additional libraries
sudo -H pip3 install boto3 pandas botocore paramiko pyyaml nose tornado
# If the previous command does not work, try the following
sudo -H python3 -m pip install boto3 pandas botocore paramiko pyyaml nose tornado
# Installs AWS CLI
brew install awscli

For Ubuntu

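# Installs the Linuxbrew prerequisites
sudo apt-get -y install build-essential curl file git
# Installs Linuxbrew (Homebrew on Linux)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Adds brew to the PATH
echo 'export PATH="/home/linuxbrew/.linuxbrew/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc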
# Installs python3
brew install python3
# Upgrades pip
pip3 install --upgrade pip
# Installs additional libraries
sudo -H pip3 install boto3 pandas botocore paramiko pyyaml nose tornado
# If the previous command does not work, try the following
sudo -H python3 -m pip install boto3 pandas botocore paramiko pyyaml nose tornado
# Installs AWS CLI
brew install awscli
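
As a quick sanity check after installation, a short Python script along the following lines can report which prerequisites are still missing. This is only a sketch and not part of this repo: the function names are made up for illustration, and the module list simply mirrors the pip command above (note that the pyyaml package is imported as yaml).

```python
import importlib.util
import shutil

# Import names for the packages installed with pip above.
# The "pyyaml" package is imported under the name "yaml".
REQUIRED_MODULES = ["boto3", "pandas", "botocore", "paramiko", "yaml", "nose", "tornado"]


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def aws_cli_available():
    """Return True if an `aws` executable is on the PATH."""
    return shutil.which("aws") is not None


if __name__ == "__main__":
    print("Missing Python modules:", missing_modules(REQUIRED_MODULES) or "none")
    print("AWS CLI found:", aws_cli_available())
```

If the script lists any module, re-run the corresponding pip3 install command for it.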

Before getting started

This tool is executed from the command line using Amazon's CLI utility. Before you start, make sure you have:

a) A configured CLI account. From the terminal, execute aws configure (see the AWS CLI documentation for additional information). If your CLI account has already been configured, the tool will use that configuration by default. To use a specific account or a different user, execute aws configure again and re-enter your credentials.

b) A valid EC2 key pair. See the Amazon EC2 documentation to learn how to create and use your key. Safety remark: once you have your key, make sure to set the proper permissions for it: chmod 400 my-key.pem.
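
To confirm that the chmod 400 step took effect, the key's permissions can be inspected from Python. The helper below is an illustrative sketch (key_is_private is a hypothetical name, not part of this tool), and it assumes a POSIX file system such as macOS or Linux. It only checks that group and others have no access to the file, which is what chmod 400 guarantees.

```python
import os
import stat


def key_is_private(path):
    """Return True if the file at `path` grants no access to group or others."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0
```

For example, key_is_private("/full-path-to/my-key/my-key.pem") should return True once the permissions are set as described above.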

How to use this cloudformation tool

  1. Open a terminal and clone this repository: git clone https://github.com/hms-dbmi/hail-on-AWS-spot-instances

  2. Change directories: cd hail-on-AWS-spot-instances/src

  3. Using the text editor of your preference (sublime, atom, vi, emacs, etc.), update the configuration file config_EMR_spot.yaml following the instructions below. This file provides the information needed to create your cluster, and the cluster will not be created successfully unless its fields are filled in correctly. Complete these instructions before heading to step 4.

    Instructions to properly configure your config_EMR_spot.yaml file

    This file provides the information needed to create the cluster (do not change the name of the file). Give a name to your EMR_CLUSTER_NAME and add meaningful information by properly identifying your EC2_NAME_TAG, OWNER_TAG and PROJECT_TAG. The sample below uses region us-east-1, one m4.large master node and four r4.xlarge worker nodes. You can change all these parameters to whatever suits your application.

    config:
      EMR_CLUSTER_NAME: "my-hail-02-cluster" # Give a name to your EMR cluster
      EC2_NAME_TAG: "my-hail-EMR" # Adds a tag to the individual EC2 instances
      OWNER_TAG: "emr-owner" # EC2 owner tag
      PROJECT_TAG: "my-project" # Project tag
      REGION: "us-east-1"
      MASTER_INSTANCE_TYPE: "m4.large" # Suggested EC2 instances, change as desired 
      WORKER_INSTANCE_TYPE: "r4.xlarge" # Suggested EC2 instances, change as desired 
      WORKER_COUNT: "4" # Number of worker nodes
      WORKER_BID_PRICE: "0.44" # Required for spot instances
      MASTER_HD_SIZE: "50" # Size in GB - For large data sets, more HD space may be required
      WORKER_HD_SIZE: "150" # Size in GB - For large data sets, more HD space may be required (i.e. ~500GB for the 1KG Phase 3)
      SUBNET_ID: "" # This field can be either left blank or for further security you can specify your private subnet ID in the form: subnet-1a2b3c4d
      S3_BUCKET: "s3n://my-s3-bucket/" # Specify your S3 bucket for EMR log storage
      KEY_NAME: "my-key" # Input your key name ONLY! DO NOT include the .pem extension
      PATH_TO_KEY: "/full-path-to/my-key/" # Full path to the FOLDER where the .pem file resides
      WORKER_SECURITY_GROUP: "" # If empty creates a new group by default. You can also add a specific SG. See the SG link in the FAQs section
      MASTER_SECURITY_GROUP: "" # If empty creates a new group by default. You can also add a specific SG. See the SG link in the FAQs section
      HAIL_VERSION: "current" # Specify a git hash version (the first 7-12 characters will suffice) to install a specific commit/version. When left empty or "current" will install the latest version of Hail available in the repo

    3.1. Select the EC2 instances for your MASTER_INSTANCE_TYPE and your WORKER_INSTANCE_TYPE. It is recommended to use a small generic EC2 instance for the master, such as m4.large, and more powerful instances (compute or memory optimized) for your worker nodes, such as r4.4xlarge or m4.4xlarge. See the AWS documentation on EC2 instance types for the available options.

    Suggested EC2s (WORKER_INSTANCE_TYPE)
    c4.4xlarge
    r4.2xlarge
    r4.4xlarge
    m4.4xlarge
    i3.4xlarge

    Since we are using spot instances, the worker nodes require a maximum bid price to be specified. The field WORKER_BID_PRICE specifies the maximum price that we will pay for each of the worker nodes. To choose an accurate and competitive bid price for your worker nodes, log in to the EMR management console:

    Click on Create cluster:

    Then, click on Go to advanced options:

    You will be taken to Step 1: Software and Steps, click Next:

    Here, click on the instance type selection pencil (1) to find your worker node type. Within the list select your desired instance type and click on the Save button. Next, hover over the i icon (2) to show the current spot price for such instance:

    Prices vary with demand and with the subnet and its corresponding Availability Zone (subnet-053f834c and zone us-east-1a in this example); the zone is what dictates the bid price. A good practice is to identify the current prices per subnet/zone and bid slightly above that price, so that you are promptly provisioned with instances. Even if you specify a higher bid price, you will still pay less if a lower price is available for your zone. The example below shows a suggested bid of $0.44 for r4.4xlarge instances in zones 1a and 1c:

    3.2. For your SUBNET_ID you can either specify the subnet from the previous step (e.g. subnet-053f834c) or choose a specific one from the VPC Dashboard by clicking on Subnets on the left panel:

    For instance pricing, follow the guidelines from step 3.1. The price is given by the zone where your subnet is located.
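
The "bid slightly above the current price" guideline from step 3.1 can also be approximated from Python instead of the console. The sketch below is not part of this tool and makes several assumptions: the function names are illustrative, the 5% margin is an arbitrary choice, and picking the cheapest zone is only one possible policy (if you set SUBNET_ID, use the price of that subnet's zone instead). The lookup uses boto3's describe_spot_price_history call, does not handle paginated results, and needs a configured AWS account; suggest_bid works on plain data and can be tried offline.

```python
from datetime import datetime, timezone
from decimal import Decimal, ROUND_UP


def latest_spot_prices(instance_type, region="us-east-1"):
    """Return {availability_zone: Decimal price}, keeping the newest record per zone."""
    import boto3  # imported here so suggest_bid() can be used without AWS access

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.describe_spot_price_history(
        InstanceTypes=[instance_type],
        ProductDescriptions=["Linux/UNIX"],
        StartTime=datetime.now(timezone.utc),
    )
    newest = {}
    for record in response["SpotPriceHistory"]:
        zone = record["AvailabilityZone"]
        if zone not in newest or record["Timestamp"] > newest[zone]["Timestamp"]:
            newest[zone] = record
    return {zone: Decimal(rec["SpotPrice"]) for zone, rec in newest.items()}


def suggest_bid(prices_by_zone, margin="0.05"):
    """Pick the cheapest zone and return (zone, bid), with the bid rounded up to cents."""
    prices = {zone: Decimal(str(price)) for zone, price in prices_by_zone.items()}
    zone = min(prices, key=prices.get)
    bid = (prices[zone] * (1 + Decimal(margin))).quantize(Decimal("0.01"), rounding=ROUND_UP)
    return zone, bid
```

A call such as suggest_bid(latest_spot_prices("r4.4xlarge")) returns a zone and a value that can be copied into WORKER_BID_PRICE.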

    3.3. The S3_BUCKET field specifies a location to store all the logs of your cluster (e.g. s3n://my-s3-bucket/). If you leave it blank ("") the log folder will be created under your S3 root folder. The log folder will have the same name as your automatically assigned EMR cluster ID (e.g. j-123EMRID3210).

    3.4. The KEY_NAME field must contain the name of your key without the extension. If your key file is my-key.pem, only put my-key. The PATH_TO_KEY field requires the full path to the folder where the key file resides. For additional details about your key, see the Before getting started section above.

    3.5. To specify the WORKER_SECURITY_GROUP and MASTER_SECURITY_GROUP, go to the VPC Dashboard and, from the left panel, select Security >> Security Groups. Note: if these two fields are left empty (the default in the configuration file), the security groups are assigned automatically. IMPORTANT: to access Jupyter Lab from the browser, port 8192 has to be added to the inbound rules of your MASTER_SECURITY_GROUP. To do this, once you are in the Security Groups page, select your desired group:

    Click on the Inbound Rules tab to double check that ports 8192 and 22 are on the list. To add/edit port rules click on Edit rules and use one of the two configurations suggested below:

    See the AWS documentation on security groups for additional details.

    3.6. To run your analysis with a specific version of Hail, set HAIL_VERSION to either the abbreviated or the full SHA-1 hash of the corresponding git commit. The script accepts any hash of 7 to 40 characters. The default is "current". If no hash is given, or if the given hash is not found, the latest available version is installed.
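
Before moving on to step 4, the rules from steps 3.1 to 3.6 can be checked automatically. The following is a minimal sketch, not something shipped with this repo: load_config and validate_config are hypothetical helpers, and the checks cover only the constraints stated in this README (no .pem extension in KEY_NAME, a positive WORKER_BID_PRICE, a subnet-... style SUBNET_ID or blank, and a HAIL_VERSION that is "current", empty, or a 7-40 character hash). It assumes the fields sit under a top-level config key, as in the sample file above.

```python
import re

# Abbreviated or full SHA-1 hash: 7 to 40 hexadecimal characters.
GIT_HASH = re.compile(r"[0-9a-fA-F]{7,40}")


def load_config(path="config_EMR_spot.yaml"):
    """Read the YAML file and return the mapping under its top-level `config` key."""
    import yaml  # provided by the pyyaml package

    with open(path) as handle:
        return yaml.safe_load(handle)["config"]


def validate_config(cfg):
    """Return a list of human-readable problems found in `cfg` (empty if none)."""
    problems = []

    key_name = str(cfg.get("KEY_NAME") or "")
    if not key_name:
        problems.append("KEY_NAME is empty")
    elif key_name.endswith(".pem"):
        problems.append("KEY_NAME must not include the .pem extension")

    try:
        if float(cfg.get("WORKER_BID_PRICE")) <= 0:
            problems.append("WORKER_BID_PRICE must be greater than zero")
    except (TypeError, ValueError):
        problems.append("WORKER_BID_PRICE must be a number")

    subnet = str(cfg.get("SUBNET_ID") or "")
    if subnet and not subnet.startswith("subnet-"):
        problems.append("SUBNET_ID must be blank or look like subnet-1a2b3c4d")

    version = str(cfg.get("HAIL_VERSION") or "")
    if version not in ("", "current") and not GIT_HASH.fullmatch(version):
        problems.append("HAIL_VERSION must be 'current' or a 7-40 character git hash")

    return problems
```

Running validate_config(load_config()) from the src folder returns an empty list when none of these checks fail.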

  4. Once the configuration file is filled in and saved, go back to the terminal and, from the src folder (hail-on-AWS-spot-instances/src), execute the command: sh cloudformation_hail_spot.sh. The EMR cluster creation takes 7-10 minutes (depending on EC2 availability). DO NOT terminate the script execution, as it will automatically print the IP address to connect to the Jupyter Notebook in the form: 123.456.0.1:8192. Here's a sample screenshot showing what you get once the cluster is successfully created:

(Optional) The full log of the EMR provisioning can be found at: /tmp/cloudcreation_log.out.

  5. You can check the status of the EMR creation at: https://console.aws.amazon.com/elasticmapreduce. The EMR cluster is successfully created once it shows the status Waiting and a solid green circle to the left of the cluster name.

After the cluster is created, allow time for the automatic program installation and configuration (~5-8 minutes depending on the number of worker nodes). No action is required other than waiting for the installation process to complete. (Optional) The script also provides the public DNS to connect to the master node. See the AWS documentation for instructions on how to connect to the master node (NOTE: use username hadoop) to monitor cluster progress and status (the program installation log on the master node of your EMR cluster is saved at the path: /tmp/cloudcreation_log.out):
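
If you prefer to watch for the Waiting status from a script rather than the console, a polling loop along these lines is one option. It is a sketch under stated assumptions: wait_until_ready and emr_state_getter are illustrative names that do not exist in this repo, the 30-second interval and 40-poll limit are arbitrary, and the cluster ID (e.g. j-123EMRID3210) must be supplied by you. The state lookup relies on boto3's EMR describe_cluster call.

```python
import time

READY_STATE = "WAITING"
FAILED_STATES = {"TERMINATING", "TERMINATED", "TERMINATED_WITH_ERRORS"}


def wait_until_ready(get_state, poll_seconds=30, max_polls=40, sleep=time.sleep):
    """Poll `get_state()` until the cluster is WAITING.

    Returns True when ready, False if `max_polls` is exhausted, and raises
    RuntimeError if the cluster terminates instead.
    """
    for _ in range(max_polls):
        state = get_state()
        if state == READY_STATE:
            return True
        if state in FAILED_STATES:
            raise RuntimeError(f"cluster entered state {state}")
        sleep(poll_seconds)
    return False


def emr_state_getter(cluster_id, region="us-east-1"):
    """Return a function that fetches the current state of the given EMR cluster."""
    import boto3  # imported here so wait_until_ready() can be tested without AWS

    emr = boto3.client("emr", region_name=region)
    return lambda: emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]["State"]
```

Typical usage would be wait_until_ready(emr_state_getter("j-123EMRID3210")). Keeping the state lookup separate from the loop makes the loop easy to try out with a stand-in function.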

Launching Jupyter Lab

To launch Jupyter Lab, paste the IP address given earlier (123.456.0.1:8192; this is the master node's IP pointing to port 8192) into a browser and hit Enter; once you see the following screen:

use password: avillach to log in. If you log in successfully, you are all set!
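
If the login screen does not load, it can help to check whether port 8192 on the master node is reachable at all, since a missing inbound rule in the master security group is a likely cause. The snippet below is a rough sketch using only the Python standard library; port_is_open is a hypothetical helper, and the 3-second timeout is an arbitrary choice.

```python
import socket


def port_is_open(host, port=8192, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Call it with the master node's IP address printed by the script. A False result suggests reviewing the inbound rules of your MASTER_SECURITY_GROUP, or waiting a few more minutes for the installation to finish.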

FAQs and troubleshooting

  • If, after executing sh cloudformation_hail_spot.sh, you get an error message saying that "variable cluster_id_json is out of range", it means that the CLI command aws emr create-cluster --applications Name=Hadoop Name=Spark ... did not retrieve a cluster ID. This error can occur for different reasons, such as a defective AWS account configuration (aws configure) or a user that needs additional permissions such as AmazonElasticMapReduce* or AmazonEC2*.

  • Default IAM roles for EMR are not present in your account by default. If you have not created them, you may encounter the error:

An error occurred (ValidationException) when calling the RunJobFlow operation: Invalid InstanceProfile: EMR_EC2_DefaultRole.

If this comes up, you can fix it with the command:

aws emr create-default-roles
  • Sometimes you may get sudden or unexpected errors. One of the reasons may be that your initial spot instances can be dropped and replaced by new instances (that's how the spot instance model works). This cloudformation tool checks for this behavior every minute and will fix everything for you. A common error when an instance is replaced is:
FatalError: ClassNotFoundException: is.hail.kryo.HailKryoRegistrator
  • For this and other Jupyter Lab glitches, you only need to restart the kernel by clicking on Kernel >> Restart or Restart & Run All: