/terraform-emr-spark-example

Spark on Amazon EMR + Terraform

An example Terraform project that configures a secure and customizable Spark cluster on Amazon EMR. Zeppelin is installed as an interface to Spark, and Ganglia is installed for monitoring.

Features

This project shows how to extend the base functionality of Amazon EMR to provide a more secure (and potentially compliant) working environment for running Spark workloads.

At-Rest Encryption

There are two places in this project where data is stored: Amazon S3 and Hadoop HDFS, which runs on the EMR nodes.

Amazon S3 buckets are configured with AES-256 encryption, using your Amazon account's default encryption key. A custom KMS key could be used if desired.
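As a rough illustration, default AES-256 encryption on a bucket could be declared as in the sketch below. It assumes AWS provider v4 or later, where bucket encryption is a separate resource. The bucket name and resource labels are hypothetical and are not taken from this project's s3 module.

```hcl
# Hypothetical bucket; the real bucket is created by the s3 module.
resource "aws_s3_bucket" "example" {
  bucket = "my-emr-cluster-1-example"
}

# Encrypt every new object with AES-256 by default.
resource "aws_s3_bucket_server_side_encryption_configuration" "example" {
  bucket = aws_s3_bucket.example.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
      # For a custom KMS key, use "aws:kms" and set kms_master_key_id instead.
    }
  }
}
```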

Hadoop HDFS is configured through an EMR Security Configuration, which makes the EMR nodes use LUKS to encrypt the data on the EBS volumes with a custom KMS encryption key.
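A minimal sketch of such a Security Configuration is shown below, assuming Terraform 0.12 or later (for jsonencode). The resource labels, names, and the KMS key are hypothetical; the actual configuration lives in this project's modules and may differ.

```hcl
# Hypothetical KMS key used for local disk encryption.
resource "aws_kms_key" "emr" {
  description = "Example key for EMR local disk encryption"
}

# Security Configuration that turns on at-rest encryption for local disks.
resource "aws_emr_security_configuration" "example" {
  name = "example-at-rest-encryption"

  configuration = jsonencode({
    EncryptionConfiguration = {
      EnableInTransitEncryption = false
      EnableAtRestEncryption    = true
      AtRestEncryptionConfiguration = {
        LocalDiskEncryptionConfiguration = {
          EncryptionKeyProviderType = "AwsKms"
          AwsKmsKey                 = aws_kms_key.emr.arn
        }
      }
    }
  })
}
```

An aws_emr_cluster resource would then reference it with security_configuration = aws_emr_security_configuration.example.name. The EMR instance role also needs permission to use the key, which is omitted here.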

In-Flight Encryption

This project will create self-signed certificates to demonstrate in-flight encryption. The certificates are used in three places:

Ideally, you or your organization would swap out a self-signed certificate for a certificate generated by a trusted certificate authority.
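For orientation, a self-signed certificate can be generated inside Terraform with the tls provider, roughly as sketched below. The hostname, organization, and resource labels are hypothetical, and this is not necessarily how this project builds its certificates.

```hcl
# Private key for the certificate.
resource "tls_private_key" "example" {
  algorithm = "RSA"
  rsa_bits  = 2048
}

# Self-signed certificate, suitable for demonstration only.
resource "tls_self_signed_cert" "example" {
  private_key_pem = tls_private_key.example.private_key_pem

  subject {
    common_name  = "example.internal" # hypothetical hostname
    organization = "Example Org"      # hypothetical organization
  }

  validity_period_hours = 8760 # one year

  allowed_uses = [
    "key_encipherment",
    "digital_signature",
    "server_auth",
  ]
}
```

Replacing this with a certificate from a trusted authority would mean supplying the issued certificate and key in place of these two resources.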

Auditing

It is important to track who accesses which data and when. This project demonstrates this in several ways:

  • EMR Logging: EMR has an internal tool called "Logpusher" which will send files written to /var/log/ to the S3 Bucket created and configured in the EMR Module.
  • Zeppelin Logging: A log4j configuration is placed on the system to log actions against Zeppelin.
  • S3 Bucket Logging: A second bucket is created to track object access (gets/puts/deletes) to the main S3 bucket.
  • Zeppelin Notebooks: Zeppelin notebooks are configured to save to the S3 Bucket.
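The S3 bucket logging item above could look roughly like the following sketch, assuming AWS provider v4 or later. Bucket names, resource labels, and the prefix are hypothetical.

```hcl
# Hypothetical main bucket holding EMR logs and notebooks.
resource "aws_s3_bucket" "main" {
  bucket = "my-emr-cluster-1-main"
}

# Hypothetical second bucket that receives access logs.
resource "aws_s3_bucket" "access_logs" {
  bucket = "my-emr-cluster-1-access-logs"
}

# Record object access on the main bucket into the second bucket.
resource "aws_s3_bucket_logging" "main" {
  bucket        = aws_s3_bucket.main.id
  target_bucket = aws_s3_bucket.access_logs.id
  target_prefix = "s3-access/"
}
```

The log bucket must also allow the S3 logging service to write to it; that permission is left out of this sketch.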

Identity

This project also provides an example of how to instrument identity with Zeppelin.

Zeppelin employs Shiro. By providing a shiro.ini file, Zeppelin can have a user database. This project employs a basic example by hard-coding usernames and passwords into the file, but it could be extended to connect to an LDAP server or potentially another Identity Service.

Infrastructure As Code

Having your infrastructure represented "as code" means that you can use standard code review practices and continuous integration methodologies to make changes to your infrastructure. Under the Apache 2.0 License, you are welcome to fork this repository and customize it to your or your organization's needs.

By tracking what changes are made, by whom, and when, you can easily audit and control changes to your infrastructure.

Architecture

This project leverages Terraform Modules, and relies heavily on the EMR Cluster resource.

The layout of this project is as follows:

main.tf      --> The main Terraform file, which includes the modules listed below
config.tf    --> General Terraform configuration, versions, etc.
variables.tf --> Variables needed for Terraform to execute, which also includes
                 defaults
outputs.tf   --> The output of the ELB's address for the Master Node
modules/
    bootstrap/  
        --> This module copies files to S3 so EMR can run a script right after
            the EC2 instances are provisioned
    emr/
        --> This module creates the EMR cluster and configuration files
    lb/
        --> This module creates a Load Balancer so Zeppelin is accessible from
            your system
    s3/
        --> This module creates S3 buckets needed by EMR
    sec/
        --> This module creates some Security Fundamentals needed by EMR.
            NOTE: Please DO NOT USE THIS in production, it's in place purely
            for demonstrative purposes.  See "Security Module" below.
    sgs/
        --> This module creates the Security Groups needed for EMR and the Load
            Balancers.
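To give a feel for how main.tf might tie these modules together, here is a hedged sketch. Only the module directory names and the vpc_id and cluster_name variables come from this README; every other input and output name is hypothetical.

```hcl
# Input and output names other than vpc_id and cluster_name are hypothetical.
module "s3" {
  source       = "./modules/s3"
  cluster_name = var.cluster_name
}

module "sgs" {
  source = "./modules/sgs"
  vpc_id = var.vpc_id
}

module "emr" {
  source       = "./modules/emr"
  cluster_name = var.cluster_name
  log_bucket   = module.s3.bucket_name   # hypothetical output of the s3 module
  master_sg_id = module.sgs.master_sg_id # hypothetical output of the sgs module
}
```

Passing one module's outputs as another module's inputs lets Terraform infer the creation order without explicit dependencies.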

Building

Before building, make sure you are comfortable with how Terraform works.

Prerequisites

Terraform will use the AWS credentials provided in your shell environment. You will need an AWS user account available that has the following permissions:

  • View/Create/Update/Delete IAM, KMS, and Certificates
  • View/Create/Update/Delete S3 Buckets and Objects
  • View/Create/Update/Delete Security Groups
  • View/Create/Update/Delete Load Balancers
  • View VPC Subnets

Initialization

Before executing your first plan, you need to initialize Terraform:

$> terraform init

Planning

Terraform's "plan" step shows what it would change without actually making any changes.

$> terraform plan -var 'vpc_id=vpc-abcde123' -var 'cluster_name=my_emr_cluster_1' 
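If repeating the -var flags becomes tedious, the same values can be placed in a terraform.tfvars file, which Terraform loads automatically from the working directory. The sketch below reuses the example values from the command above.

```hcl
# terraform.tfvars (loaded automatically from the working directory)
vpc_id       = "vpc-abcde123"
cluster_name = "my_emr_cluster_1"
```

With this file in place, terraform plan can be run without any -var arguments.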

Applying

Finally, after initializing and planning, you can apply your changes, which will actually create or update your cluster based on the plan:

$> terraform apply -var 'vpc_id=vpc-abcde123' -var 'cluster_name=my_emr_cluster_1'

Destroying

If you want Terraform to clean up everything it created, you can destroy the cluster:

$> terraform destroy -var 'vpc_id=vpc-abcde123' -var 'cluster_name=my_emr_cluster_1'

Security Module

This project provides a Security Module that is intended to be replaced with the security requirements of your organization. PLEASE DO NOT use this Security Module in production. It is there for demonstrative purposes only, but it can serve as a guide to what needs to be replaced to use this project in production.

It provides bootstrapping for the following:

  • IAM Roles
    • Three IAM roles are created, one for EMR to create infrastructure, a role (instance profile) to attach to each one of the EC2 instances, and a role for EMR to use for autoscaling.
  • Network
    • Whitelisting: Instead of opening your EMR cluster to the world, this module looks up your public IP using ifconfig.co and adds that IP to a few security groups (SSH for the Nodes and HTTPS for the LB).
    • Subnets: This module fetches all subnets in the VPC provided as a variable. It spins up EMR in the first subnet and attaches the LB to the rest. You will likely want to specify which subnets to use for both.
  • SSH
    • It will create and save an SSH key to connect to the cluster. The private SSH key is saved in generated/ssh/.
  • Zeppelin
    • Zeppelin is equipped with a Load Balancer for (easier) access. It is also equipped with a Self-Signed Certificate for encrypted communication between the ELB and the Zeppelin process, and another Self-Signed Certificate for the Internet. You should not use Self-Signed Certificates in production; switch these out for valid certificates.
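The whitelisting step described in the Network item could be sketched as follows. This assumes the hashicorp/http provider v3 or later (for response_body) and assumes that the /ip path on ifconfig.co returns the caller's address as plain text. The variable and resource labels are hypothetical.

```hcl
# Hypothetical input: the security group attached to the EMR nodes.
variable "node_security_group_id" {
  type = string
}

# Look up the public IP of the machine running Terraform.
# The /ip path is an assumption; this README only names ifconfig.co.
data "http" "my_ip" {
  url = "https://ifconfig.co/ip"
}

locals {
  # Strip the trailing newline and turn the address into a single-host CIDR.
  my_cidr = "${chomp(data.http.my_ip.response_body)}/32"
}

# Allow SSH only from that address.
resource "aws_security_group_rule" "ssh_from_me" {
  type              = "ingress"
  from_port         = 22
  to_port           = 22
  protocol          = "tcp"
  cidr_blocks       = [local.my_cidr]
  security_group_id = var.node_security_group_id
}
```

Because the lookup runs on every plan, the rule changes whenever your public IP changes, which is one reason this approach is only suitable for demonstration.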

FAQs

I've changed something in bootstrap, why didn't that get applied?

Terraform isn't the best at tracking changes to resources in the bootstrap module. Sometimes you have to let Terraform know you need to destroy and recreate the EMR cluster by executing the following command:

$> terraform taint -var 'vpc_id=vpc-abcde123' -var 'cluster_name=my_emr_cluster_1' -module=emr aws_emr_cluster.cluster
$> terraform apply -var 'vpc_id=vpc-abcde123' -var 'cluster_name=my_emr_cluster_1'

I want this to run in another region, what do I do?

Currently this project is hard-coded to run in AWS US-West-2 (Oregon), and there are two things you have to change to make it run in another region:

How do I login to Zeppelin?

After the cluster builds, it will output the DNS Name of the Load Balancer that was created:

$> terraform apply...
...
Outputs:

dns_name = my_emr_cluster_1-default-1234567890.us-west-2.elb.amazonaws.com

You can navigate to https://my_emr_cluster_1-default-1234567890.us-west-2.elb.amazonaws.com in your browser. You will have to ignore the certificate warning, since this example project creates self-signed SSL certs for demonstrative purposes.

Finally, you can find the Username and Password for Zeppelin hard-coded in the Shiro Configuration File.

NOTE: Please DO NOT hard code your Usernames and Passwords for Zeppelin in production, or check them into Git. This is in place purely for demonstrative purposes. Zeppelin offers a few Authentication Options.
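For reference, an output such as the dns_name value shown above is typically declared in outputs.tf along these lines. The module output name is hypothetical; check the lb module for the real one.

```hcl
output "dns_name" {
  description = "DNS name of the load balancer in front of Zeppelin"
  value       = module.lb.dns_name # hypothetical module output name
}
```

After an apply, the value can be printed again at any time with terraform output dns_name.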

How do I SSH into this cluster?

For demonstrative purposes, SSH is allowed to the public IP of the system that runs Terraform.

Also, this project will create an SSH key to connect to the cluster. After Terraform is applied, the generated SSH key is placed in the generated/ folder, so you can SSH into the cluster with the following command:

$> ssh -i generated/ssh/my_emr_cluster_1-default ip_address_of_a_node

AWS also provides additional SSH connection help in the EMR Console.
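A key pair like the one described above can be generated within Terraform itself. The sketch below assumes the tls and local providers (local v2.2 or later for local_sensitive_file); the key name and resource labels are hypothetical and may not match this project's sec module.

```hcl
# Generate an RSA key pair inside Terraform.
resource "tls_private_key" "ssh" {
  algorithm = "RSA"
  rsa_bits  = 4096
}

# Register the public half with EC2 so it can be attached to the cluster.
resource "aws_key_pair" "ssh" {
  key_name   = "my_emr_cluster_1-default" # hypothetical key name
  public_key = tls_private_key.ssh.public_key_openssh
}

# Write the private half to disk with owner-only permissions.
resource "local_sensitive_file" "ssh_private_key" {
  content         = tls_private_key.ssh.private_key_pem
  filename        = "${path.module}/generated/ssh/my_emr_cluster_1-default"
  file_permission = "0600"
}
```

Note that a key generated this way is also stored in the Terraform state file, so the state itself must be treated as a secret.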

The Terraform state file is saved locally, how do I share that with others?

It is recommended to use Terraform Remote State.
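One common option is the S3 backend, sketched below. The bucket, key, and DynamoDB table names are hypothetical and must already exist; the lock table is optional.

```hcl
terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket"   # hypothetical bucket name
    key            = "emr-spark/terraform.tfstate" # hypothetical state path
    region         = "us-west-2"
    dynamodb_table = "terraform-locks" # hypothetical table; enables state locking
    encrypt        = true
  }
}
```

After adding a backend block, run terraform init again so Terraform can offer to migrate the existing local state.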

The cluster failed to create, where do I find logs?

  • First, navigate to the EMR Console.
  • Locate your cluster; it should be named after your cluster_name and region, e.g. my_emr_cluster_1-us-west-2.
  • Second, check the logs in the S3 bucket that was created, under logs/.
    • There's a shortcut in the UI to find the log directory: click on the folder icon near Log URI.

How do I test that everything is working?

  • First, visit the Zeppelin UI (see the "How do I login to Zeppelin?" FAQ above).
  • Second, follow the Apache Zeppelin Tutorial.

Author

Christian Nuss

Collective Health SRE

https://github.com/cnuss

Copyright, License & Disclaimers

Copyright 2018 Collective Health, Inc

This project is available under the Apache 2.0 License. Please see the LICENSE file.