Automatic Deployment of a Hadoop-Spark Cluster using Terraform


This is a project I created for the Big Data Systems & Techniques course of my MSc in Data Science. It's a very basic implementation of an orchestration system that provisions and configures a 3-node cluster (the number of data nodes can easily be extended) with Apache Hadoop and Apache Spark.

Project's Task

The task of this project is to use the Terraform IaC (Infrastructure as Code) tool to automatically provision Amazon VMs and install Hadoop and Spark on the cluster. The resources used are:

  • Linux image: Ubuntu 16.04
  • Java: jdk1.8.0_131
  • Apache Hadoop: hadoop-2.7.2
  • Apache Spark: spark-2.1.1

What is Infrastructure As Code and what is Terraform?

Infrastructure as Code is a new DevOps philosophy where the application infrastructure is no longer created by hand but programmatically. The benefits are numerous, including but not limited to:

  • Speed of deployment
  • Version Control of Infrastructure
  • Engineer-agnostic infrastructure (no single point of failure/no single person to bug)
  • Better lifecycle management (automatic scale up/down, healing)
  • Cross-provider deployment with minimal changes

Terraform is an open-source tool, developed by HashiCorp, that helps in this direction.

It allows you to describe the final state that you wish your infrastructure to have, and Terraform applies the changes needed to get there.

You can provision VMs, create subnets, assign security groups and pretty much perform any action that any cloud provider allows.

Terraform supports a wide range of providers, including the big three: AWS, GCP, and Microsoft Azure.
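For illustration, a minimal configuration just declares a resource and its desired attributes, and Terraform works out the API calls needed to create it (the values below are placeholders matching this project):

resource "aws_instance" "example" {
  ami           = "ami-a8d2d7ce"  # an Ubuntu 16.04 AMI, as used later in this project
  instance_type = "t2.micro"
}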

Installing Terraform

Terraform is written in Go and is provided as a binary for the major OSs but can also be compiled from source code.

The binary can be downloaded from the Terraform site and does not require any installation. We just need to add it to the PATH variable (instructions for Linux/macOS can be found here and for Windows here) so that it is accessible from any directory on our system.
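On Linux/macOS, for example, this can be as simple as unpacking the downloaded zip and moving the binary into a directory that is already on the PATH (the file names below assume the usual zip distribution):

$ unzip ~/Downloads/terraform_*.zip -d ~/Downloads
$ sudo mv ~/Downloads/terraform /usr/local/bin/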

Once this is done, we can confirm that Terraform is ready to be used by running the terraform command; we should get something like the following:

$ terraform
Usage: terraform [--version] [--help] <command> [args]

The available commands for execution are listed below.
The most common, useful commands are shown first, followed by
less common or more advanced commands. If you're just getting
started with Terraform, stick with the common commands. For the
other commands, please read the help and docs before usage.

Common commands:
    apply              Builds or changes infrastructure
    console            Interactive console for Terraform interpolations
    destroy            Destroy Terraform-managed infrastructure
    env                Environment management
    fmt                Rewrites config files to canonical format
    get                Download and install modules for the configuration
    graph              Create a visual graph of Terraform resources
    import             Import existing infrastructure into Terraform
    init               Initialize a new or existing Terraform configuration
    output             Read an output from a state file
    plan               Generate and show an execution plan
    push               Upload this Terraform module to Atlas to run
    refresh            Update local state file against real resources
    show               Inspect Terraform state or plan
    taint              Manually mark a resource for recreation
    untaint            Manually unmark a resource as tainted
    validate           Validates the Terraform files
    version            Prints the Terraform version

All other commands:
    debug              Debug output management (experimental)
    force-unlock       Manually unlock the terraform state
    state              Advanced state management

Now we can move on to using the tool.

Setting up the AWS account

This step is not specific to this project; rather, it is something that needs to be configured whenever a new AWS account is set up. When we create a new account with Amazon, the default account we are given has root access to every action. As with the Linux root user, we do not want to use this account for day-to-day actions, so we need to create a new user.

We navigate to the Identity and Access Management (IAM) page, click on Users, then the Add user button. We provide the User name, and click the Programmatic access checkbox so that an access key ID and a secret access key will be generated.

Clicking Next, we are asked to assign a Group for this user to belong to. Groups, with the permission policies attached to them, are the main way to grant permissions and restrict access to the specific actions required. For the purposes of this project we will give this user the AdministratorAccess policy; in a professional setting, however, it is advised to grant only the permissions a user needs (like AmazonEC2FullAccess if a user will only be creating EC2 instances).

After finishing the review step, Amazon provides the Access key ID and Secret access key. We will hand these to Terraform to grant it access to create the resources for us. We need to save them now, as they are only shown once and cannot be retrieved later (we can, however, always create a new pair).

The secure way to store these credentials, as recommended by Amazon, is to keep them in a file called credentials inside a hidden .aws folder in our home directory. Terraform can then read this file to retrieve them.

$ cd
$ mkdir .aws
$ cd .aws
~/.aws$ vim credentials

We add the following to the credentials file after replacing ACCESS_KEY and SECRET_KEY and then save it:

[default]
aws_access_key_id = ACCESS_KEY
aws_secret_access_key = SECRET_KEY

We also restrict access to this file only to the current user:

~/.aws$ chmod 600 credentials 
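With this in place, the Terraform AWS provider picks the credentials up automatically. They can also be pointed to explicitly in the provider block (a minimal sketch; the region is an assumption, and the path should match your home directory):

provider "aws" {
  region                  = "us-east-1"                  # assumption: use your region of choice
  shared_credentials_file = "/home/user/.aws/credentials"  # the file created above
  profile                 = "default"
}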

Setting up a key pair

The next step is to create a key pair so that Terraform can access the newly created VMs. Notice that this is different from the credentials above: the Amazon credentials allow the AWS service to create the required resources, while this key pair will be used for accessing the new instances.

Log into the AWS console and select Create Key Pair. Add a name and click Create. AWS will create a .pem file and download it locally.

Move this file to the .aws directory.

~/Downloads$ mv ssh-key.pem ../.aws/

Then restrict the permissions:

~/.aws$ chmod 400 ssh-key.pem

Now we are ready to use this key pair, either to SSH directly into our instances or for Terraform to connect to the instances and run scripts.
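For example, a direct SSH login looks like this (ubuntu is the default user on Ubuntu AMIs; substitute the address of your instance):

$ ssh -i ~/.aws/ssh-key.pem ubuntu@<public-dns>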

Provisioning VMs & Configuring Them

The following Terraform script is responsible for creating the VM instances, copying the relevant keys to give us access to them, and running the startup script that configures the nodes.
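A condensed sketch of what such a script looks like is shown below. The AMI, instance type, key name, private IPs, and tags are taken from the plan output that follows; the provisioner details and the script name are assumptions:

provider "aws" {
  region = "us-east-1"  # assumption: use your region of choice
}

resource "aws_security_group" "instance" {
  name = "Namenode-instance"

  # SSH access; ports 80, 9000 and 50010 get analogous ingress blocks
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_instance" "Namenode" {
  ami                    = "ami-a8d2d7ce"  # Ubuntu 16.04
  instance_type          = "t2.micro"
  key_name               = "ssh-key"
  private_ip             = "172.31.32.101"
  vpc_security_group_ids = ["${aws_security_group.instance.id}"]

  tags {
    Name = "s01"
  }

  # Run the startup script that installs Java, Hadoop and Spark
  # (the script name is an assumption)
  provisioner "remote-exec" {
    script = "setup.sh"

    connection {
      user        = "ubuntu"
      private_key = "${file(pathexpand("~/.aws/ssh-key.pem"))}"
    }
  }
}

resource "aws_instance" "Datanode" {
  count                  = 2
  ami                    = "ami-a8d2d7ce"
  instance_type          = "t2.micro"
  key_name               = "ssh-key"
  private_ip             = "172.31.32.10${count.index + 2}"  # .102, .103
  vpc_security_group_ids = ["${aws_security_group.instance.id}"]

  tags {
    Name = "s0${count.index + 2}"  # s02, s03
  }
}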

We first run terraform plan, and Terraform informs us about the changes it is going to make:

$ terraform plan
Refreshing Terraform state in-memory prior to plan...
The refreshed state will be used to calculate this plan, but will not be
persisted to local or remote state storage.

The Terraform execution plan has been generated and is shown below.
Resources are shown in alphabetical order for quick scanning. Green resources
will be created (or destroyed and then created if an existing resource
exists), yellow resources are being changed in-place, and red resources
will be destroyed. Cyan entries are data sources to be read.

Note: You didn't specify an "-out" parameter to save this plan, so when
"apply" is called, Terraform can't guarantee this is what will execute.

+ aws_instance.Datanode.0
    ami:                          "ami-a8d2d7ce"
    associate_public_ip_address:  "<computed>"
    availability_zone:            "<computed>"
    ebs_block_device.#:           "<computed>"
    ephemeral_block_device.#:     "<computed>"
    instance_state:               "<computed>"
    instance_type:                "t2.micro"
    ipv6_address_count:           "<computed>"
    ipv6_addresses.#:             "<computed>"
    key_name:                     "ssh-key"
    network_interface.#:          "<computed>"
    network_interface_id:         "<computed>"
    placement_group:              "<computed>"
    primary_network_interface_id: "<computed>"
    private_dns:                  "<computed>"
    private_ip:                   "172.31.32.102"
    public_dns:                   "<computed>"
    public_ip:                    "<computed>"
    root_block_device.#:          "<computed>"
    security_groups.#:            "<computed>"
    source_dest_check:            "true"
    subnet_id:                    "<computed>"
    tags.%:                       "1"
    tags.Name:                    "s02"
    tenancy:                      "<computed>"
    volume_tags.%:                "<computed>"
    vpc_security_group_ids.#:     "<computed>"

+ aws_instance.Datanode.1
    ami:                          "ami-a8d2d7ce"
    associate_public_ip_address:  "<computed>"
    availability_zone:            "<computed>"
    ebs_block_device.#:           "<computed>"
    ephemeral_block_device.#:     "<computed>"
    instance_state:               "<computed>"
    instance_type:                "t2.micro"
    ipv6_address_count:           "<computed>"
    ipv6_addresses.#:             "<computed>"
    key_name:                     "ssh-key"
    network_interface.#:          "<computed>"
    network_interface_id:         "<computed>"
    placement_group:              "<computed>"
    primary_network_interface_id: "<computed>"
    private_dns:                  "<computed>"
    private_ip:                   "172.31.32.103"
    public_dns:                   "<computed>"
    public_ip:                    "<computed>"
    root_block_device.#:          "<computed>"
    security_groups.#:            "<computed>"
    source_dest_check:            "true"
    subnet_id:                    "<computed>"
    tags.%:                       "1"
    tags.Name:                    "s03"
    tenancy:                      "<computed>"
    volume_tags.%:                "<computed>"
    vpc_security_group_ids.#:     "<computed>"

+ aws_instance.Namenode
    ami:                          "ami-a8d2d7ce"
    associate_public_ip_address:  "<computed>"
    availability_zone:            "<computed>"
    ebs_block_device.#:           "<computed>"
    ephemeral_block_device.#:     "<computed>"
    instance_state:               "<computed>"
    instance_type:                "t2.micro"
    ipv6_address_count:           "<computed>"
    ipv6_addresses.#:             "<computed>"
    key_name:                     "ssh-key"
    network_interface.#:          "<computed>"
    network_interface_id:         "<computed>"
    placement_group:              "<computed>"
    primary_network_interface_id: "<computed>"
    private_dns:                  "<computed>"
    private_ip:                   "172.31.32.101"
    public_dns:                   "<computed>"
    public_ip:                    "<computed>"
    root_block_device.#:          "<computed>"
    security_groups.#:            "<computed>"
    source_dest_check:            "true"
    subnet_id:                    "<computed>"
    tags.%:                       "1"
    tags.Name:                    "s01"
    tenancy:                      "<computed>"
    volume_tags.%:                "<computed>"
    vpc_security_group_ids.#:     "<computed>"

+ aws_security_group.instance
    description:                           "Managed by Terraform"
    egress.#:                              "1"
    egress.482069346.cidr_blocks.#:        "1"
    egress.482069346.cidr_blocks.0:        "0.0.0.0/0"
    egress.482069346.from_port:            "0"
    egress.482069346.ipv6_cidr_blocks.#:   "0"
    egress.482069346.prefix_list_ids.#:    "0"
    egress.482069346.protocol:             "-1"
    egress.482069346.security_groups.#:    "0"
    egress.482069346.self:                 "false"
    egress.482069346.to_port:              "0"
    ingress.#:                             "4"
    ingress.2214680975.cidr_blocks.#:      "1"
    ingress.2214680975.cidr_blocks.0:      "0.0.0.0/0"
    ingress.2214680975.from_port:          "80"
    ingress.2214680975.ipv6_cidr_blocks.#: "0"
    ingress.2214680975.protocol:           "tcp"
    ingress.2214680975.security_groups.#:  "0"
    ingress.2214680975.self:               "false"
    ingress.2214680975.to_port:            "80"
    ingress.2319052179.cidr_blocks.#:      "1"
    ingress.2319052179.cidr_blocks.0:      "0.0.0.0/0"
    ingress.2319052179.from_port:          "9000"
    ingress.2319052179.ipv6_cidr_blocks.#: "0"
    ingress.2319052179.protocol:           "tcp"
    ingress.2319052179.security_groups.#:  "0"
    ingress.2319052179.self:               "false"
    ingress.2319052179.to_port:            "9000"
    ingress.2541437006.cidr_blocks.#:      "1"
    ingress.2541437006.cidr_blocks.0:      "0.0.0.0/0"
    ingress.2541437006.from_port:          "22"
    ingress.2541437006.ipv6_cidr_blocks.#: "0"
    ingress.2541437006.protocol:           "tcp"
    ingress.2541437006.security_groups.#:  "0"
    ingress.2541437006.self:               "false"
    ingress.2541437006.to_port:            "22"
    ingress.3302755614.cidr_blocks.#:      "1"
    ingress.3302755614.cidr_blocks.0:      "0.0.0.0/0"
    ingress.3302755614.from_port:          "50010"
    ingress.3302755614.ipv6_cidr_blocks.#: "0"
    ingress.3302755614.protocol:           "tcp"
    ingress.3302755614.security_groups.#:  "0"
    ingress.3302755614.self:               "false"
    ingress.3302755614.to_port:            "50010"
    name:                                  "Namenode-instance"
    owner_id:                              "<computed>"
    vpc_id:                                "<computed>"


Plan: 4 to add, 0 to change, 0 to destroy.

Then we run terraform apply to start the creation of our resources. Once it finishes, we can see that Terraform has output the DNS name of the master node so that we can log in to it and start our services.
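The DNS name is exposed through an output declaration along these lines (the output name here is an assumption):

output "namenode_public_dns" {
  value = "${aws_instance.Namenode.public_dns}"
}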

In order to remove all resources we run terraform destroy.

Using Configuration Tools

While using the above bash script is fine for a small project, we want a more advanced configuration-management tool if we are going to use Terraform in production. There are many choices here, the main one being Chef, which Terraform supports natively; we can also use the other major tools such as Ansible or Puppet, as long as they are installed on the local machine running Terraform.
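As a rough sketch, a native Chef provisioner attached to an instance looks something like the following (the server URL, run list, and key paths are all assumptions):

provisioner "chef" {
  node_name  = "namenode"
  server_url = "https://chef.example.com/organizations/example-org"
  run_list   = ["recipe[hadoop]", "recipe[spark]"]
  user_name  = "admin"
  user_key   = "${file(pathexpand("~/.chef/admin.pem"))}"

  connection {
    user        = "ubuntu"
    private_key = "${file(pathexpand("~/.aws/ssh-key.pem"))}"
  }
}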

Furthermore, HashiCorp suggests creating custom images using its Packer tool. Custom images are built by taking the base Linux (or any other) image and adding the required software; this is then packaged into a single image which is loaded onto the VM ready to be used. This saves both time and bandwidth when creating the infrastructure.

Future improvements

As mentioned above, the goal is to make this more customizable with regard to the number of nodes created, the versions of Java, Hadoop, and Spark used, and the instance type of the nodes.
