This is a project I created for the Big Data Systems & Techniques
course of my MSc in Data Science.
It's a very basic implementation of an orchestration system that provisions and configures a 3-node cluster (the number of data nodes can easily be extended) with Apache Hadoop and Apache Spark.
The task of this project is to use the Terraform IaC (Infrastructure as Code) tool to automatically provision Amazon VMs and install Hadoop on the cluster. The resources used are:
- Linux image: Ubuntu 16.04
- Java: jdk1.8.0_131
- Apache Hadoop: hadoop-2.7.2
- Apache Spark: spark-2.1.1
Infrastructure as Code is a new DevOps philosophy where the application infrastructure is no longer created by hand but programmatically. The benefits are numerous, including but not limited to:
- Speed of deployment
- Version Control of Infrastructure
- Engineer-agnostic infrastructure (no single point of failure/no single person to bug)
- Better lifetime management (automatic scale up/down, healing)
- Cross-provider deployment with minimal changes
Terraform is a tool that helps in this direction. It is open source, developed by HashiCorp, and lets you describe the final state you wish your infrastructure to have; Terraform then applies the changes needed to reach it.
You can provision VMs, create subnets, assign security groups and pretty much perform any action that any cloud provider allows.
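To give a flavour of this declarative style, a minimal configuration could look like the sketch below; the region is an assumption here, and the AMI ID is the Ubuntu 16.04 one that shows up in the execution plan later on:

# Tell Terraform which provider and region to use.
# Assumption: the region; this project does not depend on a specific one.
provider "aws" {
  region = "us-east-1"
}

# Describe the desired end state: one t2.micro VM running this AMI.
# Terraform creates it if it is missing and reports no changes otherwise.
resource "aws_instance" "example" {
  ami           = "ami-a8d2d7ce"
  instance_type = "t2.micro"
}

Running terraform apply against this file converges the real infrastructure to this description, whatever its current state.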
Terraform supports a wide range of providers, including the big three: AWS, GCP and Microsoft Azure.
Terraform is written in Go and is provided as a binary for the major OSs, but it can also be compiled from source.
The binary can be downloaded from the Terraform site and does not require any installation. We just need to add it to the PATH variable (the Terraform docs have instructions for Linux/macOS and for Windows) so that it is accessible from any path on our system.
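For instance, on Linux the whole setup boils down to unzipping the download and placing the binary somewhere already on the PATH (the exact file name depends on the version and platform downloaded):

$ unzip terraform_0.9.11_linux_amd64.zip
$ sudo mv terraform /usr/local/bin/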
After this has finished, we can confirm that it is ready to be used by running the terraform command; we should get something like the following:
$ terraform
Usage: terraform [--version] [--help] <command> [args]
The available commands for execution are listed below.
The most common, useful commands are shown first, followed by
less common or more advanced commands. If you're just getting
started with Terraform, stick with the common commands. For the
other commands, please read the help and docs before usage.
Common commands:
apply Builds or changes infrastructure
console Interactive console for Terraform interpolations
destroy Destroy Terraform-managed infrastructure
env Environment management
fmt Rewrites config files to canonical format
get Download and install modules for the configuration
graph Create a visual graph of Terraform resources
import Import existing infrastructure into Terraform
init Initialize a new or existing Terraform configuration
output Read an output from a state file
plan Generate and show an execution plan
push Upload this Terraform module to Atlas to run
refresh Update local state file against real resources
show Inspect Terraform state or plan
taint Manually mark a resource for recreation
untaint Manually unmark a resource as tainted
validate Validates the Terraform files
version Prints the Terraform version
All other commands:
debug Debug output management (experimental)
force-unlock Manually unlock the terraform state
state Advanced state management
Now we can move on to using the tool.
This step is not specific to this project; rather, it is something that needs to be configured whenever a new AWS account is set up. When we create a new account with Amazon, the default account we are given has root access to every action. As with the Linux root user, we do not want to use this account for day-to-day actions, so we need to create a new user.
We navigate to the Identity and Access Management (IAM) page, click on Users and then the Add user button. We provide the user name and tick the Programmatic access checkbox so that an access key ID and a secret access key will be generated.
Clicking Next, we are asked to add this user to a group. Groups, together with the policies attached to them, are the main way to grant permissions and restrict access to the specific actions required. For the purpose of this project we will give the AdministratorAccess policy to this user; however, in a professional setting it is advised to allow only the permissions a user actually needs (like AmazonEC2FullAccess if a user will only be creating EC2 instances).
Finishing the review step, Amazon will provide the Access key ID and Secret access key. We will hand these to Terraform to grant it access to create the resources for us. We need to keep them safe, as they are only shown once and cannot be retrieved later (however, we can always create a new pair).
The secure way to store these credentials, as recommended by Amazon, is in a file called credentials inside a hidden folder; Terraform can read them from there.
$ cd
$ mkdir .aws
$ cd .aws
~/.aws$ vim credentials
We add the following to the credentials file, after replacing ACCESS_KEY and SECRET_KEY, and then save it:
[default]
aws_access_key_id = ACCESS_KEY
aws_secret_access_key = SECRET_KEY
We also restrict access to this file only to the current user:
~/.aws$ chmod 600 credentials
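With the credentials file in place, the provider block in our Terraform configuration does not need to contain any keys at all; the AWS provider falls back to the [default] profile of ~/.aws/credentials on its own. A minimal sketch (the region is again an assumption):

provider "aws" {
  region  = "us-east-1"
  profile = "default"   # the profile name in ~/.aws/credentials
}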
The next step is to create a key pair so that Terraform can access the newly created VMs. Notice that this is different from the credentials above: the Amazon credentials allow the AWS service to create the resources required, while this key pair will be used for accessing the new instances.
Log into the AWS console and select Create Key Pair. Add a name and click Create. AWS will create a .pem file and download it locally. Move this file to the .aws directory.
~/Downloads$ mv ssh-key.pem ../.aws/
Then restrict the permissions:
$ chmod 400 ssh-key.pem
Now we are ready to use this key pair, either via a direct ssh to our instances, or for Terraform to connect to the instances and run scripts.
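For example, a direct login to a running instance looks like the following, where the user name assumes the Ubuntu AMI and the host name is a placeholder for the public DNS that AWS assigns:

$ ssh -i ~/.aws/ssh-key.pem ubuntu@ec2-XX-XX-XX-XX.compute-1.amazonaws.com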
The following Terraform script is responsible for creating the VM instances, copying the relevant keys that give us access to them, and running the startup script that configures the nodes.
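The script is not reproduced here in full; its core is a handful of resource definitions along the lines of the simplified sketch below, where the connection user and the script name are assumptions:

# The master node, matching the plan output further down.
resource "aws_instance" "Namenode" {
  ami                    = "ami-a8d2d7ce"
  instance_type          = "t2.micro"
  key_name               = "ssh-key"
  private_ip             = "172.31.32.101"
  vpc_security_group_ids = ["${aws_security_group.instance.id}"]

  tags {
    Name = "s01"
  }

  # Terraform connects over SSH with the key pair we created earlier...
  connection {
    user        = "ubuntu"                        # assumption: default user of the Ubuntu AMI
    private_key = "${file("~/.aws/ssh-key.pem")}"
  }

  # ...and runs the startup script that installs Java, Hadoop and Spark.
  provisioner "remote-exec" {
    script = "setup.sh"                           # assumption: name of the startup script
  }
}

# The data nodes differ only in IP and name, so a count keeps them DRY.
resource "aws_instance" "Datanode" {
  count         = 2
  ami           = "ami-a8d2d7ce"
  instance_type = "t2.micro"
  key_name      = "ssh-key"
  private_ip    = "172.31.32.10${count.index + 2}"

  tags {
    Name = "s0${count.index + 2}"
  }
}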
We run this with terraform plan, and Terraform informs us about the changes it is going to make:
$ terraform plan
Refreshing Terraform state in-memory prior to plan...
The refreshed state will be used to calculate this plan, but will not be
persisted to local or remote state storage.
The Terraform execution plan has been generated and is shown below.
Resources are shown in alphabetical order for quick scanning. Green resources
will be created (or destroyed and then created if an existing resource
exists), yellow resources are being changed in-place, and red resources
will be destroyed. Cyan entries are data sources to be read.
Note: You didn't specify an "-out" parameter to save this plan, so when
"apply" is called, Terraform can't guarantee this is what will execute.
+ aws_instance.Datanode.0
ami: "ami-a8d2d7ce"
associate_public_ip_address: "<computed>"
availability_zone: "<computed>"
ebs_block_device.#: "<computed>"
ephemeral_block_device.#: "<computed>"
instance_state: "<computed>"
instance_type: "t2.micro"
ipv6_address_count: "<computed>"
ipv6_addresses.#: "<computed>"
key_name: "ssh-key"
network_interface.#: "<computed>"
network_interface_id: "<computed>"
placement_group: "<computed>"
primary_network_interface_id: "<computed>"
private_dns: "<computed>"
private_ip: "172.31.32.102"
public_dns: "<computed>"
public_ip: "<computed>"
root_block_device.#: "<computed>"
security_groups.#: "<computed>"
source_dest_check: "true"
subnet_id: "<computed>"
tags.%: "1"
tags.Name: "s02"
tenancy: "<computed>"
volume_tags.%: "<computed>"
vpc_security_group_ids.#: "<computed>"
+ aws_instance.Datanode.1
ami: "ami-a8d2d7ce"
associate_public_ip_address: "<computed>"
availability_zone: "<computed>"
ebs_block_device.#: "<computed>"
ephemeral_block_device.#: "<computed>"
instance_state: "<computed>"
instance_type: "t2.micro"
ipv6_address_count: "<computed>"
ipv6_addresses.#: "<computed>"
key_name: "ssh-key"
network_interface.#: "<computed>"
network_interface_id: "<computed>"
placement_group: "<computed>"
primary_network_interface_id: "<computed>"
private_dns: "<computed>"
private_ip: "172.31.32.103"
public_dns: "<computed>"
public_ip: "<computed>"
root_block_device.#: "<computed>"
security_groups.#: "<computed>"
source_dest_check: "true"
subnet_id: "<computed>"
tags.%: "1"
tags.Name: "s03"
tenancy: "<computed>"
volume_tags.%: "<computed>"
vpc_security_group_ids.#: "<computed>"
+ aws_instance.Namenode
ami: "ami-a8d2d7ce"
associate_public_ip_address: "<computed>"
availability_zone: "<computed>"
ebs_block_device.#: "<computed>"
ephemeral_block_device.#: "<computed>"
instance_state: "<computed>"
instance_type: "t2.micro"
ipv6_address_count: "<computed>"
ipv6_addresses.#: "<computed>"
key_name: "ssh-key"
network_interface.#: "<computed>"
network_interface_id: "<computed>"
placement_group: "<computed>"
primary_network_interface_id: "<computed>"
private_dns: "<computed>"
private_ip: "172.31.32.101"
public_dns: "<computed>"
public_ip: "<computed>"
root_block_device.#: "<computed>"
security_groups.#: "<computed>"
source_dest_check: "true"
subnet_id: "<computed>"
tags.%: "1"
tags.Name: "s01"
tenancy: "<computed>"
volume_tags.%: "<computed>"
vpc_security_group_ids.#: "<computed>"
+ aws_security_group.instance
description: "Managed by Terraform"
egress.#: "1"
egress.482069346.cidr_blocks.#: "1"
egress.482069346.cidr_blocks.0: "0.0.0.0/0"
egress.482069346.from_port: "0"
egress.482069346.ipv6_cidr_blocks.#: "0"
egress.482069346.prefix_list_ids.#: "0"
egress.482069346.protocol: "-1"
egress.482069346.security_groups.#: "0"
egress.482069346.self: "false"
egress.482069346.to_port: "0"
ingress.#: "4"
ingress.2214680975.cidr_blocks.#: "1"
ingress.2214680975.cidr_blocks.0: "0.0.0.0/0"
ingress.2214680975.from_port: "80"
ingress.2214680975.ipv6_cidr_blocks.#: "0"
ingress.2214680975.protocol: "tcp"
ingress.2214680975.security_groups.#: "0"
ingress.2214680975.self: "false"
ingress.2214680975.to_port: "80"
ingress.2319052179.cidr_blocks.#: "1"
ingress.2319052179.cidr_blocks.0: "0.0.0.0/0"
ingress.2319052179.from_port: "9000"
ingress.2319052179.ipv6_cidr_blocks.#: "0"
ingress.2319052179.protocol: "tcp"
ingress.2319052179.security_groups.#: "0"
ingress.2319052179.self: "false"
ingress.2319052179.to_port: "9000"
ingress.2541437006.cidr_blocks.#: "1"
ingress.2541437006.cidr_blocks.0: "0.0.0.0/0"
ingress.2541437006.from_port: "22"
ingress.2541437006.ipv6_cidr_blocks.#: "0"
ingress.2541437006.protocol: "tcp"
ingress.2541437006.security_groups.#: "0"
ingress.2541437006.self: "false"
ingress.2541437006.to_port: "22"
ingress.3302755614.cidr_blocks.#: "1"
ingress.3302755614.cidr_blocks.0: "0.0.0.0/0"
ingress.3302755614.from_port: "50010"
ingress.3302755614.ipv6_cidr_blocks.#: "0"
ingress.3302755614.protocol: "tcp"
ingress.3302755614.security_groups.#: "0"
ingress.3302755614.self: "false"
ingress.3302755614.to_port: "50010"
name: "Namenode-instance"
owner_id: "<computed>"
vpc_id: "<computed>"
Plan: 4 to add, 0 to change, 0 to destroy.
Then we run terraform apply to start the creation of our resources. Once it finishes, we can see that Terraform has output the DNS name of the master node, so we can log in to it and start our services.
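The DNS name comes from an output block along these lines (the output name is illustrative):

output "namenode_public_dns" {
  value = "${aws_instance.Namenode.public_dns}"
}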
To remove all resources, we run terraform destroy.
While using the above bash script is OK for a small project, we want to use a more advanced configuration tool if we are going to use Terraform in production.
There are many choices here, the main one being Chef, which Terraform supports natively; however, we can also use the rest of the major tools, like Ansible and Puppet, if they are installed on our local Terraform machine.
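As a sketch of what that could look like, a local-exec provisioner can hand a freshly created instance straight to Ansible; the playbook name here is an assumption:

# Runs on the machine executing terraform, not on the new instance.
provisioner "local-exec" {
  command = "ansible-playbook -i '${self.public_ip},' --private-key ~/.aws/ssh-key.pem hadoop.yml"
}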
Furthermore, Terraform suggests creating custom images using the Packer tool. Custom images are built from a base Linux (or other) image to which the required software has already been added; the result is packaged into a single image that is loaded onto the VM, ready to be used. This saves both time and bandwidth when creating the infrastructure.
As mentioned above, the goal is to make this more customizable with regard to the number of nodes created, the versions of Java, Hadoop and Spark used, and the instance type of the nodes.
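A first step in that direction would be a set of input variables, roughly like the sketch below (the names are illustrative), which the resources would then reference via expressions such as count = "${var.datanode_count}" and instance_type = "${var.instance_type}":

variable "datanode_count" {
  default = 2
}

variable "instance_type" {
  default = "t2.micro"
}

variable "hadoop_version" {
  default = "2.7.2"
}

variable "spark_version" {
  default = "2.1.1"
}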