This project enables anyone with an Amazon Web Services ([AWS](http://aws.amazon.com/)) account to quickly deploy a number of distributed technologies from their laptop or personal computer. The installation is fairly basic and should not be used for production. The purpose of this project is to enable fast prototyping of various distributed data pipelines and to help others explore distributed technologies without the headache of installing them.
We want to continue improving this tool by adding more features and other installations, so send us your pull requests or suggestions!
- Install Pegasus on your local machine
- Spin up your cluster on AWS
- Fetch AWS cluster DNS and hostname information
- Set up a newly provisioned AWS cluster
- Start installing!
- Terminate a cluster
- Deployment Pipelines
This will allow you to programmatically interface with your AWS account.
Clone the Pegasus project to your local computer and install Python dependencies (Python 2.7+ required)
$ git clone https://github.com/InsightDataScience/pegasus.git
$ cd pegasus
$ sudo pip install -e .
This installs the following Python packages:
- boto3
- moto
- schema
Add your AWS credentials to ~/.bash_profile and source it:
export AWS_ACCESS_KEY_ID=XXXX
export AWS_SECRET_ACCESS_KEY=XXXX
$ . ~/.bash_profile
Create a new .aws directory and a config file for your AWS cluster:
$ mkdir ~/.aws
$ nano ~/.aws/config
Add the following to ~/.aws/config based on the region of your AWS cluster:
[default]
region=us-east-1|us-west-1|us-west-2|eu-central-1|eu-west-1|ap-southeast-1|ap-southeast-2|ap-northeast-1|sa-east-1
You can test your boto3 AWS access by querying for the available regions for your AWS account:
$ python
>>> import boto3
>>> client = boto3.client('ec2')
>>> client.describe_regions()
{u'Regions': [{u'Endpoint': 'ec2.eu-west-1.amazonaws.com',
u'RegionName': 'eu-west-1'},
{u'Endpoint': 'ec2.ap-southeast-1.amazonaws.com',
u'RegionName': 'ap-southeast-1'},
{u'Endpoint': 'ec2.ap-southeast-2.amazonaws.com',
u'RegionName': 'ap-southeast-2'},
{u'Endpoint': 'ec2.eu-central-1.amazonaws.com',
u'RegionName': 'eu-central-1'},
{u'Endpoint': 'ec2.ap-northeast-1.amazonaws.com',
u'RegionName': 'ap-northeast-1'},
{u'Endpoint': 'ec2.us-east-1.amazonaws.com', u'RegionName': 'us-east-1'},
{u'Endpoint': 'ec2.sa-east-1.amazonaws.com', u'RegionName': 'sa-east-1'},
{u'Endpoint': 'ec2.us-west-1.amazonaws.com', u'RegionName': 'us-west-1'},
{u'Endpoint': 'ec2.us-west-2.amazonaws.com', u'RegionName': 'us-west-2'}],
'ResponseMetadata': {'HTTPStatusCode': 200,
'RequestId': '8949dfbe-63ab-4c0f-ba56-f9cd946de2ed'}}
Install [pytest](http://pytest.org/latest/) and run the tests from the top-level directory:
$ sudo pip install pytest
$ py.test
Currently all installations have only been tested on the Ubuntu Server 14.04 LTS (HVM) AMI.
- Use the Ubuntu Server 14.04 LTS (HVM), SSD Volume Type AMI
- To start, we recommend deploying a 4-node cluster
- Tag each instance with the same name through the AWS EC2 console (REQUIRED!!!)
- e.g. test-cluster (this will be the <cluster-name> in subsequent steps)
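If you prefer the command line to the EC2 console, the tagging step can also be done with the AWS CLI, assuming you have it installed (the instance IDs below are placeholders for your own):

$ aws ec2 create-tags --region us-west-2 \
    --resources i-0123456789abcdef0 i-0123456789abcdef1 \
    --tags Key=Name,Value=test-cluster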
Or use ec2spinup to deploy a cluster from the command line:
$ ./ec2spinup <instance-template-file>
The instance-template-file is simply a JSON file that ec2spinup uses. Within this file you must specify the following fields, as shown:
{
"purchase_type": "spot"|"on_demand",
"region": "us-east-1"|"us-west-1"|"us-west-2"|"eu-central-1"|"eu-west-1"|"ap-southeast-1"|"ap-southeast-2"|"ap-northeast-1"|"sa-east-1",
"subnet": "string",
"image": "string",
"price": "string",
"num_instances": 4,
"key_name": "string",
"security_group_ids": [
"string"
],
"instance_type": "t1.micro"|"m1.small"|"m1.medium"|"m1.large"|"m1.xlarge"|"m3.medium"|"m3.large"|"m3.xlarge"|"m3.2xlarge"|"m4.large"|"m4.xlarge"|"m4.2xlarge"|"m4.4xlarge"|"m4.10xlarge"|"t2.micro"|"t2.small"|"t2.medium"|"t2.large"|"m2.xlarge"|"m2.2xlarge"|"m2.4xlarge"|"cr1.8xlarge"|"i2.xlarge'|"i2.2xlarge'|"i2.4xlarge"|"i2.8xlarge"|"hi1.4xlarge"|"hs1.8xlarge"|"c1.medium"|"c1.xlarge"|"c3.large"|"c3.xlarge"|"c3.2xlarge"|"c3.4xlarge"|"c3.8xlarge"|"c4.large"|"c4.xlarge"|"c4.2xlarge"|"c4.4xlarge"|"c4.8xlarge"|"cc1.4xlarge"|"cc2.8xlarge"|"g2.2xlarge"|"cg1.4xlarge"|"r3.large"|"r3.xlarge"|"r3.2xlarge"|"r3.4xlarge"|"r3.8xlarge"|"d2.xlarge"|"d2.2xlarge"|"d2.4xlarge"|"d2.8xlarge",
"tag_name": "string"
"vol_size": 100
}
- purchase_type (string) - choose between on-demand or spot instances
- region (string) - AWS region you wish to spin your cluster in
- subnet (string) - the VPC subnet id e.g. "subnet-61c12804"
- image (string) - the AMI id you would like to spin the instance up with e.g. "ami-df6a8b9b"
- price (string) - spot price you would like to set. Ignored if purchase type is "on_demand" e.g. "0.25"
- num_instances (integer) - number of instances to deploy
- key_name (string) - the pem key name to be used for all instances e.g. "insight-cluster"
- security_group_ids (list) - a list of the security group ids
- (string) e.g. "sg-e9f17e8c"
- instance_type (string) - type of instances to deploy
- tag_name (string) - tag all your instances with this name e.g. "test-cluster"
- vol_size (integer) - size of the EBS volume in GB. Uses magnetic storage
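For reference, a hypothetical filled-in template might look like the following (the subnet, AMI, security group, and key name simply reuse the example values above and must be replaced with your own):

{
    "purchase_type": "on_demand",
    "region": "us-west-2",
    "subnet": "subnet-61c12804",
    "image": "ami-df6a8b9b",
    "price": "0.25",
    "num_instances": 4,
    "key_name": "insight-cluster",
    "security_group_ids": [
        "sg-e9f17e8c"
    ],
    "instance_type": "m4.large",
    "tag_name": "test-cluster",
    "vol_size": 100
}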
Once the nodes are up and running on AWS, we'll need to grab the DNS and hostname information about the cluster you wish to work with on your local machine. Make sure your .pem key has the proper permissions:
$ chmod 600 ~/.ssh/<your-aws-pem-key>
Always run ec2fetch to get the instance DNSs and hostnames before installation. DNSs and hostnames will be saved into the tmp folder under the specified cluster name as public_dns and hostnames, respectively:
$ ./ec2fetch <region> <cluster-name>
Under the tmp/<cluster-name> folder you will find the public_dns and hostnames files. The first record in each file is considered the Master node for any cluster technology that has a Master-Worker setup.
tmp/<cluster-name>/public_dns
ec2-52-32-227-84.us-west-2.compute.amazonaws.com **MASTER**
ec2-52-10-128-74.us-west-2.compute.amazonaws.com **WORKER1**
ec2-52-35-15-97.us-west-2.compute.amazonaws.com **WORKER2**
ec2-52-35-11-46.us-west-2.compute.amazonaws.com **WORKER3**
tmp/<cluster-name>/hostnames
ip-172-31-38-105 **MASTER**
ip-172-31-39-193 **WORKER1**
ip-172-31-42-254 **WORKER2**
ip-172-31-44-133 **WORKER3**
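As a quick sanity check, you can SSH into the Master node using the first entry in the public_dns file (a sketch, assuming the Ubuntu AMI's default ubuntu user and that the file lists one DNS name per line):

$ ssh -i ~/.ssh/<your-aws-pem-key> ubuntu@$(head -n 1 tmp/<cluster-name>/public_dns)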
Once the cluster IPs have been saved to the tmp folder, we can begin with installations.
If this is a newly provisioned AWS cluster, always start with the following 3 steps, in order, before proceeding with other installations:
- Environment/Packages on all machines - installs base packages for Python, Java, and Scala on all nodes in the cluster
- Passwordless SSH - enables passwordless SSH from your computer to the MASTER and the MASTER to all the WORKERS
- AWS Credentials - places AWS keys onto all machines
$ ./ec2install <cluster-name> environment
$ ./ec2install <cluster-name> ssh
$ ./ec2install <cluster-name> aws
Depending on what you decide to install in the environment step, the process could take anywhere from 10 to 30 minutes. If you wish to speed this up, we recommend baking an AMI with Packer and the environment installation script; this will significantly cut down the time to spin up a new cluster. Some examples are shown in the packer folder.
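The actual examples live in the packer folder; purely for orientation, a minimal Packer template for this kind of AMI bake might look like the sketch below (the source AMI, region, instance type, and provisioning script path are placeholders, not the repo's real values):

{
    "builders": [{
        "type": "amazon-ebs",
        "region": "us-west-2",
        "source_ami": "ami-df6a8b9b",
        "instance_type": "m4.large",
        "ssh_username": "ubuntu",
        "ami_name": "pegasus-environment-{{timestamp}}"
    }],
    "provisioners": [{
        "type": "shell",
        "script": "<path-to-environment-install-script>"
    }]
}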
When you use the ec2spinup script, you will need to change the instance JSON template to use the new AMI instead of the base Ubuntu 14.04 Trusty AMI.
$ ./ec2install <cluster-name> <technology>
The technology tag can be any of the following:
- cassandra (default v2.2.4)
- elasticsearch (default v2.1.0)
- Must have REGION and EC2_GROUP set as environment variables (see the example after this list)
- e.g. REGION=us-west-2
- e.g. EC2_GROUP=open
- flink (default v0.10.1 with hadoop v2.7 and scala v2.10)
- hadoop (default v2.7.1)
- hbase (default v1.1.2)
- hive (default v1.2.1)
- kafka (default v0.8.2.2 with scala v2.10)
- kibana (default v4.3.0)
- opscenter
- pig (default v0.15.0)
- presto (default v0.86)
- redis (default v3.0.6)
- spark (default v1.5.2 with hadoop v2.4+)
- storm (default v0.10.0)
- tachyon (default v0.8.2)
- zeppelin
- zookeeper (default v3.4.7)
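For example, to install elasticsearch you would first export the two environment variables mentioned above (values are illustrative, taken from the e.g. entries in the list):

$ export REGION=us-west-2
$ export EC2_GROUP=open
$ ./ec2install <cluster-name> elasticsearch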
If you wish to install a different version of these technologies, please go into the install/download_tech script and update the technology version and binary download URL. Additional technologies can be included in Pegasus by adding the technology version and URL to install/download_tech and writing the appropriate configurations in the config folder.
Tears down an on-demand or spot cluster on AWS:
$ ./ec2terminate <region> <cluster-name>
If you'd like to automate this deployment process completely, you can write your own scripts. An example is provided in the templates/pipelines/spark_hadoop.sh file.
It shows how we can spin up a 4-node cluster (ec2spinup) using the example.json instance template, grab the cluster information (ec2fetch), and install all the technologies (ec2install) in one script. We can deploy this cluster simply by running the following:
$ templates/pipelines/spark_hadoop.sh
#!/bin/bash
CLUSTER_NAME=test-cluster
REGION=us-west-2
./ec2spinup templates/instances/example.json
./ec2fetch $REGION $CLUSTER_NAME
./ec2install $CLUSTER_NAME environment
./ec2install $CLUSTER_NAME ssh
./ec2install $CLUSTER_NAME aws
./ec2install $CLUSTER_NAME hadoop
./ec2install $CLUSTER_NAME hive
./ec2install $CLUSTER_NAME pig
./ec2install $CLUSTER_NAME spark