Using SparkNow you can rapidly deploy, scale, and tear down Spark clusters on OpenStack. By deploying a SparkNow cluster you will get:
- A Spark cluster up and running
- An HDFS cluster to store your data
- A Jupyter notebook for interactive Spark tasks
- An Apache Zeppelin notebook for interactive Spark tasks
SparkNow uses Packer (https://www.packer.io/) and Terraform (https://www.terraform.io/) to build its OpenStack image and to provision the cluster. Please install both of them on your local machine, following the instructions on their websites.
To get SparkNow, just clone this repository.
git clone https://github.com/mcapuccini/SparkNow.git
To build SparkNow on your OpenStack tenancy, first export the following environment variables on your local machine.
export SPARK_DOWNLOAD_URL="http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz"
# you can change the download URL if you need another Spark version, everything should work
# as long as the binary is compatible with Hadoop 2.7
export PACKER_IMAGE_NAME="SparkNow_spark-2.1.0-hadoop2.7"
export PACKER_SOURCE_IMAGE_NAME="Ubuntu 14.04" #this may be different in your OpenStack tenancy
export PACKER_NETWORK="" # your OpenStack tenancy private network id
export PACKER_FLAVOR="" # the instance flavor that you want to use to build SparkNow
export PACKER_AVAILABILITY_ZONE="" # an availability zone name in your OpenStack tenancy
export PACKER_FLOATING_IP_POOL="" # a floating IP pool in your OpenStack tenancy
Then, access your OpenStack tenancy through the web interface, download the OpenStack RC file (Compute > Access & Security > API Access & Security > Download OpenStack RC FILE) and source it.
source someproject-openrc.sh # you will be asked to type your password
Finally, change into the SparkNow directory and run Packer to build the SparkNow image.
cd SparkNow/
packer build packer/build.json
If everything goes well, you will see the new image in the OpenStack web interface (Compute > Images).
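If you have the OpenStack command-line client installed (this assumes the python-openstackclient package, which is not required by SparkNow itself), you can also check for the new image from your terminal:

```shell
# List image names and look for the one exported as PACKER_IMAGE_NAME
# (requires python-openstackclient and a sourced OpenStack RC file)
openstack image list -f value -c Name | grep "$PACKER_IMAGE_NAME"
```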
First, create a conf.tfvars file, specifying some properties for the Spark cluster that you aim to deploy.
conf.tfvars
keypair_name = "your-keypair"
cluster_prefix = "SparkNow"
floating_ip_pool = ""
network_name = ""
SparkNow_image_name = "SparkNow_spark-2.1.0-hadoop2.7"
master_flavor_name = ""
worker_flavor_name = ""
worker_count = "3"
worker_volume_size = "20"
master_volume_size = "10"
- keypair_name: name of a key pair that you previously created, using the OpenStack web interface (Compute > Access & Security > Key Pairs).
- cluster_prefix: prefix for the resources that will be created in your OpenStack tenancy
- floating_ip_pool: a floating IP pool in your OpenStack tenancy
- network_name: an existing private network name (where the instances will be attached)
- SparkNow_image_name: the name of the SparkNow image that you built in the previous step
- master_flavor_name: the Spark master instance flavor
- worker_flavor_name: the Spark worker instance flavor
- worker_count: number of Spark workers to deploy
- worker_volume_size: the size of the worker instance volume in GB
- master_volume_size: the size of the master instance volume in GB
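Before deploying, you can optionally preview what Terraform would create. This is a standard Terraform command, sketched here assuming you run it from the terraform directory with your conf.tfvars in place:

```shell
# Dry-run: show the resources that 'terraform apply' would create,
# without actually creating anything
cd SparkNow/terraform
terraform plan -var-file=conf.tfvars
```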
Run Terraform to deploy a Spark cluster (assuming you have already sourced the OpenStack RC file).
cd SparkNow/terraform
terraform get # download terraform modules (required only the first time you deploy)
terraform apply -var-file=conf.tfvars # deploy the cluster
If everything goes well, something like the following will be printed:
Apply complete! Resources: 10 added, 0 changed, 0 destroyed.
The best way to access the UIs is through SSH port forwarding. We discourage opening these ports in the security group.
First, figure out the Spark master floating IP address by running the following command.
# assuming you are located into SparkNow/terraform
terraform show | grep floating_ip
Then, forward the UI ports using ssh.
ssh -N -f -L localhost:8080:localhost:8080 ubuntu@<master-floating-ip>
ssh -N -f -L localhost:4040:localhost:4040 ubuntu@<master-floating-ip>
ssh -N -f -L localhost:8888:localhost:8888 ubuntu@<master-floating-ip>
ssh -N -f -L localhost:9999:localhost:9999 ubuntu@<master-floating-ip>
ssh -N -f -L localhost:50070:localhost:50070 ubuntu@<master-floating-ip>
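If you prefer a single tunnel, the five forwards above can be combined into one ssh invocation. The sketch below builds the combined command and prints it so you can review it before running; <master-floating-ip> is a placeholder for the address found with terraform show.

```shell
# Build a single ssh command that opens all five tunnels at once.
# <master-floating-ip> is a placeholder: substitute the address printed
# by 'terraform show | grep floating_ip'.
MASTER_IP="<master-floating-ip>"
ARGS=""
for port in 8080 4040 8888 9999 50070; do
  ARGS="$ARGS -L $port:localhost:$port"
done
echo "ssh -N -f$ARGS ubuntu@$MASTER_IP"  # copy-paste the printed command to open the tunnels
```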
If everything went well, you should be able to access the UIs from your browser at the following addresses.
- Spark Master UI: http://localhost:8080
- Spark Driver UI, of the currently running application: http://localhost:4040
- Jupyter: http://localhost:8888
- Zeppelin: http://localhost:9999
- HDFS: http://localhost:50070
In a SparkNow cluster, the HDFS namenode is reachable at hdfs://<cluster_prefix>-master.node.consul:9000.
To copy data into HDFS, you can ssh into the SparkNow master node, or forward port 9000 over ssh, and use the Hadoop CLI.
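For example, from the master node the Hadoop CLI can be used as below. This is a sketch: mydata.csv is a placeholder file name, and the prefix is assumed to be the default SparkNow from conf.tfvars.

```shell
# A sketch of copying a local file into HDFS from the master node.
# PREFIX must match the cluster_prefix set in conf.tfvars;
# mydata.csv is a placeholder for your own file.
PREFIX="SparkNow"
HDFS_URI="hdfs://${PREFIX}-master.node.consul:9000"
hadoop fs -mkdir -p "$HDFS_URI/ubuntu/data"       # /ubuntu is writable by the ubuntu user
hadoop fs -put mydata.csv "$HDFS_URI/ubuntu/data/"
hadoop fs -ls "$HDFS_URI/ubuntu/data"
```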
Finally, there are some preconfigured directories in a SparkNow HDFS cluster:
- /ubuntu writable by the ubuntu user
- /jupyter writable by the jovyan user (you can write here when running interactive Spark applications via Jupyter)
- /root writable by the root user (can be used when running Zeppelin notes)
To scale the number of workers in your cluster, open the conf.tfvars file and change the worker_count property.
Then, apply the changes with Terraform.
# assuming you are located into SparkNow/terraform
terraform apply -var-file=conf.tfvars
Terraform will apply only the delta, without tearing down and recreating the whole cluster.
To destroy the cluster and release all of its resources, run the following command.
# assuming you are located into SparkNow/terraform
terraform destroy -var-file=conf.tfvars