ovh-spark-submit

The client software for sending Spark jobs to the Spark service of OVH


Analytics Data Compute

Welcome to the getting started guide of OVH Analytics Data Compute. This guide will help you understand the core concepts behind Analytics Data Compute and how to run your first Spark cluster job using this service.

What is Spark?

Apache Spark is a big data computing platform that is much faster than competitors like Hadoop MapReduce, thanks to features such as in-memory processing and lazy evaluation. It can run locally on a single machine or on a cluster of computers to distribute its tasks.

What is Analytics Data Compute?

Analytics Data Compute provides you with a ready-to-use computing cluster based on Apache Spark. You don't need to worry about creating and managing a network of machines to run your Spark job on a cluster of computers.

Using the ovh-spark-submit command line, you just send your code and define how many CPU cores you need, and Analytics Data Compute takes care of everything else. The command line options are almost the same as the original spark-submit command line.

When you start a job, the Analytics Data Compute engine creates a Spark cluster using virtual machines in the OVH Public Cloud and submits the job to that cluster. Once your job finishes, the cluster is destroyed and the results are sent back to you. You are charged only for the virtual machines running during the computation time.

The created cluster is dedicated to a single user and a single job, and it is deleted once the job finishes, which helps protect the security and privacy of your data.

Through the command line options, you can choose which version of Spark to use, as well as different formats and sources of data for input and output.

Getting started

Create an OVH Account

Before starting to use Analytics Data Compute, make sure you have an ovh.com account (NIC). If needed, go to ovh.com and select "Create Account".

Create a Public Cloud Project

In order to spawn an Apache Spark cluster automatically, you need access to an OVH Public Cloud project. This project will contain all the storage and compute required to run your Apache Spark jobs.

Create a new one by following this tutorial. If you have a voucher, you can activate it during this step. If you browse the project, the "project managed" tab shows all the details about your consumption.

Create an Openstack user account and openrc.sh

After creating your OVH account and project, you need to create an Openstack user account. You can find a tutorial in this link, then log in to your Horizon dashboard. The Horizon dashboard has a link to download your Openstack credentials as a bash file, such as openrc.sh, and you need to source this file to set the required environment variables. It is better to download the openrc.sh file in version 3, but if you want to use version 2, check the OS_AUTH_URL in the openrc.sh file and add /v2.0 to the end if it isn't there. In that case the OS_AUTH_URL should have the value: OS_AUTH_URL=https://auth.cloud.ovh.net/v2.0
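
A quick way to check which identity version your openrc.sh points to is to grep for OS_AUTH_URL (the value shown below is the version 2 endpoint mentioned above; a version 3 file typically ends with /v3 instead):

$ grep OS_AUTH_URL openrc.sh
export OS_AUTH_URL=https://auth.cloud.ovh.net/v2.0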

To load this file after downloading it, you can run:

$ source openrc.sh

By sourcing this file you will have all the Openstack credentials that ovh-spark-submit requires in your environment variables. For better performance it is recommended to use the "GRA5" region; be aware that the SBG3 region is not supported yet. To set the region, open openrc.sh in any text editor and set OS_REGION_NAME="GRA5" (or any other region you like, except SBG3), then source openrc.sh again to update your environment variables.
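
If you prefer not to edit the file, a minimal sketch that overrides the region for your current shell session only (assuming a bash-compatible shell):

$ export OS_REGION_NAME="GRA5"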

Download ovh-spark-submit CLI program:

You can download ovh-spark-submit CLI program from these addresses:

for Mac: https://repository.dataconvergence.ovh.com/repository/binary/ovh-spark-submit/mac/ovh-spark-submit

for Linux: https://repository.dataconvergence.ovh.com/repository/binary/ovh-spark-submit/linux/ovh-spark-submit

If the downloader added an extension to the file (for example, Safari adds .dms to files without an extension), remove it. You can also download the CLI using the wget or curl commands.
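
For example, if Safari saved the binary as ovh-spark-submit.dms, you can simply rename it:

$ mv ovh-spark-submit.dms ovh-spark-submit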

for Mac:

$ curl -o ovh-spark-submit https://repository.dataconvergence.ovh.com/repository/binary/ovh-spark-submit/mac/ovh-spark-submit

for Linux:

$ curl -o ovh-spark-submit https://repository.dataconvergence.ovh.com/repository/binary/ovh-spark-submit/linux/ovh-spark-submit

Then run this command to make the downloaded file executable:

$ chmod +x ovh-spark-submit

You can also build ovh-spark-submit instead of downloading it, by cloning the code and running make all.
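
A minimal build sketch, assuming the repository URL below is correct and a Go toolchain plus make are installed:

$ git clone https://github.com/ovh/ovh-spark-submit.git
$ cd ovh-spark-submit
$ make all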

Run your spark job

usage:

$ ./ovh-spark-submit [options] <jar file> <arguments>

The ovh-spark-submit command line options are almost the same as the original spark-submit, without --deploy-mode and --master. For example:

$ ./ovh-spark-submit \
   --class org.apache.spark.examples.SparkPi \
   spark-examples_2.11-2.4.0.jar 1000

This is the minimum command line. In this case it will create a cluster with 1 master and 1 worker with 4 cores, and will install the latest Spark version. The program will then run the SparkPi example and show the result. (You can find the spark-examples_2.11-2.4.0.jar file inside the official Apache Spark package folder.)

You can specify the version of Spark and the total number of cores as well. For example:

./ovh-spark-submit \
   --class org.apache.spark.examples.SparkPi \
   --name Simulation01 \
   --version 2.4.0 \
   --total-executor-cores 8 \
   spark-examples_2.11-2.4.0.jar 1000

After running this command, your jar file will be uploaded to the Swift storage of your Openstack project. Then a cluster in the OVH Public Cloud will be created, and after the computation finishes it will be automatically deleted.

Run your job in a private network in vRack

You can create your cluster in a private network in vRack, which is more secure than a public network. To use this feature, add the option --deployer vrackfloatingip to your command line. For example:

./ovh-spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --name Simulation01 \
  --version 2.4.0 \
  --total-executor-cores 10 \
  --executor-memory 10G \
  --deployer vrackfloatingip \
  swift://jar/spark-examples_2.11-2.4.0.jar  1000

To use this feature you need to request a floating IP from OVH; it is currently available only in the GRA5 region.

Addressing Jar file from Openstack swift storage

When you run the ovh-spark-submit command line with a jar file from your local machine, the jar file is uploaded to your Swift storage in a container named "jar", and the Spark cluster reads the jar file from this container. It is also possible to put your jar file in your Swift storage first and use its address in the ovh-spark-submit command line. To do so, add "swift://" at the beginning of the address, followed by the container name, the folder path, and the file name of your jar file. Be aware that names of containers and files in Swift are case sensitive, and the jar file in Swift must be in the same region that you set in your openrc.sh file or OS_REGION_NAME environment variable. This feature is especially useful when you have a big jar file and a slow internet connection, and you run the cluster several times without changing the jar file, so you don't want to upload the same jar on every run. For example (see the upload sketch after this example):

./ovh-spark-submit \
   --class org.apache.spark.examples.SparkPi \
   --name Simulation01 \
   --version 2.4.0 \
   --total-executor-cores 8 \
   swift://jar/spark-examples_2.11-2.4.0.jar 1000
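
If you want to pre-upload the jar yourself, here is a minimal sketch using the standard OpenStack client (assuming python-openstackclient is installed and openrc.sh has been sourced; the container and file names match the example above):

$ openstack container create jar
$ openstack object create jar spark-examples_2.11-2.4.0.jar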

Log files

After the Spark job finishes, the log file is saved in your Swift storage, in the same region that you set in your openrc.sh file or OS_REGION_NAME environment variable, in the "SparkLogs" container. To download the logs and results, go to the Horizon dashboard at https://horizon.cloud.ovh.net, then to "Object Store" -> Containers -> SparkLogs, where you will find the log folders organized by date and time of the Spark job.

A copy of the log file is also saved on your local machine, in the "SparkLogs" folder in your home directory.
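
If you prefer the command line over the Horizon dashboard, a minimal sketch using the standard OpenStack client (assuming python-openstackclient is installed and openrc.sh has been sourced; the object path is illustrative):

$ openstack object list SparkLogs
$ openstack object save SparkLogs <path/to/log/file>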

You can also find the addresses of the official Spark master dashboard and the Spark UI in this log, and open them separately if you like. The master dashboard is on port 8080 of the master IP (for example: http://1.2.3.4:8080) and the Spark UI is on port 4040 (for example: http://1.2.3.4:4040). There you can see the stdout and stderr of all workers and apps, plus more details and information about your cluster.

Pro tip #1: How to calculate your billing?

To create the cluster we use the b2-15 flavor, which means each worker node has 4 cores and 15 GB of memory. For example, if you add the option --total-executor-cores 8, you will need 8/4 = 2 worker nodes plus one master node, so 3 nodes in total. Based on the execution time, you can then calculate the cost of the service for each job from the price of b2-15 on the OVH pricing website (for example, here is the FR pricing). Be aware that billing is calculated on a per-hour basis: if you use a cluster for 5 minutes, it is counted as 1 hour.
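
A back-of-the-envelope sketch of that calculation in bash (the core count is just the example above; the per-hour price still has to be looked up on the pricing page):

$ CORES=8
$ WORKERS=$(( (CORES + 3) / 4 ))   # round up to whole b2-15 workers (4 cores each)
$ NODES=$(( WORKERS + 1 ))         # plus one master node
$ echo "$NODES b2-15 VMs, each billed per started hour"
3 b2-15 VMs, each billed per started hour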

Pro tip #2: Want to keep your cluster?

There is an option that lets you create a cluster, keep it, and submit as many jobs as you like. You just need to add the --keep-infra option to your command line. However, you need to delete the cluster yourself when you no longer need it. Be careful: if you forget to delete the cluster, you will keep being charged for the VMs in your project. After running the command line, you will find the address of the Spark master in the output log.

./ovh-spark-submit \
   --class org.apache.spark.examples.SparkPi \
   --name Simulation01 \
   --version 2.4.0 \
   --total-executor-cores 8 \
   --keep-infra \
   swift://jar/spark-examples_2.11-2.4.0.jar 1000

When you don't need the cluster anymore, you can go to the ovh.com manager or your Openstack Horizon dashboard to delete the cluster VMs.
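
If you prefer the command line, a minimal sketch with the standard OpenStack client (assuming python-openstackclient is installed and openrc.sh has been sourced; the server name is illustrative):

$ openstack server list
$ openstack server delete <cluster-vm-name-or-id>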

Feedback

Please send us your questions, feedback and suggestions to improve the service:

Request Free Access

Register on the OVH Analytics Data Compute lab page to get a voucher for Spark cluster as a service and test it for free.

Study more: