
Sahara aims to give users a simple means of provisioning a data-intensive cluster (Hadoop, Spark) by specifying a few parameters such as software versions, cluster topology, and node hardware details.


OpenStack Data Processing ("Sahara") project (Spark experimental fork)

This repository is a fork of the main OpenStack Sahara repo. The fork focuses mainly on Spark plugin development, with bug fixes, optimizations and updates that come out of the work of the Bigfoot project: http://bigfootproject.eu/

To use this version of Sahara, you will need images created with this fork of the image builder: https://github.com/bigfootproject/sahara-image-elements

The main changes from the standard Sahara are:

  • Support for more recent Spark versions; Spark 1.5.0 is currently supported
  • Spark Notebook (https://github.com/andypetrella/spark-notebook) support. You can create a Spark cluster with notebooks already available and configured. Like IPython, but with Spark! The Spark Notebook is listed in the processes list when creating a new node group template. You can have at most one notebook process per cluster. Once the cluster has started, a link to the notebook can be found at the bottom of the cluster information page.
  • Relaxed checks to let the user create HDFS-only and Spark-only clusters: this allows the concept of storage-only clusters, relatively static, and compute-only clusters that come and go.
  • Spark clusters can be configured with a default HDFS location
  • Data locality: when the cluster-level "HDFS storage cluster" option is set, the compute cluster is co-located on the same physical hosts that run the datanodes of that storage cluster
  • Swift data source for Spark, with fixes for Spark 1.3
  • Smaller fixes and workarounds for bugs that are still waiting for a proper fix in upstream Sahara
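
The notebook limit and the relaxed HDFS-only / Spark-only checks described above can be pictured as a small validation pass over the node groups of a cluster. The sketch below is only an illustration under stated assumptions, not the code used in this fork: the process names (`notebook`, `namenode`, `datanode`, `master`, `slave`), the dictionary layout of a node group, and the function names are all hypothetical.

```python
from collections import Counter

# Hypothetical process names; the names this fork actually uses may differ.
NOTEBOOK = "notebook"
HDFS_PROCESSES = {"namenode", "datanode"}
SPARK_PROCESSES = {"master", "slave"}


def count_processes(node_groups):
    """Count process instances across all node groups.

    Each node group is assumed to be a dict with a 'count' of instances
    and a 'node_processes' list that runs on every instance.
    """
    totals = Counter()
    for group in node_groups:
        for process in group["node_processes"]:
            totals[process] += group["count"]
    return totals


def validate_cluster(node_groups):
    """Return a list of problems; an empty list means the layout passes."""
    totals = count_processes(node_groups)
    problems = []

    # At most one notebook process per cluster.
    if totals[NOTEBOOK] > 1:
        problems.append("at most one notebook process is allowed per cluster")

    # Relaxed check: HDFS-only and Spark-only layouts are both accepted,
    # only a cluster with neither kind of process is rejected.
    has_hdfs = any(totals[p] > 0 for p in HDFS_PROCESSES)
    has_spark = any(totals[p] > 0 for p in SPARK_PROCESSES)
    if not (has_hdfs or has_spark):
        problems.append("cluster must run HDFS or Spark processes")

    return problems
```

With these assumptions, a compute-only layout such as `[{"count": 1, "node_processes": ["master", "notebook"]}, {"count": 3, "node_processes": ["slave"]}]` passes, while a node group that places the notebook process on two instances is reported as a problem.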

This repository is periodically merged with the upstream Sahara master branch.

Contact us

This fork of Sahara is developed and maintained by the Distributed Systems Group at Eurecom (http://www.eurecom.fr).

License

Apache License Version 2.0 http://www.apache.org/licenses/LICENSE-2.0