Sparkling Water provides H2O functionality inside a Spark cluster.

Primary language: Scala. License: Apache License 2.0.

Sparkling Water

Join the chat at https://gitter.im/h2oai/sparkling-water

Sparkling Water integrates H2O's fast, scalable machine learning engine with Spark.

Requirements

  • Linux or OS X (Windows support is pending)
  • Java 7
  • Spark 1.3.1
    • SPARK_HOME shell variable must point to your local Spark installation

Contributing

Look at our list of JIRA tasks for new contributors or send your idea to support@h2o.ai.


Issues

To report issues, please use our JIRA page at http://jira.h2o.ai/.


Mailing list

Follow our H2O Stream.


Downloads of binaries


Making a build

Use the provided gradlew wrapper to build the project:

./gradlew build

To avoid running tests, use the -x test option.
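
As a minimal sketch, the complete command with tests skipped could look like the following. The `GRADLE_ARGS` variable and the guard around the call are illustrative assumptions, added only so the snippet fails gracefully when run outside the source root; they are not part of the project's tooling.

```shell
# -x is Gradle's "exclude task" flag, so "-x test" runs the build without the test task.
GRADLE_ARGS="build -x test"   # hypothetical helper variable, for illustration only

if [ -x ./gradlew ]; then
  # Intentionally unquoted so the three words are passed as separate arguments.
  ./gradlew $GRADLE_ARGS
else
  echo "gradlew not found; run this from the Sparkling Water source root" >&2
fi
```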


Sparkling shell

The Sparkling shell provides a regular Spark shell that supports creation of an H2O cloud and execution of H2O algorithms.

First, build a package containing Sparkling Water:

./gradlew assemble

Configure the location of the Spark cluster:

export SPARK_HOME="/path/to/spark/installation"
export MASTER="local-cluster[3,2,1024]"

In this case, local-cluster[3,2,1024] points to an embedded cluster of 3 worker nodes, each with 2 cores and 1 GB (1024 MB) of memory.
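
To make the three fields of the master string explicit, the sketch below splits it with plain shell parameter expansion. It is only an illustration of how to read the value; Spark parses the string itself, and the variable names `workers`, `cores`, and `memory_mb` are hypothetical.

```shell
MASTER="local-cluster[3,2,1024]"

spec=${MASTER#*\[}        # drop everything up to the first "[": 3,2,1024]
spec=${spec%\]}           # drop the trailing "]": 3,2,1024

workers=${spec%%,*}       # first field: number of worker nodes
rest=${spec#*,}           # remaining fields: 2,1024
cores=${rest%%,*}         # second field: cores per worker
memory_mb=${rest#*,}      # third field: memory per worker, in MB

echo "workers=$workers cores_per_worker=$cores memory_per_worker_mb=$memory_mb"
# prints: workers=3 cores_per_worker=2 memory_per_worker_mb=1024
```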

Then run Sparkling Shell:

bin/sparkling-shell

Sparkling Shell accepts common Spark Shell arguments. For example, to increase the memory allocated to each executor, use the spark.executor.memory parameter:

bin/sparkling-shell --conf "spark.executor.memory=4g"


Running examples

Build a package that can be submitted to a Spark cluster:

./gradlew assemble

Set the configuration of the demo Spark cluster (for example, local-cluster[3,2,1024]):

export SPARK_HOME="/path/to/spark/installation"
export MASTER="local-cluster[3,2,1024]"

In this example, the master setting local-cluster[3,2,1024] creates an embedded cluster of 3 workers, each with 2 cores and 1 GB of memory.

Then run the example:

bin/run-example.sh

For more details about the demo, please see the README.md file in the examples directory.


Additional Examples

You can find more examples in the examples folder.


Docker Support

See docker/README.md to learn about Docker support.


FAQ

  • Where do I find the Spark logs?

Spark logs are located in the directory $SPARK_HOME/work/app-<AppName>, where <AppName> is the name of your application.
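
A quick way to see which application directories exist is to list them, newest first. This is a minimal sketch that assumes SPARK_HOME is already set as described in Requirements; the `work_dir` variable and the fallback path are placeholders for illustration.

```shell
# Fall back to the placeholder path used elsewhere in this README if SPARK_HOME is unset.
work_dir="${SPARK_HOME:-/path/to/spark/installation}/work"

if [ -d "$work_dir" ]; then
  # -d lists the directories themselves, -t sorts them by modification time (newest first).
  ls -dt "$work_dir"/app-* 2>/dev/null || echo "No application directories in $work_dir yet"
else
  echo "No work directory found at $work_dir" >&2
fi
```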

  • Spark is very slow during initialization or H2O does not form a cluster. What should I do?

Configure the Spark variable SPARK_LOCAL_IP. For example:

export SPARK_LOCAL_IP='127.0.0.1'

  • How do I increase the amount of memory assigned to the Spark executors in Sparkling Shell?

Sparkling Shell accepts common Spark Shell arguments. For example, to increase the amount of memory allocated to each executor, use the spark.executor.memory parameter:

bin/sparkling-shell --conf "spark.executor.memory=4g"

  • How do I change the base port H2O uses to find available ports?

    H2O accepts the spark.ext.h2o.port.base parameter via Spark configuration properties: bin/sparkling-shell --conf "spark.ext.h2o.port.base=13431". For a complete list of configuration options, refer to the Devel Documentation.
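
Because Spark Shell accepts the `--conf` flag more than once, the executor memory and port base settings from the FAQ entries above can presumably be passed together. The sketch below assumes Sparkling Shell forwards repeated `--conf` flags unchanged; the `SPARKLING_OPTS` variable and the guard are hypothetical additions for illustration.

```shell
# Both property names and values are the ones used in the FAQ answers above.
SPARKLING_OPTS='--conf spark.executor.memory=4g --conf spark.ext.h2o.port.base=13431'

if [ -x bin/sparkling-shell ]; then
  # Intentionally unquoted so each flag and value is passed as a separate argument.
  bin/sparkling-shell $SPARKLING_OPTS
else
  echo "bin/sparkling-shell not found; run this from the Sparkling Water source root" >&2
fi
```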

  • How do I use Sparkling Shell to launch a Scala test.script that I created?

Sparkling Shell accepts common Spark Shell arguments. To pass your script, use the -i option of Spark Shell:

bin/sparkling-shell -i test.script

Diagram of Sparkling Water on YARN

The following illustration depicts the topology of a three-node Sparkling Water cluster running on YARN:

![Diagram](images/Sparkling Water cluster.png)