DEPRECATED! Use https://github.com/h2oai/sparkling-water repository! H2O and Spark interoperability based on Tachyon.

Primary language: Scala. License: Apache License 2.0.

This repository is DEPRECATED! Please use the new Sparkling Water repository: https://github.com/h2oai/sparkling-water


h2o-sparkling

Makes interoperability between H2O and Spark trivial.

Requirements

  • Spark 1.0.0 (SQL component required)
  • Tachyon 0.4.1
  • Java 1.6+
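
The Java requirement can be checked before building. The helper below is a minimal sketch and not part of the project: the function name `java_version_ok` is hypothetical, it assumes bash, and it only inspects the first two numeric components of the version string.

```shell
# Hypothetical helper (not part of h2o-sparkling).
# Returns 0 when a Java version string such as "1.7.0_55" or "11.0.2"
# satisfies the "Java 1.6+" requirement, and 1 otherwise.
java_version_ok() {
  local version="$1" major minor
  major="${version%%.*}"                          # text before the first dot
  minor="${version#*.}"; minor="${minor%%.*}"     # text between first and second dot
  # Reject anything that is not purely numeric (e.g. empty input or "9-ea").
  case "$major$minor" in ''|*[!0-9]*) return 1 ;; esac
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 6 ]; }
}

# Example (assumes `java` is on PATH and prints a quoted version string):
#   v="$(java -version 2>&1 | sed -n 's/.*version "\(.*\)".*/\1/p')"
#   java_version_ok "$v" && echo "Java OK"
```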

Installation

  • First, compile the latest version of Spark with the SQL component
git clone spark
cd spark
sbt/sbt assembly publish-local
  • Then build the h2o-sparkling-demo assembly
cd h2o-sparkling-demo
sbt assembly

Note: The assembly step is required because the demo acts as a Spark driver and sends Spark a jar file containing the implementation of the job.

Run demo

Run local version

For this run no Spark cloud is required:

  • Start an H2O instance that embeds the Spark driver
cd h2o-sparkling-demo
sbt "run --local"

Run distributed version

For this run a Spark cloud is required:

  • Start a master and one worker on the local node
cd spark/sbin
./start-master.sh
./start-slave.sh 1 "spark://localhost:7077"
  • Assemble the h2o-sparkling-demo jar file, which the driver sends to the Spark cloud, then run the demo
cd h2o-sparkling-demo
sbt assembly
sbt "run --remote"
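
The distributed run above can be collected into one function. This is a sketch under stated assumptions, not a script shipped with the project: the variables `SPARK_HOME`, `DEMO_HOME`, `SPARK_MASTER`, and `DRY_RUN` and the function names are hypothetical, and bash is assumed. With `DRY_RUN=1` the commands are printed instead of executed, which makes it easy to review the sequence first.

```shell
# Hypothetical wrapper (not part of h2o-sparkling) around the commands listed above.

# run_step DIR CMD [ARGS...]: run CMD inside DIR, or print it when DRY_RUN=1.
run_step() {
  local dir="$1"; shift
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '(cd %q &&' "$dir"
    printf ' %q' "$@"
    printf ')\n'
  else
    (cd "$dir" && "$@")
  fi
}

run_distributed_demo() {
  # Assumed locations; the defaults match the relative paths used in this README.
  local spark_home="${SPARK_HOME:-spark}"
  local demo_home="${DEMO_HOME:-h2o-sparkling-demo}"
  local master="${SPARK_MASTER:-spark://localhost:7077}"

  # Each step runs only if the previous one succeeded.
  run_step "$spark_home/sbin" ./start-master.sh &&
  run_step "$spark_home/sbin" ./start-slave.sh 1 "$master" &&
  run_step "$demo_home" sbt assembly &&
  run_step "$demo_home" sbt "run --remote"
}
```

Calling `DRY_RUN=1 run_distributed_demo` prints the four steps in order; calling `run_distributed_demo` without `DRY_RUN` executes them and stops at the first failure.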

Run an additional H2O node

cd h2o-sparkling-demo
sbt runH2O

Select different RDD2Frame extractor

The demo currently supports three extractors:

  • dummy - pulls all data into the driver and creates a frame there
  • file - asks Spark to save the RDD as a file on the local filesystem, then parses the stored file
  • tachyon - asks Spark to save the RDD to the Tachyon filesystem, then H2O loads the file from Tachyon

The extractor is selected with the --extractor command line parameter, e.g., --extractor=tachyon
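
Since a mistyped extractor name is easy to miss inside a quoted sbt argument, the argument string can be built and validated first. The helper below is a sketch only: `demo_run_args` is a hypothetical name, bash is assumed, and combining `--local` with `--extractor` is an assumption (this README only shows `--extractor` together with `--remote`).

```shell
# Hypothetical helper (not part of h2o-sparkling).
# demo_run_args MODE EXTRACTOR prints the argument string for sbt,
# or fails when MODE or EXTRACTOR is not one of the documented values.
demo_run_args() {
  local mode="$1" extractor="$2"
  case "$mode" in
    local|remote) ;;
    *) echo "unknown mode: $mode (expected local or remote)" >&2; return 1 ;;
  esac
  case "$extractor" in
    dummy|file|tachyon) ;;
    *) echo "unknown extractor: $extractor (expected dummy, file or tachyon)" >&2; return 1 ;;
  esac
  printf 'run --%s --extractor=%s\n' "$mode" "$extractor"
}

# Example:
#   args="$(demo_run_args remote tachyon)" && sbt "$args"
```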

Running with Tachyon

  • Start Tachyon
cd tachyon/bin
./tachyon-start.sh

Example

Run the demo with the Tachyon-based extractor against a remote Spark cloud:

cd h2o-sparkling-demo
sbt assembly
sbt "run --remote --extractor=tachyon"

Run the airlines demo with the file-based extractor against a remote Spark cloud whose master runs at a non-default location:

sbt "run --remote --sparkMaster=spark://localhost:17077 --noshutdown --demo=airlines --extractor=file"

Doc