Elephas brings deep learning with Keras to Apache Spark. Elephas intends to keep the simplicity and usability of Keras, allowing for fast prototyping of distributed models to run on large data sets.
ἐλέφας is Greek for ivory, and an accompanying project to κέρας, meaning horn. If mentioning this seems weird, like a bad dream, the Keras documentation will confirm that it actually is. Elephas also means elephant, as in the stuffed yellow elephant.
For now, Elephas is a straightforward parallelization of Keras using Spark's RDDs and DataFrames. Models are initialized on the driver, then serialized and shipped to workers. Spark workers deserialize the model, train on their chunk of the data, and send their parameters back to the driver. The "master" model is updated by averaging the worker parameters, as sketched below.
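The averaging step itself is simple; here is a minimal sketch of the idea (an illustration, not Elephas's actual implementation), where each worker contributes the parameter list produced by its Keras model's get_weights():

def average_parameters(worker_parameters):
    # worker_parameters: one list of numpy weight arrays per worker,
    # as returned by a Keras model's get_weights().
    num_workers = len(worker_parameters)
    # Average corresponding weight arrays layer by layer.
    return [sum(layer_weights) / num_workers
            for layer_weights in zip(*worker_parameters)]

# E.g. with two workers:
# master_weights = average_parameters([model_a.get_weights(), model_b.get_weights()])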
Install elephas from PyPI with
pip install elephas
A quick way to install Spark locally is to use Homebrew on macOS or Linuxbrew on Linux:
brew install apache-spark
If this is not an option, follow the instructions in the Spark download section. After installing both Keras and Spark, training a model is done as follows:
- Create a local pyspark context
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName('Elephas_App').setMaster('local[8]')
sc = SparkContext(conf=conf)
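To run against a real cluster rather than eight local cores, point the master at the cluster instead (the URL below is a placeholder, not a real address):

conf = SparkConf().setAppName('Elephas_App').setMaster('spark://master-host:7077')  # placeholder cluster URL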
- Define and compile a Keras model
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import RMSprop

model = Sequential()
model.add(Dense(784, 128))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(128, 128))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(128, 10))
model.add(Activation('softmax'))
rms = RMSprop()
model.compile(loss='categorical_crossentropy', optimizer=rms)
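The next step assumes X_train and Y_train are plain numpy arrays matching the model above (784 input features, 10 classes). As a sketch of where they might come from, e.g. MNIST:

from keras.datasets import mnist
from keras.utils import np_utils

(X_train, y_train), (X_test, y_test) = mnist.load_data()
# Flatten the 28x28 images into 784-dimensional float vectors in [0, 1].
X_train = X_train.reshape(60000, 784).astype('float32') / 255
X_test = X_test.reshape(10000, 784).astype('float32') / 255
# One-hot encode the labels for the 10-way softmax.
Y_train = np_utils.to_categorical(y_train, 10)
Y_test = np_utils.to_categorical(y_test, 10)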
- Create an RDD from numpy arrays
from elephas.utils.rdd_utils import to_simple_rdd
rdd = to_simple_rdd(sc, X_train, Y_train)
- Define a SparkModel from Spark context and Keras model, then simply train it
from elephas.spark_model import SparkModel
spark_model = SparkModel(sc, model)
spark_model.train(rdd, nb_epoch=20, batch_size=32, verbose=0, validation_split=0.1, num_workers=8)
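After training, the averaged parameters live in the master model on the driver. As a quick local sanity check with plain Keras, assuming the SparkModel exposes its Keras model via a master_network attribute (worth verifying against your Elephas version) and reusing the X_test and Y_test arrays from the MNIST sketch above:

# Evaluate the averaged master network locally on held-out data.
score = spark_model.master_network.evaluate(X_test, Y_test, batch_size=32)
print('Test loss:', score)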
- Run your script using spark-submit
spark-submit --driver-memory 1G ./your_script.py
Increasing the driver memory even further may be necessary, as the full set of parameters in a network can be very large and collecting them on the driver eats up a lot of resources.
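For instance, to give the driver more headroom (4G being an arbitrary illustrative value):
spark-submit --driver-memory 4G ./your_script.py
See the examples folder for a few working examples.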
Following up on the last example, to create an RDD of LabeledPoints for supervised training from pairs of numpy arrays, use
from elephas.utils.rdd_utils import to_labeled_point
lp_rdd = to_labeled_point(sc, X_train, Y_train, categorical=True)
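Each element of the resulting RDD is a Spark MLlib LabeledPoint, i.e. a label paired with a feature vector. Roughly, and just as an illustration of the structure (not the exact output of to_labeled_point):

from pyspark.mllib.regression import LabeledPoint

# A class label (here 2.0) paired with a 784-dimensional feature vector.
point = LabeledPoint(2.0, X_train[0])
print(point)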
Training on such a LabeledPoint RDD is very similar to what we've seen already:
from elephas.spark_model import SparkMLlibModel
spark_model = SparkMLlibModel(sc, model)
spark_model.train(lp_rdd, nb_epoch=20, batch_size=32, verbose=0, validation_split=0.1, num_workers=8, categorical=True, nb_classes=nb_classes)
Beyond basic data-parallel training, Elephas targets integration with Spark's machine learning libraries:
- Integration with Spark MLlib:
  - Train and evaluate on LabeledPoint RDDs
  - Make Elephas models MLlib algorithms
- Integration with Spark ML:
  - Use DataFrames for training
  - Make models ML pipeline components