MPI-oriented extension of the Spark computational model

SPARK-MPI

The project addresses the impedance mismatch between data-intensive and compute-intensive ecosystems by extending the Spark platform with an MPI-based inter-worker communication model for supporting HPC applications. The rationale and a general description are provided in the arXiv paper, the NYSDS paper, and the Spark Summit East'17 talk (located in the doc directory).

Conceptual Demo

The Spark-MPI approach is illustrated within the context of a conceptual demo (located in the examples/spark-mpi directory) which runs the MPI Allreduce method on the Spark workers.
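
To make the Allreduce step concrete, the following is a minimal sketch of its semantics in plain Python, without MPI or Spark: each worker contributes a local buffer, and every worker receives the element-wise sum of all buffers. It is not the demo code from examples/spark-mpi. The function name `allreduce_sum` is hypothetical, and a real run would go through an MPI wrapper (mpi4py, for instance, exposes the operation as `Comm.Allreduce`).

```python
from typing import List


def allreduce_sum(local_buffers: List[List[float]]) -> List[List[float]]:
    """Simulate a sum Allreduce: every rank ends up with the element-wise total.

    local_buffers[i] stands in for the send buffer of rank i (an assumption
    of this sketch; a real MPI job holds one buffer per process).
    """
    if not local_buffers:
        return []
    length = len(local_buffers[0])
    if any(len(buf) != length for buf in local_buffers):
        raise ValueError("all ranks must contribute buffers of equal length")
    # Reduce: sum the i-th element across all ranks.
    total = [sum(column) for column in zip(*local_buffers)]
    # Broadcast: every rank receives its own copy of the reduced buffer.
    return [list(total) for _ in local_buffers]


if __name__ == "__main__":
    # Four simulated workers, each holding a buffer filled with its rank.
    buffers = [[float(rank)] * 3 for rank in range(4)]
    for rank, result in enumerate(allreduce_sum(buffers)):
        print(rank, result)
```

The point of the sketch is that the result is identical on every worker, which is what distinguishes Allreduce from a plain reduce to a single root.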

Prerequisites

  1. Anaconda3-4.2.0 with Python 3.5 (note: Spark 2.1 does not support Python 3.6)
install anaconda
conda install libgcc
  2. Spark 2.2
download spark
cd spark
export PYSPARK_PYTHON=python3
./build/mvn -DskipTests clean package
  3. Open MPI 3.0.0
./configure --prefix=<installation directory> --with-cuda --with-libevent=external
make
make install
  4. MPI python wrapper, for example mpi4py 3.0
wget https://bitbucket.org/mpi4py/mpi4py/downloads/mpi4py-3.0.0.tar.gz
tar -xzf mpi4py-3.0.0.tar.gz
cd mpi4py-3.0.0
python setup.py build
python setup.py install
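
Before moving on to the installation, it may help to confirm that the prerequisites are visible from the current shell. The sketch below is one possible check, not part of the project. The tool list is an assumption inferred from the commands used in this README, and the helper names `missing_tools` and `has_module` are hypothetical.

```python
import importlib.util
import shutil

# Assumed tool list, inferred from the commands in this README:
# mpirun ships with Open MPI; cmake, make and git are used during installation.
REQUIRED_TOOLS = ("mpirun", "cmake", "make", "git")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def has_module(name):
    """Report whether a top-level Python module (for example mpi4py) is importable."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    for tool in missing_tools():
        print("missing tool:", tool)
    if not has_module("mpi4py"):
        print("missing Python module: mpi4py")
```

The script prints nothing when everything is found. It only checks that the tools exist, not that their versions match the ones listed above.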

Installation

export MPI_SRC=<Open MPI build directory>

git clone https://github.com/SciDriver/spark-mpi.git
mkdir build

cd build
cmake ../spark-mpi -DCMAKE_INSTALL_PREFIX=<installation directory>
make
sudo make install