
A high-performance, scalable, and efficient ShuffleManager plugin for Apache Spark that uses the UCX communication layer.


SparkUCX ShuffleManager Plugin

SparkUCX is a high-performance ShuffleManager plugin for Apache Spark that uses RDMA and other high-performance transports supported by UCX to perform shuffle data transfers in Spark jobs.

This open-source project is developed, maintained and supported by the UCF consortium.

Runtime requirements

Installation

Obtain SparkUCX

Please use the "Releases" page to download the SparkUCX jar file for your Spark version (e.g. spark-ucx-1.0-for-spark-2.4.0-jar-with-dependencies.jar). Put the SparkUCX jar file in $SPARK_UCX_HOME on all the nodes in your cluster.
If you would like to build the project yourself, please refer to the "Build" section below.

The UCX binaries must be on the Spark classpath on every Spark master and worker. They can be obtained by installing the latest version of Mellanox OFED or by following the UCX build instructions, e.g.:

% export UCX_PREFIX=/usr/local
% git clone https://github.com/openucx/ucx.git
% cd ucx
% ./contrib/configure-release --with-java --prefix=$UCX_PREFIX
% make -j`nproc` && make install
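
Because the UCX binaries have to be present on every master and worker, it can help to verify the install prefix on each node before configuring Spark. The following is a minimal sketch and not part of SparkUCX or UCX: `check_ucx_prefix` is a hypothetical helper name, and it assumes a standard UCX install that places the `libucp` shared library under `$UCX_PREFIX/lib`.

```shell
#!/bin/sh
# Hypothetical helper (not part of SparkUCX or UCX): succeed only if the
# given prefix contains a UCX shared library under <prefix>/lib.
# Assumption: a standard UCX install provides libucp.so* in that directory.
check_ucx_prefix() {
    prefix="$1"
    for lib in "$prefix"/lib/libucp.so*; do
        # If the glob matched nothing, $lib is the literal pattern and -e fails.
        if [ -e "$lib" ]; then
            echo "UCX found: $lib"
            return 0
        fi
    done
    echo "UCX not found under $prefix/lib" >&2
    return 1
}
```

On a node, this could be called as `check_ucx_prefix "$UCX_PREFIX" || exit 1` before starting the Spark daemons.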

Configuration

Provide Spark with the location of the SparkUCX plugin jar and the UCX shared binaries by using the extraClassPath options.

spark.driver.extraClassPath     $SPARK_UCX_HOME/spark-ucx-1.0-for-spark-2.4.0-jar-with-dependencies.jar:$UCX_PREFIX/lib
spark.executor.extraClassPath   $SPARK_UCX_HOME/spark-ucx-1.0-for-spark-2.4.0-jar-with-dependencies.jar:$UCX_PREFIX/lib

To enable the SparkUCX Shuffle Manager plugin, add the following configuration property to Spark (e.g. in $SPARK_HOME/conf/spark-defaults.conf):

spark.shuffle.manager   org.apache.spark.shuffle.UcxShuffleManager

For Spark 3.0, also add the SparkUCX ShuffleIO plugin:

spark.shuffle.sort.io.plugin.class org.apache.spark.shuffle.compat.spark_3_0.UcxLocalDiskShuffleDataIO
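
The properties in this section can be collected into a properties file with a small script. The sketch below is one possible way to do this, not an official SparkUCX tool: `write_sparkucx_conf` and its arguments are hypothetical names, and the helper writes the paths it is given literally rather than the `$SPARK_UCX_HOME` and `$UCX_PREFIX` placeholders.

```shell
#!/bin/sh
# Hypothetical helper (not part of SparkUCX): append the SparkUCX properties
# from this section to a Spark properties file.
write_sparkucx_conf() {
    conf_file="$1"      # e.g. $SPARK_HOME/conf/spark-defaults.conf
    jar_path="$2"       # full path to the SparkUCX jar
    ucx_lib_dir="$3"    # e.g. $UCX_PREFIX/lib
    spark_major="$4"    # "2" or "3"

    # Skip if a shuffle manager is already configured, so reruns do not
    # append duplicate lines.
    if [ -f "$conf_file" ] && grep -q '^spark\.shuffle\.manager' "$conf_file"; then
        return 0
    fi

    {
        printf 'spark.driver.extraClassPath     %s:%s\n' "$jar_path" "$ucx_lib_dir"
        printf 'spark.executor.extraClassPath   %s:%s\n' "$jar_path" "$ucx_lib_dir"
        printf 'spark.shuffle.manager   org.apache.spark.shuffle.UcxShuffleManager\n'
        # The ShuffleIO plugin line is only added for Spark 3.0.
        if [ "$spark_major" = "3" ]; then
            printf 'spark.shuffle.sort.io.plugin.class org.apache.spark.shuffle.compat.spark_3_0.UcxLocalDiskShuffleDataIO\n'
        fi
    } >> "$conf_file"
}
```

For example, `write_sparkucx_conf "$SPARK_HOME/conf/spark-defaults.conf" "$SPARK_UCX_HOME/spark-ucx-1.0-for-spark-2.4.0-jar-with-dependencies.jar" "$UCX_PREFIX/lib" 2` would append the three Spark 2.4 properties once, and a second call would leave the file unchanged.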

Build

Building the SparkUCX plugin requires Apache Maven and a Java 8+ JDK.

Build instructions:

% git clone https://github.com/openucx/sparkucx
% cd sparkucx
% mvn -DskipTests clean package -Pspark-2.4
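
After the build, the resulting jar still has to be placed in $SPARK_UCX_HOME on each node. The sketch below shows one way to locate it. `find_sparkucx_jar` is a hypothetical helper, and it assumes that the build writes an artifact into Maven's `target` directory whose name ends in `-jar-with-dependencies.jar`, like the release jar named earlier.

```shell
#!/bin/sh
# Hypothetical helper (not part of SparkUCX): print the path of the first
# jar-with-dependencies artifact under <project>/target, or fail if none.
# Assumption: the built jar is named like the release jar, i.e. it ends
# in -jar-with-dependencies.jar.
find_sparkucx_jar() {
    project_dir="$1"
    for jar in "$project_dir"/target/*-jar-with-dependencies.jar; do
        # If the glob matched nothing, $jar is the literal pattern and -f fails.
        if [ -f "$jar" ]; then
            echo "$jar"
            return 0
        fi
    done
    echo "no jar-with-dependencies found in $project_dir/target" >&2
    return 1
}
```

From the project root, the jar could then be copied with `cp "$(find_sparkucx_jar .)" "$SPARK_UCX_HOME"/`.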

Performance

The SparkUCX plugin is built to provide the best performance out of the box and offers multiple configuration options to further tune SparkUCX per job. For more information on how to set up the HiBench benchmark and reproduce the results, please refer to "Accelerated Apache SparkUCX 2.4/3.0 cluster deployment".

Performance results