RAPIDS Accelerator For Apache Spark

NOTE: For the latest stable README.md, ensure you are on the main branch. Documentation on the current release can be found here.

The RAPIDS Accelerator for Apache Spark provides a set of plugins for Apache Spark that leverage GPUs to accelerate processing via the RAPIDS libraries and UCX.

The chart above shows results from running ETL queries based on the TPCx-BB benchmark. These are not official results in any way. The run used a 10 TB dataset (scale factor 10,000) stored in Parquet and was processed on a two-node DGX-2 cluster. Each node has 96 CPU cores, 1.5 TB of host memory, 16 V100 GPUs, and 512 GB of GPU memory.

To get started and try out the plugin, use the getting started guide.

Compatibility

The SQL plugin tries to produce results that are bit-for-bit identical with Apache Spark. Operator compatibility is documented here.

Tuning

To tune your job and get the most performance out of it, start with the tuning guide.

Configuration

The plugin has a set of Spark configs that control its behavior and are documented here.
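
As a rough illustration of how such configs are typically passed to a job, the sketch below renders a small dictionary of settings as `spark-submit --conf` arguments. The specific keys and values shown are assumptions chosen for illustration, and `to_submit_args` is a hypothetical helper; check every key against the configuration documentation for your release before relying on it.

```python
import shlex

# Assumed settings for illustration only; verify each key and value
# against the configuration documentation for your release.
rapids_conf = {
    "spark.plugins": "com.nvidia.spark.SQLPlugin",
    "spark.rapids.sql.enabled": "true",
    "spark.rapids.sql.concurrentGpuTasks": "2",
}


def to_submit_args(conf):
    """Render a dict of Spark settings as spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


submit_args = to_submit_args(rapids_conf)
# Print a shell-safe command prefix that can be pasted in front of the job arguments.
print("spark-submit " + " ".join(shlex.quote(arg) for arg in submit_args))
```

Keeping the settings in one dictionary makes it easy to review them against the documented configs and to reuse them across jobs.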

Issues

We use GitHub issues to track bugs and feature requests, and to try to answer questions. You may file one here.

Build

There are two types of branches in this repository:

branch-[version]: development branches, which can change often. Note that we merge into the branch with the greatest version number, as that is our default branch.
main: the branch with the latest released code; the version tag (e.g. v0.1.0) is held here. main changes with new releases but does not change with every merged pull request, making it a more stable branch.
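
The "greatest version number" rule above is worth spelling out, because a plain string comparison would rank branch-0.9 above branch-0.10. A minimal sketch, assuming branch names follow the branch-[version] pattern with dot-separated integers (the branch names and the `default_dev_branch` helper below are hypothetical):

```python
import re


def default_dev_branch(branches):
    """Return the development branch with the greatest version number."""
    pattern = re.compile(r"^branch-(\d+(?:\.\d+)*)$")
    versioned = []
    for name in branches:
        match = pattern.match(name)
        if match:
            # Compare versions numerically, component by component.
            version = tuple(int(part) for part in match.group(1).split("."))
            versioned.append((version, name))
    if not versioned:
        raise ValueError("no branch-[version] branches found")
    return max(versioned)[1]


# Hypothetical branch list; "main" is ignored because it is not a development branch.
branches = ["main", "branch-0.2", "branch-0.10", "branch-0.9"]
print(default_dev_branch(branches))  # prints "branch-0.10"
```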

We use Maven for most aspects of the build. Some important parts of the build execute in Maven's "verify" phase, so we recommend running at least through the "verify" phase when building.

mvn verify

Tests are described here.

Integration

The RAPIDS Accelerator for Apache Spark provides some APIs for zero-copy data transfer into other GPU-enabled applications. They are described here.

Currently, we are working with XGBoost to try to provide this integration out of the box.

You may need to disable RMM caching when exporting data to an ML library. That library will likely want to use all of the GPU's memory, and if it is not aware of RMM, it will not have access to any of the memory that RMM is holding.
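
One way this might look in practice is to derive a separate set of Spark settings for the export job with RMM pooling switched off. The sketch below assumes the relevant key is `spark.rapids.memory.gpu.pooling.enabled`; that name is an assumption and may differ between releases, so confirm it in the configuration documentation. `conf_for_ml_export` is a hypothetical helper.

```python
# Assumed config key; confirm the exact name for your release in the
# configuration documentation before using it.
RMM_POOLING_KEY = "spark.rapids.memory.gpu.pooling.enabled"


def conf_for_ml_export(base_conf):
    """Return a copy of the job settings with RMM pooling disabled."""
    conf = dict(base_conf)  # copy so the ETL-only settings stay unchanged
    conf[RMM_POOLING_KEY] = "false"
    return conf


# Hypothetical base settings for an ETL-only job.
etl_conf = {"spark.plugins": "com.nvidia.spark.SQLPlugin"}
export_conf = conf_for_ml_export(etl_conf)

for key, value in sorted(export_conf.items()):
    print(f"--conf {key}={value}")
```

Disabling pooling only for jobs that hand data to an ML library leaves pure ETL jobs free to keep the default allocator behavior.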