 cuDF - GPU DataFrames

The RAPIDS cuDF library is a GPU DataFrame library built on Apache Arrow that accelerates loading, filtering, and manipulating data in preparation for model training. It provides a pandas-like API that will be familiar to data scientists, so they can build GPU-accelerated workflows more easily.
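
As a rough illustration of that pandas-like API, the sketch below builds a small GPU DataFrame, filters it, and copies the result back to pandas. It is a minimal example only: the exact set of supported constructors and methods (e.g. query, to_pandas) depends on the cuDF version you have installed.

import numpy as np
import cudf

# build a GPU DataFrame column by column
df = cudf.DataFrame()
df['key'] = np.arange(5)
df['value'] = np.arange(5, dtype=np.float64) * 10.0

# filter rows on the GPU with a pandas-like query expression
filtered = df.query('value > 20.0')

# simple reductions also run on the GPU
print(filtered['value'].mean())

# copy the result back to a pandas DataFrame on the host
print(filtered.to_pandas())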

Quick Start

Please see the Demo Docker Repository, choosing a tag based on the NVIDIA CUDA version you’re running. This provides a ready-to-run Docker container with example notebooks and data, showcasing how you can utilize cuDF.

Install cuDF

Conda

You can get a minimal conda installation with Miniconda or get the full installation with Anaconda.

You can install and update cuDF using the conda command:

conda install -c nvidia -c rapidsai -c numba -c conda-forge -c defaults cudf=0.3.0

Note: This conda installation only applies to Linux and Python versions 3.5/3.6.
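
After installation, a quick way to confirm the package works is a short Python check. The __version__ attribute is assumed to be present here; a plain import is enough if your build lacks it.

# sanity check: the import should succeed without errors
import cudf
print(cudf.__version__)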

You can create and activate a development environment using the conda commands:

# create the conda environment (assuming in base `cudf` directory)
$ conda env create --name cudf_dev --file conda/environments/dev_py35.yml
# activate the environment
$ source activate cudf_dev
# when not using default arrow version 0.10.0, run
$ conda install -c nvidia -c rapidsai -c numba -c conda-forge -c defaults pyarrow=$ARROW_VERSION

These commands install the required cmake, nvstrings, pyarrow, and other dependencies into the cudf_dev conda environment and activate it.

Pip

Support is coming soon, please use conda for the time being.

Development Setup

The following instructions, for building from source and setting up a development environment, are tested on Ubuntu 16.04 and 18.04. Other operating systems may be compatible, but are not currently supported.

Get libcudf Dependencies

Compiler requirements:

  • gcc version 5.4
  • nvcc version 9.2
  • cmake version 3.12

CUDA/GPU requirements:

  • CUDA 9.2+
  • NVIDIA driver 396.44+
  • Pascal architecture or better

You can obtain CUDA from https://developer.nvidia.com/cuda-downloads

Since cmake will download and build Apache Arrow (version 0.7.1 or 0.8+), you may need to install Boost C++ (version 1.58+) before running cmake:

# Install Boost C++ for Ubuntu 16.04/18.04
$ sudo apt-get install libboost-all-dev

or

# Install Boost C++ for Conda
$ conda install -c conda-forge boost

Build from Source

To install cuDF from source, ensure the dependencies are met and follow the steps below:

  1. Clone the repository
git clone --recurse-submodules https://github.com/rapidsai/cudf.git
cd cudf
  2. Create the conda development environment cudf_dev as detailed above
  3. Build and install libcudf
$ cd /path/to/cudf/cpp                              # navigate to C/C++ CUDA source root directory
$ mkdir build                                       # make a build directory
$ cd build                                          # enter the build directory
$ cmake .. -DCMAKE_INSTALL_PREFIX=/install/path     # configure cmake ... use $CONDA_PREFIX if you're using Anaconda
$ make -j                                           # compile the libraries librmm.so, libcudf.so ... '-j' will start a parallel job using the number of physical cores available on your system
$ make install                                      # install the libraries librmm.so, libcudf.so to '/install/path'

To run tests (Optional):

$ make test

Build and install cffi bindings:

$ make python_cffi                                  # build CFFI bindings for librmm.so, libcudf.so
$ make install_python                               # install python bindings into site-packages
$ cd python && py.test -v                           # optional, run python tests on low-level python bindings
  4. Build the cudf python package from the python folder:
$ cd ../../python
$ python setup.py build_ext --inplace

To run Python tests (Optional):

$ py.test -v                                        # run python tests on cudf python bindings
  5. Finally, install the Python package to your Python path:
$ python setup.py install                           # install cudf python bindings
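
As a quick smoke test of the freshly installed bindings, you can run a small computation end to end. This is only a sketch; the exact constructors available depend on your cuDF version.

import numpy as np
import cudf

# create two GPU columns and add them element-wise on the GPU
df = cudf.DataFrame()
df['a'] = np.arange(10)
df['b'] = np.arange(10)
df['c'] = df['a'] + df['b']
print(df['c'].sum())   # expected: 90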

Automated Build in Docker Container

A Dockerfile is provided with a preconfigured conda environment for building and installing cuDF from source, based on the master branch.

Prerequisites

  • Install nvidia-docker2 for Docker + GPU support
  • Verify NVIDIA driver is 396.44 or higher
  • Ensure CUDA 9.2+ is installed

Usage

From the cudf project root, run the following to build with defaults:

$ docker build --tag cudf .

After the container is built, run it:

$ docker run --runtime=nvidia -it cudf bash

Activate the conda environment cudf to use the newly built cuDF and libcudf libraries:

root@3f689ba9c842:/# source activate cudf
(cudf) root@3f689ba9c842:/# python -c "import cudf"
(cudf) root@3f689ba9c842:/#

Customizing the Build

Several build arguments are available to customize the build process of the container. These are specified using Docker's --build-arg flag; for example, docker build --build-arg CUDA_VERSION=10.0 --tag cudf . builds against CUDA 10.0. Below is a list of the available arguments and their purpose:

Build Argument  | Default Value | Other Value(s)  | Purpose
CUDA_VERSION    | 9.2           | 10.0            | set CUDA version
LINUX_VERSION   | ubuntu16.04   | ubuntu18.04     | set Ubuntu version
CC & CXX        | 5             | 7               | set gcc/g++ version; NOTE: gcc7 requires Ubuntu 18.04
CUDF_REPO       | This repo     | Forks of cuDF   | set git URL to use for git clone
CUDF_BRANCH     | master        | Any branch name | set git branch to checkout of CUDF_REPO
NUMBA_VERSION   | 0.40.0        | Not supported   | set numba version
NUMPY_VERSION   | 1.14.3        | Not supported   | set numpy version
PANDAS_VERSION  | 0.20.3        | Not supported   | set pandas version
PYARROW_VERSION | 0.10.0        | 0.8.0+          | set pyarrow version
PYTHON_VERSION  | 3.5           | 3.6             | set python version

Open GPU Data Science

The RAPIDS suite of open source software libraries aims to enable execution of end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization, but exposes that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.

Apache Arrow on GPU

The GPU version of Apache Arrow is a common API that enables efficient interchange of tabular data between processes running on the GPU. End-to-end computation on the GPU avoids unnecessary copying and converting of data off the GPU, reducing compute time and cost for high-performance analytics common in artificial intelligence workloads. As the name implies, cuDF uses the Apache Arrow columnar data format on the GPU. Currently, a subset of the features in Apache Arrow is supported.
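
As a rough sketch of this interchange, the example below moves a pandas DataFrame into GPU memory, computes on it there, and converts the result back to pandas. from_pandas and to_pandas are the assumed entry points; their availability may vary across cuDF versions.

import pandas as pd
import cudf

# start from a pandas DataFrame in host memory
pdf = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})

# copy it to the GPU as a cuDF DataFrame (Arrow columnar layout in GPU memory)
gdf = cudf.DataFrame.from_pandas(pdf)

# compute on the GPU; only the scalar result is transferred back to the host
print(gdf['b'].mean())

# convert back to pandas when a CPU-only library needs the data
print(gdf.to_pandas())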