
glm computations in R on gpus.


glmrgame

🚨 Highly experimental 🚨

glmrgame (pronounced "glimmer-game") is a package for glm-like computations in R ("glimmer") that run on gpus (video game hardware), with computations distributed over MPI.

Installation

The development version is maintained on GitHub:

remotes::install_github("RBigData/glmrgame")

You need a CUDA installation to build the package; CUDA is available from the NVIDIA website. You also need the development versions of the pbdMPI and kazaam packages (and, optionally, the curand R package):

remotes::install_github("wrathematics/pbdMPI")
remotes::install_github("rbigdata/kazaam")
remotes::install_github("wrathematics/curand")

You can also build a reference cpu version of the package. However, it is neither supported nor recommended; please just use kazaam instead. If you insist, you can install it with:

remotes::install_github("rbigdata/glmrgame", configure.args="--with-backend=CPU")

Examples

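# Load kazaam (distributed matrices over MPI) and glmrgame quietly
suppressMessages(library(kazaam))
suppressMessages(library(glmrgame))

data(iris)
# Response: +1 for setosa, -1 otherwise, distributed with expand()
is_setosa = expand((iris[, 5] == "setosa")*2 - 1)
# Predictors: the four numeric columns as a distributed matrix (shaq)
iris = expand(as.matrix(iris[, -5]))
# Prepend a column of ones as the intercept
iris = cbind(shaq(1, nrow=nrow(iris), ncol=1), iris)

# Fit the svm on the cpu (svm) and on the gpu (svm_game)
w_cpu = svm(iris, is_setosa)
w_gpu = svm_game(iris, is_setosa)

# Shut down MPI
finalize()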
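The example stops at fitting. As a hedged sketch of how such weights are typically used, assuming the returned weights can be converted to an ordinary numeric vector whose entries line up with the columns built above (intercept first), a linear classifier predicts sign(Xw), and accuracy is the percentage of predictions that match the +1/-1 labels. The helper `linear_accuracy()` and the weight vector `w` are hypothetical: `w` is hand-picked to encode "Petal.Length < 2.5 means setosa" and is not output from svm() or svm_game(). The code uses only base R on an ordinary (non-distributed) copy of iris.

```r
# Hypothetical helper: percentage of rows where sign(X %*% w) matches y (+1/-1)
linear_accuracy = function(X, w, y) {
  pred = ifelse(drop(X %*% w) >= 0, 1, -1)  # linear decision rule
  100 * mean(pred == y)
}

data(iris)
y = (iris[, 5] == "setosa")*2 - 1     # +1 for setosa, -1 otherwise
X = cbind(1, as.matrix(iris[, -5]))   # intercept column first, as in the example

# Hand-picked weights (an assumption, not a fitted model):
# score = 2.5 - Petal.Length, which is positive exactly when Petal.Length < 2.5
w = c(2.5, 0, 0, -1, 0)
acc = linear_accuracy(X, w, y)
print(acc)
```

Because setosa petals are all shorter than 2.5 and the other species' petals are all longer, this particular rule classifies every iris row correctly.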

Benchmarks

All timings are from:

  • A DGX-1
  • R 3.4.4
  • OpenBLAS
  • CUDA 9.0.176

The benchmark consists of generating data from 2 different random normal distributions and using svm to classify the data. The data consists of 251 columns (250 data + 1 intercept) and however many rows are required to reach the desired total dataset size.
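As a rough illustration of this setup, the R sketch below computes how many rows fit a given total size and generates a small two-class dataset with an intercept column. It is not the benchmark script: it assumes double-precision storage (8 bytes per value), counts only the 251-column data matrix toward the total, and uses hypothetical helpers (`rows_for_size()`, `generate_data()`) with arbitrarily chosen class means and an even class split.

```r
# Hypothetical sketch of the benchmark data layout; not the actual script.
# Assumes 8 bytes per value and counts only the data matrix toward the size.
ncols = 251  # 250 data columns + 1 intercept column

# Number of rows that fit in a given total size (in GiB)
rows_for_size = function(gib, ncols, bytes = 8) {
  floor(gib * 2^30 / (ncols * bytes))
}

# Two classes drawn from normal distributions with different means
# (the means and the even class split are assumptions)
generate_data = function(n, ncols, mean_pos = 1, mean_neg = -1) {
  n_pos = n %/% 2
  n_neg = n - n_pos
  p = ncols - 1
  X_pos = matrix(rnorm(n_pos * p, mean = mean_pos), n_pos, p)
  X_neg = matrix(rnorm(n_neg * p, mean = mean_neg), n_neg, p)
  X = cbind(1, rbind(X_pos, X_neg))     # intercept column first
  y = c(rep(1, n_pos), rep(-1, n_neg))  # +1/-1 labels
  list(X = X, y = y)
}

rows_16 = rows_for_size(16, ncols)  # 8555711 rows for 16 GiB
print(rows_16)

set.seed(1234)
small = generate_data(1000, ncols)  # small instance for a quick check
print(dim(small$X))
```

Under these assumptions, a 16 GiB problem has about 8.56 million rows.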

For reasons I don't wish to explain, I use 13 cores in the cpu-only runs for every gpu in the gpu runs. The goal is for the gpu runs to be faster even at this 13-to-1 ratio. First we set:

For a 16 GiB total problem size (distributed among the MPI ranks), we get:

### gpu --- 2 resources 
data time: 64.57 
svm time:  11.971 
accuracy:  100 

### cpu --- 26 resources (2 threads per rank)
data time: 16.197 
svm time:  111.445 
accuracy:  73.7294574940224 

### cpu --- 26 resources (1 thread per rank)
data time: 16.307 
svm time:  109.208 
accuracy:  73.7294574940224 

In the gpu case, the data is generated on the gpu using the curand R package. However, this approach requires many more memory operations, and each gpu rank's local problem is 13x larger than that of each rank in the cpu-only case. Hence the relatively slow data generation.
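The 13x figure follows from splitting the same total problem across far fewer ranks. A minimal arithmetic sketch, assuming the 16 GiB problem is split evenly and that there is one MPI rank per gpu and one per cpu core (the variable names are illustrative):

```r
# Per-rank share of the 16 GiB problem, assuming an even split across ranks
total_gib = 16
gpu_ranks = 2    # assumed: one MPI rank per gpu
cpu_ranks = 26   # assumed: one MPI rank per cpu core

gpu_local = total_gib / gpu_ranks  # 8 GiB per gpu rank
cpu_local = total_gib / cpu_ranks  # roughly 0.615 GiB per cpu rank
ratio = gpu_local / cpu_local      # equals cpu_ranks / gpu_ranks
print(ratio)
```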

If we re-run with 1 gpu vs 13 cores (instead of 2 vs 26 above) on half the problem size (8 GiB total), we get:

### gpu --- 1 resources 
data time: 57.361 
svm time:  10.74 
accuracy:  100 

### cpu --- 13 resources (2 threads per rank)
data time: 15.057 
svm time:  96.4 
accuracy:  73.7324346688658 

### cpu --- 13 resources (1 thread per rank)
data time: 14.735 
svm time:  96.637 
accuracy:  73.7324346688658 
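To compare the two runs directly, the sketch below recomputes speedups from the timings reported above, using the 2-threads-per-rank cpu results. It assumes all times are in the same unit and that end-to-end time is simply data time plus svm time; the variable names are illustrative.

```r
# Timings copied from the benchmark output above
gpu = list(
  run16 = c(data = 64.57,  svm = 11.971),  # 16 GiB, 2 gpus
  run8  = c(data = 57.361, svm = 10.74)    # 8 GiB, 1 gpu
)
cpu = list(
  run16 = c(data = 16.197, svm = 111.445), # 26 resources, 2 threads per rank
  run8  = c(data = 15.057, svm = 96.4)     # 13 resources, 2 threads per rank
)

# cpu time divided by gpu time: > 1 means the gpu run is faster
speedups = sapply(names(gpu), function(run) c(
  svm   = cpu[[run]][["svm"]] / gpu[[run]][["svm"]],
  total = sum(cpu[[run]]) / sum(gpu[[run]])
))
print(round(speedups, 2))
```

With these numbers the svm step is roughly 9x faster on the gpus in both runs, while the end-to-end speedup is only about 1.6 to 1.7x because of the slower gpu data generation.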

The benchmark scripts are in the scripts/ directory of the glmrgame source tree.