ceslobfer/GPUmatrix

Problems solving sparse linear least squares regression using Cholesky decomposition

Closed this issue · 5 comments

Thanks for your great work on this package!

I was wondering if you could advise me a bit on how to use GPUmatrix to solve a sparse linear least squares regression using Cholesky decomposition.

Given my sparse covariate matrix Xswaug and dense outcome variable ywaug I was doing this in base R as follows:

library(devtools)
devtools::install_github("tomwenseleers/L0glm/L0glm")
library(L0glm)
sim = simulate_spike_train(n=10000, p=10000, k=200, sparse=TRUE) # blurred spike train with 10000 timepoints & 200 blurred peaks
library(Matrix)
Xs = sim$X # sparse covariate matrix Xs of class dgCMatrix (time-shifted copies of the point spread function)
y = sim$y
weights <- 1/(y+0.1) # observation weights (approx 1/Poisson noise variance)
yw = y*sqrt(weights) # weighted y
ywaug = c(yw, rep(0, ncol(Xs))) # augmented weighted y
lambdas <- rep(1, ncol(Xs)) # ridge penalties on individual coefficients
Xsw <- Diagonal(x=sqrt(weights)) %*% Xs # weighted X = X*sqrt(weights)
Xswaug <- rbind(Xsw, Diagonal(x=sqrt(lambdas))) # row augmented covariate matrix

# Cholesky decomposition based solve of weighted least squares problem
chol.solve = function (X, y, tol=.Machine$double.eps^(2/3)) {
  ch <- base::chol(crossprod(X)) # solve using Cholesky decomposition
  backsolve(ch, forwardsolve(ch, crossprod(X, y), upper = TRUE, trans = TRUE))
}

This runs very fast when using Intel MKL (Microsoft R Open 4.0.2) (using 8 cores on my Intel Core i9) :
system.time(s1 <- chol.solve(Xswaug, ywaug)) # 1.4s when using Intel MKL (Microsoft R Open 4.0.2), 190s without Intel MKL

I thought the equivalent in GPUmatrix would be to use chol_solve, but for some reason my timings get much worse compared to the Intel MKL Cholesky solve (39s with GPU version of tensorflow installed or 24s with CPU version of tensorflow installed, this is using my Nvidia RTX A2000 laptop GPU) :

# devtools::install_github(" ceslobfer/GPUmatrix")
library(GPUmatrix)
system.time({Xswaug_gpu <- gpu.matrix(Xswaug, type = "tensorflow", dtype = "float32", sparse = T)
            ywaug_gpu <- gpu.matrix(ywaug, type = "tensorflow", dtype = "float32", sparse = F)  
            s2 <- chol_solve(chol(crossprod(Xswaug_gpu)), crossprod(Xswaug_gpu, ywaug_gpu))}) # 39s with GPU version of tensorflow, 24s with CPU version of tensorflow

Is this to be expected? Or am I doing something wrong here, and would I need explicitly defined backsolve & forwardsolve functions here?

If I use torch as a backend I get these errors, no matter if I use the CPU or the GPU versions:

system.time({Xswaug_gpu <- gpu.matrix(Xswaug, device = "cpu", dtype = "float32", sparse = T)
             ywaug_gpu <- gpu.matrix(ywaug, device = "cpu", dtype = "float32", sparse = F)  
             s_toch_cpu <- chol_solve(chol(crossprod(Xswaug_gpu)), crossprod(Xswaug_gpu, ywaug_gpu))})
# returns errors
# [W C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\BatchLinearAlgebra.cpp:1626] Warning: torch.cholesky is deprecated in favor of torch.linalg.cholesky and will be removed in a future PyTorch release.
# L = torch.cholesky(A)
# should be replaced with
# L = torch.linalg.cholesky(A)
# and
# U = torch.cholesky(A, upper=True)
# should be replaced with
# U = torch.linalg.cholesky(A).mH().
# This transform will produce equivalent results for all valid (symmetric positive definite) inputs. (function operator ())
# Error in (function (self, input2, upper)  : 
#             A must be batches of square matrices, but they are 10000 by 1 matrices
#           Exception raised from linearSolveCheckInputs at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen/native/LinearAlgebraUtils.h:288 (most recent call first):
#             00007FFCBD16AD1200007FFCBD16ACB0 c10.dll!c10::Error::Error [<unknown file> @ <unknown line number>]
#           00007FFCBD16A79E00007FFCBD16A750 c10.dll!c10::detail::torchCheckFail [<unknown file> @ <unknown line number>]
#           00007FFC3935F1B900007FFC3935EB90 torch_cpu.dll!at::native::linalg_vecdot_out [<unknown file> @ <unknown line number>]
#           00007FFC3934D75500007FFC3934D270 torch_cpu.dll!at::native::_cholesky_solve_helper_cpu [<unknown file> @ <unknown line number>]
#           00007FFC39351C3200007FFC39351B10 torch_cpu.dll!at::native::cholesky_solve [<unknown file> @ <unknown line number>]
#           00007FFC3A10946E00007FFC3A1080E0 torch_cpu.dll!at::compositeexplicitautograd::xlogy_ [<unknown file> @ <unknown line number>]
#           00007FFC3A0DD46A00007FFC3A0B7CA0 torch_c
#           In addition: Warning message:
#             In warningSparseTensor_torch(y) :
#             Not allowed with sparse matrix, matrix will be cast to dense for the operation. May cause memory problems
#           Timing stopped at: 7.84 0.37 3.21

system.time({Xswaug_gpu <- gpu.matrix(Xswaug, device = "gpu", dtype = "float32", sparse = T)
                                      ywaug_gpu <- gpu.matrix(ywaug, device = "gpu", dtype = "float32", sparse = F)  
                                      s_toch_gpu <- chol_solve(chol(crossprod(Xswaug_gpu)), crossprod(Xswaug_gpu, ywaug_gpu))}) # 39s
# returns errors
# Error in (function (self, input2, upper)  : 
#             A must be batches of square matrices, but they are 10000 by 1 matrices
#           Exception raised from linearSolveCheckInputs at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen/native/LinearAlgebraUtils.h:288 (most recent call first):
#             00007FFCBD16AD1200007FFCBD16ACB0 c10.dll!c10::Error::Error [<unknown file> @ <unknown line number>]
#           00007FFCBD16A79E00007FFCBD16A750 c10.dll!c10::detail::torchCheckFail [<unknown file> @ <unknown line number>]
#           00007FFC3935F1B900007FFC3935EB90 torch_cpu.dll!at::native::linalg_vecdot_out [<unknown file> @ <unknown line number>]
#           00007FFC3934D75500007FFC3934D270 torch_cpu.dll!at::native::_cholesky_solve_helper_cpu [<unknown file> @ <unknown line number>]
#           00007FFC39351C3200007FFC39351B10 torch_cpu.dll!at::native::cholesky_solve [<unknown file> @ <unknown line number>]
#           00007FFC3A10946E00007FFC3A1080E0 torch_cpu.dll!at::compositeexplicitautograd::xlogy_ [<unknown file> @ <unknown line number>]
#           00007FFC3A0DD46A00007FFC3A0B7CA0 torch_c
#           In addition: Warning message:
#             In warningSparseTensor_torch(y) :
#             Not allowed with sparse matrix, matrix will be cast to dense for the operation. May cause memory problems
#           Timing stopped at: 6.45 0.3 1.54

Any thoughts for how to get rid of these errors and how to obtain good performance for this usage case?

Regarding the torch. We will check htat. While the GPUmatrix was being checked at CRAN, torch cahnged its dependency to a different CUDA version (now is 11.7) and was updated. We have to rewrite the code related to torch to remove warnings and, of course, errors.

Many thanks for your detailed replies - that's super helpful! Bit of a shame maybe that support for sparse matrices/tensors is not better developed yet in tensorflow and torch - that no doubt must explain the timings. I also tried the RcppEigen sparse solvers (https://eigen.tuxfamily.org/dox/group__TopicSparseSystems.html) and SimplicialLLT and SimplicialLDLT and ConjugateGradient all had good timings too, similar to Cholesky with IntelMKL above, but with the advantage that timings are independent of the installed BLAS in R (LeastSquaresConjugateGradient was also good, but solution was not always very stable).

Hello Tom, thank you for using our package and giving feedback. It is very useful for us to polish the details. Indeed, the function for performing Cholesky in TensorFlow is not very refined. We have noticed a small bug in the chol_sol() function of GPUmatrix using Torch, the arguments are reversed. It was an oversight on our part not to realize this, as our test conditions did not fail. All you have to do in the code is the following:
system.time({Xswaug_gpu <- gpu.matrix(Xswaug, device = "gpu", dtype = "float32", sparse = T) ywaug_gpu <- gpu.matrix(ywaug, device = "gpu", dtype = "float32", sparse = F) s_toch_gpu <- chol_solve( crossprod(Xswaug_gpu, ywaug_gpu),chol(crossprod(Xswaug_gpu)))})

I have tried running it and did not have any issues doing it this way. Additionally, the execution was quite fast. You can try doing it like this. We will fix the bug on CRAN shortly. Thank you again for the feedback.

In my computer (i7 10700 + RTX 2080 Ti) This code

library(GPUmatrix)
Xswaug_gpu <- gpu.matrix(Xswaug, dtype = "float32", sparse = T)
ywaug_gpu <- gpu.matrix(ywaug, dtype = "float32", sparse = F)  
system.time(s2 <- chol_solve(crossprod(Xswaug_gpu, ywaug_gpu),chol(crossprod(Xswaug_gpu))))

takes 0.40 s (float32) on torch (the default value).
Using float64,...

Xswaug_gpu <- gpu.matrix(Xswaug, dtype = "float64", sparse = T)
ywaug_gpu <- gpu.matrix(ywaug, dtype = "float64", sparse = F)  
system.time(s2 <- chol_solve(crossprod(Xswaug_gpu, ywaug_gpu),chol(crossprod(Xswaug_gpu))))

it takes 1.2s. Using the CPU (in my computer) took around 1.9s. It seems that the implementation in torch is better than in Tensorflow.