Evovest/EvoTrees.jl

device="gpu" not working with MLJ interface

xgdgsc opened this issue · 7 comments

I just ran the two examples in the README; the only difference between them is the device parameter. The MLJ interface doesn't use the GPU on fit!, while the internal API version uses the GPU fine.
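For reference, a minimal sketch of the two runs, assuming the v0.9 conventions from the README (a positional fit_evotree(params, X, Y) internal API and a device keyword on the model):

using EvoTrees
using MLJBase

# toy data standing in for the README's
X = rand(Float32, 1_000, 5)
Y = rand(Float32, 1_000)

# the only difference between the two runs: device="gpu"
params = EvoTreeRegressor(loss=:linear, nrounds=100, device="gpu")

# internal API: a GPU process shows up in nvidia-smi
model = fit_evotree(params, X, Y)

# MLJ interface: fit! runs, but no GPU process appears
mach = machine(params, MLJBase.table(X), Y)
fit!(mach)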

@xgdgsc Thanks for reporting.

I maintain MLJ and may be able to help. Just at the moment my GPU access is playing up, so I'm not able to reproduce right now. In the meantime, it would help if you could provide a few details. For example, is there any error thrown? If not, what makes you say the MLJ model is not running on the GPU? Is it possible that it is running, but not as fast as expected?

Thanks.

No error output. I checked nvidia-smi and there is no GPU process when using MLJ.

Okay, I have a different experience: the CPU run is fine, but the GPU run throws this error:

julia> fit!(mach_gpu, rows=train, verbosity=1)                                                                                                        
[ Info: Training Machine{EvoTreeRegressor{Float64,},}.                                                                                              
┌ Error: Problem fitting the machine Machine{EvoTreeRegressor{Float64,},}.                                                                          
└ @ MLJBase ~/.julia/packages/MLJBase/QXObv/src/machines.jl:533                                                                                       
[ Info: Running type checks...                                                                                                                        
[ Info: Type checks okay.                                                                                                                             
ERROR: MethodError: no method matching update_grads!(::EvoTrees.Linear, ::Matrix{Float64}, ::Matrix{Float64}, ::CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}, ::Float64)
Closest candidates are:                                                                                                                               
  update_grads!(::EvoTrees.Linear, ::Matrix{T}, ::Matrix{T}, ::Vector{T}, ::T) where T<:AbstractFloat at ~/Julia/EvoTrees/src/loss.jl:2               
  update_grads!(::EvoTrees.Logistic, ::Matrix{T}, ::Matrix{T}, ::Vector{T}, ::T) where T<:AbstractFloat at ~/Julia/EvoTrees/src/loss.jl:10            
  update_grads!(::EvoTrees.Poisson, ::Matrix{T}, ::Matrix{T}, ::Vector{T}, ::T) where T<:AbstractFloat at ~/Julia/EvoTrees/src/loss.jl:19    

@jeremiedb I don't know yet if this is a bug, but is there a reason the EvoTrees.update_grads! methods are not defined generically? Why Matrix and Vector instead of AbstractMatrix and AbstractVector in this line:

function update_grads!(::Linear, δ𝑤::Matrix{T}, p::Matrix{T}, y::Vector{T}, α::T) where {T<:AbstractFloat}
?

I mean, how else do you handle CUDA arrays? What am I missing?
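To make the question concrete, here is a standalone sketch (hypothetical names, not the EvoTrees source). A method annotated with the concrete Vector type is invisible to CuArray arguments, which is exactly the MethodError in the log above, while an AbstractVector annotation would catch them, since CuVector <: AbstractVector and the broadcast compiles to a GPU kernel:

using CUDA

# concrete signature: a CuVector is not a Vector, so GPU inputs throw a MethodError
grads_concrete!(δ::Vector{T}, p::Vector{T}, y::Vector{T}) where {T<:AbstractFloat} =
    (δ .= 2 .* (p .- y))

# abstract signature: CuVector <: AbstractVector, so one method covers both devices
grads_generic!(δ::AbstractVector{T}, p::AbstractVector{T}, y::AbstractVector{T}) where {T<:AbstractFloat} =
    (δ .= 2 .* (p .- y))

p, y = cu(rand(Float32, 4)), cu(rand(Float32, 4))
δ = similar(p)
grads_generic!(δ, p, y)      # works on the GPU
# grads_concrete!(δ, p, y)   # MethodError, as in the log above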

Code throwing the above error:

using StatsBase: sample
using EvoTrees
using EvoTrees: sigmoid, logit
using MLJBase
using Flux

features = rand(Float32, 10_000) .* 5 .- 2

X = reshape(features, (size(features)[1], 1)) |> MLJBase.table
Y = sin.(features) .* 0.5f0 .+ 0.5f0
Y = logit(Y) + randn(Float32, size(Y))
Y = sigmoid(Y)
y = Y

const Xgpu = MLJBase.table(gpu(MLJBase.matrix(X)))
const ygpu = gpu(y)

# @load EvoTreeRegressor                                                                                                                              
# linear regression                                                                                                                                   
tree_model = EvoTreeRegressor(loss=:linear, max_depth=5, η=0.05, nrounds=10)

# set machine                                                                                                                                         
mach_gpu = machine(tree_model, Xgpu, ygpu)

# partition data                                                                                                                                      
train, test = partition(eachindex(y), 0.7, shuffle=true); # 70:30 split                                                                               

# fit data                                                                                                                                            
fit!(mach_gpu, rows=train, verbosity=1)
(@testflux) pkg> status                                                                                                                               
      Status `~/.julia/environments/testflux/Project.toml`                                                                                            
  [f6006082] EvoTrees v0.9.1 `~/Julia/EvoTrees`                                                                                                       
  [587475ba] Flux v0.12.8                                                                                                                             
  [a7f614a8] MLJBase v0.18.26                                                                                                                         
  [094fc8d1] MLJFlux v0.2.6 `~/Julia/MLJ/MLJFlux`                                                                                                     
  [2913bbd2] StatsBase v0.33.14                                                                                                                       
  [bd369af6] Tables v1.6.1      

                                                                                                                                                      
julia> versioninfo()                                                                                                                                  
Julia Version 1.7.0                                                                                                                                   
Commit 3bf9d17731 (2021-11-30 12:12 UTC)                                                                                                              
Platform Info:                                                                                                                                        
  OS: Linux (x86_64-pc-linux-gnu)                                                                                                                     
  CPU: Intel Core Processor (Broadwell, IBRS)                                                                                                         
  WORD_SIZE: 64                                                                                                                                       
  LIBM: libopenlibm                                                                                                                                   
  LLVM: libLLVM-12.0.1 (ORCJIT, broadwell)   


@ablaom For the fit on GPU, specialized functions are defined under the gpu folder, for example:

function update_grads_gpu!(loss::Linear, δ::CuMatrix{T}, p::CuMatrix{T}, y::CuVector{T}; MAX_THREADS = 1024) where {T<:AbstractFloat}
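So rather than widening the CPU signatures, each loss has a parallel _gpu! method that launches a CUDA kernel. Roughly along these lines (a hypothetical kernel for the linear loss to illustrate the pattern, not the actual source):

using CUDA

function kernel_linear_grads!(δ, p, y)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(y)
        @inbounds δ[1, i] = 2 * (p[1, i] - y[i])  # gradient of (p - y)^2
        @inbounds δ[2, i] = 2                     # constant hessian for the linear loss
    end
    return nothing
end

function update_grads_gpu_demo!(δ::CuMatrix{T}, p::CuMatrix{T}, y::CuVector{T};
                                MAX_THREADS=1024) where {T<:AbstractFloat}
    threads = min(MAX_THREADS, length(y))
    blocks = cld(length(y), threads)
    @cuda threads=threads blocks=blocks kernel_linear_grads!(δ, p, y)
    CUDA.synchronize()
    return δ
end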

The immediate cause of the issue when using the GPU through MLJ is that the MLJ adapter hasn't taken into account the GPU support, which came later. As such, the fitting function called is the CPU one here:

fitresult, cache = init_evotree(model, A.matrix, y)
while the internal fitting routine considers the selected device and, if set to "gpu", calls:
function init_evotree_gpu(params::EvoTypes{T,U,S},

A quick fix would be to simply add a condition based on the "device" parameter. What I suspect won't work smoothly is the handling of validation metrics, since the in-house API converts both the train and eval data to GPU at initialization, while in MLJ the eval data is treated independently. I'll need to think a bit more about that last point.
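As a sketch, that condition would look roughly like this (hypothetical helper around the two init routines quoted above, not the actual patch):

# pick the init routine according to the model's device setting
function init_for_device(model, X, y)
    if model.device == "gpu"
        return init_evotree_gpu(model, X, y)  # GPU path, added after the MLJ adapter was written
    else
        return init_evotree(model, X, y)      # original CPU path
    end
end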

Okay, that explains it. And thanks for the diagnosis. Let me know if/when you need further support at our end.

@xgdgsc Could you give it a shot with the new EvoTrees.jl v0.9.4? I think it should be working now.
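For reference, a quick way to re-test through MLJ (a sketch assuming EvoTrees.jl >= 0.9.4 and the device keyword):

using MLJBase, EvoTrees

X = MLJBase.table(rand(Float32, 1_000, 5))
y = rand(Float32, 1_000)
tree_model = EvoTreeRegressor(loss=:linear, max_depth=5, η=0.05, nrounds=10, device="gpu")
mach = machine(tree_model, X, y)
fit!(mach, verbosity=1)   # a GPU process should now show up in nvidia-smi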

Thanks! Works now.