Evovest/EvoTrees.jl

Increasing max_depth causes memory leak

john-waczak opened this issue · 4 comments

I have been able to train an EvoTreeRegressor with the default parameters successfully. When I try to increase the max_depth parameter beyond 10 suddenly my memory usage spikes and Julia dies.

Here's a snippet from the REPL

julia> evo = EvoTreeRegressor(max_depth=15, rng=42)
EvoTreeRegressor(
    loss = EvoTrees.Linear(),
    nrounds = 10,
    λ = 0.0,
    γ = 0.0,
    η = 0.1,
    max_depth = 15,
    min_weight = 1.0,
    rowsample = 1.0,
    colsample = 1.0,
    nbins = 64,
    α = 0.5,
    metric = :mse,
    rng = MersenneTwister(42),
    device = "cpu")

julia> mach = machine(evo, Xtrain, CDOM_train)
Machine{EvoTreeRegressor{Float64,…},…} trained 0 times; caches data
  args: 
    1:  Source @710 ⏎ `Table{AbstractVector{Continuous}}`
    2:  Source @134 ⏎ `AbstractVector{Continuous}`


julia> fit!(mach, verbosity=2)
[ Info: Training Machine{EvoTreeRegressor{Float64,…},…}.

Process julia killed

@john-waczak Thanks for reporting! Good to know about this.

A complete minimum working example might speed up resolution, ideally without the MLJ wrapper.

Okay, here's an MWE. It crashes when I run the following on an Ubuntu 21.04 machine with 16 GB RAM and a 4-core i7-7700HQ @ 2.80 GHz:

using EvoTrees

# Simple Regression Demo
n=2000;
X = 2*(rand(n,2) .- 0.5);

y = X[:,1].^5 + X[:,2].^4 - X[:,1].^4 - X[:,2].^3

size(X)
size(y)

# train for first time with default settings
params1 = EvoTreeRegressor()
model = fit_evotree(params1, X, y)

# train with increased max_depth
# this causes julia to crash
params2 = EvoTreeRegressor(max_depth=20)
model = fit_evotree(params2, X, y) 

Here's the output of Pkg.status:

(evoTree_bug) pkg> status
      Status `~/gitRepos/evoTree_bug/Project.toml`
  [f6006082] EvoTrees v0.8.4

Here's a screenshot of my memory usage: [image]

Thanks for reporting!
From what I can tell, it doesn't seem to be an issue per se or a memory leak, but rather a consequence of design choices geared toward fitting speed, which result in significant memory pre-allocations. Specifically, histograms are pre-allocated for each tree node, and at a depth of 20 there are over 500K such nodes. What looks like a memory leak is actually a long pre-allocation process.
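For intuition, the node count alone explains the blow-up: a full binary tree of depth d has on the order of 2^d nodes, so the number of per-node pre-allocations grows exponentially with max_depth. A small sketch (my own illustration; actual per-node memory in EvoTrees also depends on nbins, the number of features, and threading):

# Node count grows exponentially with depth, so per-node pre-allocations
# (histograms, index buffers, ...) blow up quickly past depth ~10.
for depth in (10, 15, 20)
    nnodes = 2^depth - 1   # nodes in a full binary tree of the given depth
    println("max_depth = $depth  →  $nnodes nodes ($(2^(depth - 10))× the depth-10 count)")
end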

However, in a gradient boosted model each tree acts as a weak learner, and as such I'm not aware of situations where a depth much greater than 10 was of any value. Typically, a depth in the 3-8 range performs best. Let me know if you are in a situation where greater depth is needed; I'm afraid, though, that a significantly different design, potentially less efficient, would be needed to support such scenarios.
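If you want to confirm that shallow trees are enough for your data, a quick sweep over max_depth on a held-out split is cheap. A minimal sketch reusing fit_evotree and predict as in the MWE above (the split and MSE metric here are my own additions, not part of the original report):

using EvoTrees, Random, Statistics

Random.seed!(42)
n = 2000
X = 2 .* (rand(n, 2) .- 0.5)
y = X[:, 1].^5 .+ X[:, 2].^4 .- X[:, 1].^4 .- X[:, 2].^3

# simple 80/20 train/validation split
idx = shuffle(1:n)
train, valid = idx[1:1600], idx[1601:end]

# compare validation error across modest depths
for depth in 3:2:9
    params = EvoTreeRegressor(max_depth=depth, nrounds=100, η=0.1)
    model = fit_evotree(params, X[train, :], y[train])
    pred = predict(model, X[valid, :])
    mse = mean((pred .- y[valid]).^2)
    println("max_depth = $depth  →  validation MSE = $(round(mse, sigdigits=4))")
end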

@jeremiedb Thanks for your reply! That makes a lot of sense. I think I should be more than fine with a smaller max_depth. I was trying some hyper-parameter variation just to see what would happen and noticed the script kept dying once it got past 10 or so.