Evovest/EvoTrees.jl

BoundsError on split_set_threads!

olivierlabayle opened this issue · 11 comments

Hi,

I think I am facing an edge case where the tree split results in a BoundsError. It has been quite tedious to come up with a reproducible example, and it is not ideal since it originates from a large dataset (that I can probably share if needed). Due to the asynchronous fitting strategy of MLJ, this is also hard to debug (I can't step into the call). The line that throws the error is this one. Do you see any reason why this could result in a BoundsError? I must also add that the error is stochastic: changing to rng = StableRNG(1234), for instance, does not raise it.

The code and stacktrace are below (but you won't be able to reproduce without the dataset):

code:

using CSV, MLJ, DataFrames, MLJBase, EvoTrees
using StableRNGs

rng = StableRNG(123)

data = CSV.read("/Users/olivierlabayle/Downloads/pb_data.csv", DataFrame)
y = categorical(data.target)
X = data[!, Not(:target)]

evotree = EvoTreeClassifier(rng=rng)
ranges = [
    range(evotree, :max_depth, lower=5, upper=7), 
    range(evotree, :lambda, lower=1e-5, upper=10, scale=:log)
]
tuned_evotree = TunedModel(
    model=evotree,
    resampling=Holdout(shuffle=false, rng=rng),
    tuning=Grid(goal=10, rng=rng),
    range=ranges,
    measure=log_loss
)

MLJBase.fit(tuned_evotree, 1, X, y)

stacktrace:

ERROR: BoundsError: attempt to access 335997-element Vector{UInt32} at index [335998:336120]
Stacktrace:
  [1] throw_boundserror(A::Vector{UInt32}, I::Tuple{UnitRange{Int64}})
    @ Base ./abstractarray.jl:703
  [2] checkbounds
    @ ./abstractarray.jl:668 [inlined]
  [3] view
    @ ./subarray.jl:177 [inlined]
  [4] split_set_threads!(out::Vector{UInt32}, left::Vector{UInt32}, right::Vector{UInt32}, is::SubArray{UInt32, 1, Vector{UInt32}, Tuple{UnitRange{Int64}}, true}, x_bin::Matrix{UInt8}, feat::Int64, cond_bin::UInt8, offset::Int64)
    @ EvoTrees ~/.julia/packages/EvoTrees/ayRL8/src/find_split.jl:147
  [5] grow_tree!(tree::EvoTrees.Tree{EvoTrees.Softmax, 2, Float32}, nodes::Vector{EvoTrees.TrainNode{Float32, SubArray{UInt32, 1, Vector{UInt32}, Tuple{UnitRange{Int64}}, true}}}, params::EvoTreeClassifier{EvoTrees.Softmax, Float32}, ∇::Matrix{Float32}, edges::Vector{Vector{Float32}}, js::Vector{UInt32}, out::Vector{UInt32}, left::Vector{UInt32}, right::Vector{UInt32}, x_bin::Matrix{UInt8}, monotone_constraints::Vector{Int32})
    @ EvoTrees ~/.julia/packages/EvoTrees/ayRL8/src/fit.jl:229
  [6] grow_evotree!(evotree::EvoTree{EvoTrees.Softmax, 2, Float32}, cache::NamedTuple{(:info, :x, :y, :w, :K, :nodes, :pred, :is_in, :is_out, :mask, :js_, :js, :out, :left, :right, :∇, :edges, :x_bin, :monotone_constraints), Tuple{Dict{Symbol, Int64}, Matrix{Float32}, Vector{UInt32}, Vector{Float32}, Int64, Vector{EvoTrees.TrainNode{Float32, SubArray{UInt32, 1, Vector{UInt32}, Tuple{UnitRange{Int64}}, true}}}, Matrix{Float32}, Vector{UInt32}, Vector{UInt32}, Vector{UInt8}, Vector{UInt32}, Vector{UInt32}, Vector{UInt32}, Vector{UInt32}, Vector{UInt32}, Matrix{Float32}, Vector{Vector{Float32}}, Matrix{UInt8}, Vector{Int32}}}, params::EvoTreeClassifier{EvoTrees.Softmax, Float32})
    @ EvoTrees ~/.julia/packages/EvoTrees/ayRL8/src/fit.jl:142
  [7] fit(model::EvoTreeClassifier{EvoTrees.Softmax, Float32}, verbosity::Int64, A::NamedTuple{(:matrix, :names), Tuple{SubArray{Float64, 2, Matrix{Float64}, Tuple{Vector{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}, Vector{Symbol}}}, y::SubArray{CategoricalArrays.CategoricalValue{Bool, UInt32}, 1, CategoricalArrays.CategoricalVector{Bool, UInt32, Bool, CategoricalArrays.CategoricalValue{Bool, UInt32}, Union{}}, Tuple{Vector{Int64}}, false}, w::Nothing)
    @ EvoTrees ~/.julia/packages/EvoTrees/ayRL8/src/MLJ.jl:9
  [8] fit(model::EvoTreeClassifier{EvoTrees.Softmax, Float32}, verbosity::Int64, A::NamedTuple{(:matrix, :names), Tuple{SubArray{Float64, 2, Matrix{Float64}, Tuple{Vector{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}, Vector{Symbol}}}, y::SubArray{CategoricalArrays.CategoricalValue{Bool, UInt32}, 1, CategoricalArrays.CategoricalVector{Bool, UInt32, Bool, CategoricalArrays.CategoricalValue{Bool, UInt32}, Union{}}, Tuple{Vector{Int64}}, false})
    @ EvoTrees ~/.julia/packages/EvoTrees/ayRL8/src/MLJ.jl:2
  [9] fit_only!(mach::Machine{EvoTreeClassifier{EvoTrees.Softmax, Float32}, true}; rows::Vector{Int64}, verbosity::Int64, force::Bool, composite::Nothing)
    @ MLJBase ~/.julia/packages/MLJBase/9Nkjh/src/machines.jl:680
 [10] #fit!#63
    @ ~/.julia/packages/MLJBase/9Nkjh/src/machines.jl:778 [inlined]
 [11] fit_and_extract_on_fold
    @ ~/.julia/packages/MLJBase/9Nkjh/src/resampling.jl:1180 [inlined]
 [12] (::MLJBase.var"#307#308"{MLJBase.var"#fit_and_extract_on_fold#330"{Vector{Tuple{Vector{Int64}, Vector{Int64}}}, Nothing, Nothing, Int64, Vector{LogLoss{Float64}}, Vector{typeof(predict)}, Bool, Bool, CategoricalArrays.CategoricalVector{Bool, UInt32, Bool, CategoricalArrays.CategoricalValue{Bool, UInt32}, Union{}}, DataFrame}, Machine{EvoTreeClassifier{EvoTrees.Softmax, Float32}, true}, Int64})(k::Int64)
    @ MLJBase ~/.julia/packages/MLJBase/9Nkjh/src/resampling.jl:1019
 [13] mapreduce_first
    @ ./reduce.jl:419 [inlined]
 [14] _mapreduce(f::MLJBase.var"#307#308"{MLJBase.var"#fit_and_extract_on_fold#330"{Vector{Tuple{Vector{Int64}, Vector{Int64}}}, Nothing, Nothing, Int64, Vector{LogLoss{Float64}}, Vector{typeof(predict)}, Bool, Bool, CategoricalArrays.CategoricalVector{Bool, UInt32, Bool, CategoricalArrays.CategoricalValue{Bool, UInt32}, Union{}}, DataFrame}, Machine{EvoTreeClassifier{EvoTrees.Softmax, Float32}, true}, Int64}, op::typeof(vcat), #unused#::IndexLinear, A::UnitRange{Int64})
    @ Base ./reduce.jl:430
 [15] _mapreduce_dim
    @ ./reducedim.jl:365 [inlined]
 [16] #mapreduce#765
    @ ./reducedim.jl:357 [inlined]
 [17] mapreduce
    @ ./reducedim.jl:357 [inlined]
 [18] _evaluate!(func::MLJBase.var"#fit_and_extract_on_fold#330"{Vector{Tuple{Vector{Int64}, Vector{Int64}}}, Nothing, Nothing, Int64, Vector{LogLoss{Float64}}, Vector{typeof(predict)}, Bool, Bool, CategoricalArrays.CategoricalVector{Bool, UInt32, Bool, CategoricalArrays.CategoricalValue{Bool, UInt32}, Union{}}, DataFrame}, mach::Machine{EvoTreeClassifier{EvoTrees.Softmax, Float32}, true}, #unused#::CPU1{Nothing}, nfolds::Int64, verbosity::Int64)
    @ MLJBase ~/.julia/packages/MLJBase/9Nkjh/src/resampling.jl:1018
 [19] evaluate!(mach::Machine{EvoTreeClassifier{EvoTrees.Softmax, Float32}, true}, resampling::Vector{Tuple{Vector{Int64}, Vector{Int64}}}, weights::Nothing, class_weights::Nothing, rows::Nothing, verbosity::Int64, repeats::Int64, measures::Vector{LogLoss{Float64}}, operations::Vector{typeof(predict)}, acceleration::CPU1{Nothing}, force::Bool)
    @ MLJBase ~/.julia/packages/MLJBase/9Nkjh/src/resampling.jl:1221
 [20] evaluate!(::Machine{EvoTreeClassifier{EvoTrees.Softmax, Float32}, true}, ::Holdout, ::Nothing, ::Nothing, ::Nothing, ::Int64, ::Int64, ::Vector{LogLoss{Float64}}, ::Vector{typeof(predict)}, ::CPU1{Nothing}, ::Bool)
    @ MLJBase ~/.julia/packages/MLJBase/9Nkjh/src/resampling.jl:1292
 [21] fit(::Resampler{Holdout}, ::Int64, ::DataFrame, ::CategoricalArrays.CategoricalVector{Bool, UInt32, Bool, CategoricalArrays.CategoricalValue{Bool, UInt32}, Union{}})
    @ MLJBase ~/.julia/packages/MLJBase/9Nkjh/src/resampling.jl:1448
 [22] fit_only!(mach::Machine{Resampler{Holdout}, false}; rows::Nothing, verbosity::Int64, force::Bool, composite::Nothing)
    @ MLJBase ~/.julia/packages/MLJBase/9Nkjh/src/machines.jl:680
 [23] #fit!#63
    @ ~/.julia/packages/MLJBase/9Nkjh/src/machines.jl:778 [inlined]
 [24] event!(metamodel::EvoTreeClassifier{EvoTrees.Softmax, Float32}, resampling_machine::Machine{Resampler{Holdout}, false}, verbosity::Int64, tuning::Grid, history::Nothing, state::NamedTuple{(:models, :fields, :parameter_scales, :models_delivered), Tuple{Vector{EvoTreeClassifier{EvoTrees.Softmax, Float32}}, Vector{Symbol}, Vector{Symbol}, Bool}})
    @ MLJTuning ~/.julia/packages/MLJTuning/ZFg3R/src/tuned_models.jl:436
 [25] #35
    @ ~/.julia/packages/MLJTuning/ZFg3R/src/tuned_models.jl:474 [inlined]
 [26] iterate
    @ ./generator.jl:47 [inlined]
 [27] _collect(c::Vector{EvoTreeClassifier{EvoTrees.Softmax, Float32}}, itr::Base.Generator{Vector{EvoTreeClassifier{EvoTrees.Softmax, Float32}}, MLJTuning.var"#35#36"{Machine{Resampler{Holdout}, false}, Int64, Grid, Nothing, NamedTuple{(:models, :fields, :parameter_scales, :models_delivered), Tuple{Vector{EvoTreeClassifier{EvoTrees.Softmax, Float32}}, Vector{Symbol}, Vector{Symbol}, Bool}}, ProgressMeter.Progress}}, #unused#::Base.EltypeUnknown, isz::Base.HasShape{1})
    @ Base ./array.jl:807
 [28] collect_similar
    @ ./array.jl:716 [inlined]
 [29] map
    @ ./abstractarray.jl:2933 [inlined]
 [30] assemble_events!(metamodels::Vector{EvoTreeClassifier{EvoTrees.Softmax, Float32}}, resampling_machine::Machine{Resampler{Holdout}, false}, verbosity::Int64, tuning::Grid, history::Nothing, state::NamedTuple{(:models, :fields, :parameter_scales, :models_delivered), Tuple{Vector{EvoTreeClassifier{EvoTrees.Softmax, Float32}}, Vector{Symbol}, Vector{Symbol}, Bool}}, acceleration::CPU1{Nothing})
    @ MLJTuning ~/.julia/packages/MLJTuning/ZFg3R/src/tuned_models.jl:473
 [31] build!(history::Nothing, n::Int64, tuning::Grid, model::EvoTreeClassifier{EvoTrees.Softmax, Float32}, model_buffer::Channel{Any}, state::NamedTuple{(:models, :fields, :parameter_scales, :models_delivered), Tuple{Vector{EvoTreeClassifier{EvoTrees.Softmax, Float32}}, Vector{Symbol}, Vector{Symbol}, Bool}}, verbosity::Int64, acceleration::CPU1{Nothing}, resampling_machine::Machine{Resampler{Holdout}, false})
    @ MLJTuning ~/.julia/packages/MLJTuning/ZFg3R/src/tuned_models.jl:667
 [32] fit(::MLJTuning.ProbabilisticTunedModel{Grid, EvoTreeClassifier{EvoTrees.Softmax, Float32}}, ::Int64, ::DataFrame, ::CategoricalArrays.CategoricalVector{Bool, UInt32, Bool, CategoricalArrays.CategoricalValue{Bool, UInt32}, Union{}})
    @ MLJTuning ~/.julia/packages/MLJTuning/ZFg3R/src/tuned_models.jl:747
 [33] top-level scope
    @ ~/Dev/TARGENE/TargetedEstimation/sandbox.jl:23

I've managed to reduce the example to the following, again let me know how to best share the dataset if you want to reproduce:

using CSV, DataFrames, MLJBase, EvoTrees
using StableRNGs

data = CSV.read("/Users/olivierlabayle/Downloads/pb_data.csv", DataFrame)
y = categorical(data.target)
X = data[!, Not(:target)]

train, test = MLJBase.train_test_pairs(Holdout(), 1:size(X, 1), X, y)[1]
rng = StableRNG(1)
model = EvoTreeClassifier(nrounds=100, lambda=1e-5, max_depth=7, rng=rng)
Xtrain, ytrain = MLJBase.reformat(model, selectrows(X, train), selectrows(y, train))
MLJBase.fit(model, 1, Xtrain, ytrain)

The issue arises because offset + length(is) at line 152 is larger than the length of out.
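
For illustration, a minimal sketch of how the failing view is formed (the sizes are taken from the error message above; the variables are otherwise hypothetical stand-ins for the internals):

out = Vector{UInt32}(undef, 335_997)    # buffer of observation indices
offset = 335_997                        # running offset accumulated over previous nodes
is = UInt32.(1:123)                     # indices assigned to the current node
# split_set_threads! requests view(out, offset+1:offset+length(is)), i.e.
# indices 335_998:336_120, beyond the end of out, hence the BoundsError.
view(out, offset + 1:offset + length(is))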

Thanks for raising this! Could you confirm the EvoTrees version you're using?
I suspect the bug is tied to the new row-sampling approach introduced in v0.14, but if it also occurs on v0.13, that would change the diagnosis.

Yes, this is with version v0.14.2. Out of curiosity, I've tried v0.13 with 100 different random seeds and can't reproduce the bug, so you are probably right!
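
A sketch of such a seed sweep, assuming the Xtrain/ytrain prepared in the reduced example above (the exact loop isn't shown here):

using StableRNGs, MLJBase, EvoTrees

for seed in 1:100
    rng = StableRNG(seed)
    model = EvoTreeClassifier(nrounds=100, lambda=1e-5, max_depth=7, rng=rng)
    try
        MLJBase.fit(model, 0, Xtrain, ytrain)   # verbosity 0 to keep logs quiet
    catch e
        e isa BoundsError || rethrow()
        @warn "BoundsError reproduced with seed $seed"
    end
end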

I tried various runs, including StableRNG with various seeds, but couldn't get any to generate a failure.
So if you're willing to share the data, that would be most helpful (jeremie.desgagne.bouchard @ gmail.com).

@olivierlabayle Could you test the current main branch?
I've pushed a fix that seems to resolve the bug you encountered.
I'll still need time to understand the root cause of the previous implementation's spurious bugs, but the current fix looks robust across all tests performed.
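
For reference, the unreleased fix on main can be installed with the standard Pkg workflow:

using Pkg
Pkg.add(url="https://github.com/Evovest/EvoTrees.jl", rev="main")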

Thanks, I confirm I can no longer reproduce the bug on this dataset with main.

Did you manage to find the origin of the problem?

Not yet! It's actually quite puzzling, as I've failed to reduce the issue to a simpler reproducible problem. I'm afraid it's unlikely someone will be willing to investigate based on a full EvoTrees training that just hits an issue at the 5th iteration.

That being said, I think there are some relevant cues to help continue the investigation. Notably, the bug appears to creep in in the update_gains! function.

  • If running experiments/debug-softmax-split-cpu up to

    @time m_evo = fit_evotree(params_evo; x_train, y_train, x_eval = x_train, y_eval = y_train, metric=:mlogloss, print_every_n = 1);

    the following lines

    @info "minimum(hR[3,:,:])" minimum(hR[3, :, :])
    @info "minimum(hR2[3,:,:])" minimum(hR2[3, :, :])

    print the original (faulty) and new (cumsum) minimum values for the weights in each bin. Both values are the same, as expected, throughout the first iteration, but start to diverge on the second iteration. This may indicate that something should be initialized differently between iterations, but I couldn't identify anything problematic (see the generic sketch after this list).

  • On the GPU side, the inclusion of the following else condition results in a failure to run the kernel:

    else
        gains[bin, j] = 0

    This condition isn't necessary for the algorithm, but the fact that it fails may be symptomatic of the issue that also affects the CPU side.
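
As a generic illustration (not the actual EvoTrees internals), the kind of Float32 accumulation drift that can make two mathematically equivalent histogram constructions disagree looks like this:

# Generic Float32 illustration: a right-side total derived as parent - left
# drifts from a directly accumulated total because the two large sums round
# differently. The same effect can compound across boosting iterations.
w = rand(Float32, 100_000)
parent = sum(w)
left   = sum(w[1:50_000])
right_sub    = parent - left            # subtraction-derived value
right_direct = sum(w[50_001:end])       # directly accumulated value
@show right_sub - right_direct          # typically small but nonzero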

My next step is to see whether I can reproduce the GPU-side failure in a MWE, so I can submit a relevant issue.

Let me know if there's something you'd like to investigate on your end.

Thank you for the feedback! I will try to investigate the original issue further, since I think I've just managed to trigger the same error on a similar dataset with v0.13.1. Since I don't know the internals, it might take me some time though.

Have you encountered any new issues? With v0.15.0, I've paid closer attention to numerical instabilities and found the new release to be reliable under all tested scenarios. I would therefore close this, given the significant revamp, unless there are still scenarios leading to crashes.

Sorry, it was just faster to move to XGBoost.jl. I think you can close this, and I'll try again later when I have time.