Does not handle NaN

Question

Closed this issue a year ago · 4 comments

Hi

I am trying out EvoTrees for a binary classification task, as below. It turns out, however, that it does not support NaNs. Is there a reason why it doesn't? I currently use XGBoost for my model, and it handles NaNs. The primary issue is that I have missing data in my dataset, and some of the features are NaN for some observations (correctly so).

using EvoTrees

# Logistic loss on a regressor config, used here for binary classification
config = EvoTreeRegressor(
    loss = :logistic,
    metric = :logloss,
    nrounds = 100,
    nbins = 32,
    lambda = 0.5,
    gamma = 0.1,
    eta = 0.1,
    max_depth = 6,
    rowsample = 0.5,
    colsample = 1.0)

model = fit_evotree(config; x_train = X, y_train = Y, x_eval = testX, y_eval = testY, print_every_n = 25)

Is there a plan for EvoTrees to handle NaNs? I am also curious how others handle NaNs: imputing with the mean/median? Those approaches won't work for me, I am afraid. Is there anything else I can try?

ta!

Answer 1 · 2023-03-31T02:07:53.000Z

Support for NAs/missings wasn't on my radar. Typically, imputing works (mean/median, or min/max), or otherwise creating an indicator variable [0, 1] that flags whether the value is missing. Is there a reason such methods wouldn't be applicable to your situation?
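
As a rough illustration of the imputation options mentioned above, here is a minimal sketch using only Base Julia and the `Statistics` standard library. The helper name `impute_nan` is hypothetical and not part of EvoTrees; it assumes the feature is a plain floating-point vector with `NaN` marking missing entries.

```julia
using Statistics  # standard library: mean, median

# Hypothetical helper (not an EvoTrees function): replace NaN entries with a
# fill value computed by `stat` from the non-NaN entries.
function impute_nan(v::AbstractVector{<:AbstractFloat}, stat = mean)
    observed = filter(!isnan, v)
    isempty(observed) && return copy(v)   # no observed values to derive a fill from
    fill_value = stat(observed)
    return [isnan(x) ? fill_value : x for x in v]
end

v = [NaN, 1.0, 2.0, 6.0]
v_mean   = impute_nan(v, mean)      # NaN replaced by the mean of the observed values
v_median = impute_nan(v, median)    # NaN replaced by the median
v_min    = impute_nan(v, minimum)   # NaN replaced by the minimum
```

Which statistic is appropriate depends on the data; the function only shows the mechanics.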

Answer 2 · 2023-03-31T10:37:10.000Z

Imputing wouldn’t work. I have NaNs or missing in instances where I don’t have enough data to calculate features. For instance, the first few features would be NaN or missing.

I am afraid I am not familiar with indicator variables. Can you please elaborate? My features are continuous values.

I hope we can come up with a solution for this. I am quite impressed with the performance and am looking forward to using this package.

Thanks!

Answer 3 · 2023-04-01T04:57:26.000Z

By indicator variable, I mean something similar to:

julia> using DataFrames

julia> df = DataFrame(v1 = [missing, rand(3)...])
4×1 DataFrame
 Row │ v1
     │ Float64?
─────┼─────────────────
   1 │ missing
   2 │       0.777368
   3 │       0.0461273
   4 │       0.71682

julia> transform!(df, "v1" => ByRow(ismissing) => "v1_flag")
4×2 DataFrame
 Row │ v1               v1_flag
     │ Float64?         Bool
─────┼──────────────────────────
   1 │ missing             true
   2 │       0.777368     false
   3 │       0.0461273    false
   4 │       0.71682      false

Then you need to impute the original variable; the fill value could be 0, the mean, or another relevant value:

julia> transform!(df, "v1" => ByRow(x -> ismissing(x) ? 0.0 : x) => "v1")
4×2 DataFrame
 Row │ v1         v1_flag
     │ Float64    Bool
─────┼────────────────────
   1 │ 0.0           true
   2 │ 0.777368     false
   3 │ 0.0461273    false
   4 │ 0.71682      false

This kind of approach should cover most use cases.
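
The same idea can be applied to a plain feature matrix that uses `NaN` rather than `missing`, as in the original question. The following is only a sketch in Base Julia: `add_nan_flags` and its `fill_value` keyword are hypothetical names, not part of EvoTrees or DataFrames.

```julia
# Hypothetical helper: for every column of X containing NaN, append a 0/1 flag
# column, and replace the NaNs in the original columns with `fill_value`.
function add_nan_flags(X::AbstractMatrix{<:AbstractFloat}; fill_value = 0.0)
    nan_mask = isnan.(X)
    cols = findall(vec(any(nan_mask; dims = 1)))   # columns with at least one NaN
    X_filled = copy(X)
    X_filled[nan_mask] .= fill_value
    flags = eltype(X).(nan_mask[:, cols])          # Bool -> 0.0 / 1.0
    return hcat(X_filled, flags), cols
end

# Row 1 has a NaN in column 1; row 3 holds a genuine 0.0 in the same column.
X = [NaN 1.0; 0.5 2.0; 0.0 3.0]
X_aug, flagged = add_nan_flags(X)
```

Here the imputed `0.0` in row 1 and the genuine `0.0` in row 3 differ in the appended flag column, so a tree can still split on whether the value was missing. The augmented matrix would then be used in place of the original feature matrix for both training and evaluation data.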

Answer 4 · 2023-04-01T13:18:33.000Z

Ahh, I understand. Thanks for this. I will give it a shot, although I will have to think more about the replacement values. I am working with financial data, so 0s are valid values, and imputing with the mean, median, etc. would be misleading.

thanks
Roh