Does not handle NaN
Closed this issue · 4 comments
Hi
I am trying out EvoTrees for a binary classification task, as shown below. It turns out, however, that it does not support NaNs. Is there a reason why it doesn't? I currently use XGBoost for my model and it handles NaNs. The primary issue is that I have missing data in my dataset, and some of the features are NaN for some rows (correctly so).
config = EvoTreeRegressor(
    loss = :logistic,
    metric = :logloss,
    nrounds = 100,
    nbins = 32,
    lambda = 0.5,
    gamma = 0.1,
    eta = 0.1,
    max_depth = 6,
    rowsample = 0.5,
    colsample = 1.0)
model = fit_evotree(config; x_train=X, y_train=Y, x_eval=testX, y_eval=testY, print_every_n=25)
Is there a plan for EvoTrees to handle NaNs? I am also curious how others handle NaNs: imputing with the mean or median? Those approaches won't work for me, I am afraid. Is there anything else I can try?
ta!
It wasn't on my radar to support NAs/missings. Typically, imputing would work (mean/median, or min/max), or otherwise creating an indicator variable (0 or 1) that marks whether the value is missing. Is there any reason why such a method wouldn't be applicable to your situation?
Imputing wouldn’t work. I have NaNs or missing in instances where I don’t have enough data to calculate features. For instance, the first few features would be NaN or missing.
I am afraid I am not familiar with indicator variables. Can you please elaborate? My features are continuous values.
I hope we can come up with a solution for this. I am quite impressed with the performance and am looking forward to using this package.
Thanks!
By indicator variable, I mean something similar to:
julia> df = DataFrame(v1 = [missing, rand(3)...])
4×1 DataFrame
 Row │ v1
     │ Float64?
─────┼─────────────────
   1 │ missing
   2 │       0.777368
   3 │       0.0461273
   4 │       0.71682
julia> transform!(df, "v1" => ByRow(ismissing) => "v1_flag")
4×2 DataFrame
 Row │ v1               v1_flag
     │ Float64?         Bool
─────┼──────────────────────────
   1 │ missing             true
   2 │       0.777368     false
   3 │       0.0461273    false
   4 │       0.71682      false
Then you need to impute the original variable. The replacement could be 0, the mean, or another relevant value:
julia> transform!(df, "v1" => ByRow(x -> ismissing(x) ? 0.0 : x) => "v1")
4×2 DataFrame
 Row │ v1         v1_flag
     │ Float64    Bool
─────┼────────────────────
   1 │ 0.0           true
   2 │ 0.777368     false
   3 │ 0.0461273    false
   4 │ 0.71682      false
This kind of approach should cover most use cases.
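Since the original question concerns NaN values in a plain feature matrix rather than `missing` values in a DataFrame, the same flag-then-impute idea can be sketched with Base Julia only. This is a minimal sketch, not part of EvoTrees or DataFrames: `add_nan_flags`, `fill_value`, and `cols` are hypothetical names chosen for illustration, and the example matrix is made up.

```julia
# Hypothetical helper (not an EvoTrees or DataFrames function).
# For each column listed in `cols`, append a 0/1 indicator column marking the NaN rows,
# then replace every NaN in X with `fill_value`.
# By default, `cols` is the set of columns of X that contain at least one NaN.
function add_nan_flags(X::AbstractMatrix{<:AbstractFloat};
                       fill_value = 0.0,
                       cols = findall(vec(any(isnan, X; dims = 1))))
    nan_mask = isnan.(X)                      # Bool matrix, true where X is NaN
    X_filled = copy(X)                        # leave the caller's matrix untouched
    X_filled[nan_mask] .= fill_value          # impute the NaN entries
    flags = eltype(X).(nan_mask[:, cols])     # indicator columns as 0.0 / 1.0
    return hcat(X_filled, flags), cols
end

# Made-up data: row 1 has a NaN in column 1, row 3 has a genuine 0.0 in column 1.
X = [NaN 1.0;
     0.5 2.0;
     0.0 NaN;
     0.7 4.0]

X_aug, flag_cols = add_nan_flags(X; fill_value = 0.0)
```

In the augmented matrix, the imputed 0.0 in row 1 and the genuine 0.0 in row 3 share the same value in column 1 but differ in the indicator column, so a tree can split on the flag to separate them. When preparing an evaluation set, passing `cols = flag_cols` from the training call keeps the column layout identical between the two matrices.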
Ahh, I understand. Thanks for this. I will give it a shot, although I will have to think more about the replacement values. I am working with financial data, so 0s are valid values, and imputing with the mean, median, etc. would be misleading.
Thanks
Roh