Evovest/EvoTrees.jl

Document how missing values should be handled by user

Closed this issue · 7 comments

I checked this part of your tutorial:

https://github.com/Evovest/EvoTrees.jl/blob/main/docs/src/tutorials/logistic-regression-titanic.md?plain=1#L34

and

https://github.com/Evovest/EvoTrees.jl/blob/main/docs/src/tutorials/logistic-regression-titanic.md?plain=1#L35

And it was not fully clear for me what is the recommended practice for both cases from the package maintainers.
I.e. what should be the canonical way to preprocess string variables and the canonical way to handle missing.
(for example in case of missing probably, if such a replacement as suggested in the docs is done another 0-1 feature indicating where a missing value was would be added to avoid loosing information).

Also, thank you for using DataFrames.jl :). From this perspective you could write (this is a mild suggestion):

transform!(df, :Sex => ByRow(!=("male")) => :Sex) # gives Bool column, but I guess it is OK for your package
transform!(df, :Age => (x -> coalesce.(x, mean(skipmissing(df.Age)))) => :Age) # this will be more efficient as it avoids computing the mean repeatedly

or maybe just simply:

df.Sex = df.Sex .!== "male"
df.Age = coalesce.(df.Age, mean(skipmissing(df.Age)))

See for the second performance point:

julia> x = rand([1.0, 2.0, missing], 10^4);

julia> f1(v) = ismissing(v) ? mean(skipmissing(x)) : v
f1 (generic function with 1 method)

julia> f2(x) = coalesce.(x, mean(skipmissing(x)))
f2 (generic function with 1 method)

julia> @time f1.(x);
  0.107405 seconds (20.01 k allocations: 390.969 KiB)

julia> @time f1.(x);
  0.110699 seconds (20.01 k allocations: 390.969 KiB)

julia> @time f2(x);
  0.000142 seconds (2 allocations: 78.172 KiB)

julia> @time f2(x);
  0.000124 seconds (2 allocations: 78.172 KiB)

Thank you.

I would consider adding the following details to the documentation (or changing the behavior of the package):

  • for features: currently if eltype of feature allows Missing then the feature is dropped; I recommend that you either document this (i.e. that the user needs to call e.g. disallowmissing function on the input) or that you actually do the check if there are any missing; this is a more general issue, e.g. if column eltype is Any (which sometimes happens in practice) users should be informed what to do (simplest is to do identity.(col) to force establishment of a more precise eltype)
  • you say in the docs "Tables API supports input that are either Real, Bool or Categorical" but Bool <: Real so it would be good to explain why Bool is explicitly mentioned (I assume it is handled differently internally?)
  • having said that if we have missing value for unordered categorical variable then I think you could allow missing in such a variable (and just treat it as a separate level); currently you drop it - this is a soft recommendation; also I would make a separate recommendation for case of missing in Categorical column (as then probably it is better to change missing to a separate level given your current implementation of the package) - the recommendation with a dummy variable is good for Real variables. (I hope it is clear what I want to say here 😄)
  • the requirement you specified applies to FEATURES. For target actually eltype is not checked, but values are checked (still missing is disallowed). I would make it more precise what input is assumed for target input.

@jeremiedb - any thoughts on this? I would make a blogpost about updates of EvoTrees after you decide what to do with this issue and tag a release. Hopefully it could help promote this excellent package.

For now my take would be to add a "Missing data" section in the docs (along the Reproducibility one) that clarifies the behavior of the algo. I'm for now reluctant to perform further transformations to the input data or make any assumption of what the intent of the user would have been. My perspetive is for ML algos to be limited to the algo part, while the handling of missings and the likes to be handled by the preprocessing part, which I conceive as a topic of its own within a modeling pipeline. So I'd prefer to direct users to MLJ, TableTransforms or of self-defined preprocessing.
I agree though on importance add clarification on the supported feature and column eltypes. I'll look to have those docs updated within 2-3 days.

Sure - if docs are precise what is done in algo and what has to be done in pre-processing this is also OK.

Let me know if you think the above PR provides satisfying clarificationson the handling of missings: https://evovest.github.io/EvoTrees.jl/dev/#Missing-values

Looks good. Thank you!