Evovest/EvoTrees.jl

Getting a warning when using multiple features

affans opened this issue · 3 comments

using EvoTrees
using EvoTrees: sigmoid, logit
using MLJBase
using RDatasets

iris = dataset("datasets", "iris")
iris[!, :is_setosa] = iris[!, :Species] .== "setosa"
features = setdiff(names(iris), ["Species", "is_setosa"])

Y, X, _ = unpack(iris, ==(:is_setosa), in(Symbol.(features)), colname -> true)
train, test = partition(eachindex(Y), 0.7, shuffle=true); # 70:30 split
tree_model = EvoTreeClassifier(
    loss=:linear, metric=:mse,
    nrounds=100, nbins = 100,
    λ = 0.5, γ=0.1, η=0.1,
    max_depth = 6, min_weight = 1.0,
    rowsample=0.5, colsample=1.0)
mach = machine(tree_model, X, Y)

results in this warning

Warning: The number and/or types of data arguments do not match what the specified model supports. Suppress this type check by specifying `scitype_check_level=0`.
│ 
│ Run `@doc EvoTreeClassifier{Float64, EvoTrees.Softmax, Int64}` to learn more about your model's requirements.
│ Commonly, but non exclusively, supervised models are constructed using the syntax `machine(model, X, y)` or `machine(model, X, y, w)` while most other models are constructed with `machine(model, X)`. Here `X` are features, `y` a target, and `w` sample or class weights.

Why could this be happening?

The warning refers to the constraint that, when a classification model is used through the MLJ interface, the target variable must satisfy target = AbstractVector{<:MLJModelInterface.Finite}.

To meet this requirement, you can use the categorical function to convert the target into the expected format: Y = categorical(Y)
Also note that for a classifier the loss should be softmax, not linear, although this loss argument can be ignored since the EvoTreeClassifier constructor enforces the use of the softmax loss. Likewise, the metric should be mlogloss (rather than mse).
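You can check directly how the conversion with categorical changes the scientific type of the target; a minimal sketch (assuming MLJBase and CategoricalArrays are installed):

```julia
using MLJBase          # re-exports scitype from ScientificTypes
using CategoricalArrays

y_raw = [true, false, true]
scitype(y_raw)              # AbstractVector{Count} — rejected by EvoTreeClassifier

y_cat = categorical(y_raw)
scitype(y_cat)              # AbstractVector{Multiclass{2}} — a Finite scitype, accepted
```

Only the categorical version satisfies the AbstractVector{<:Finite} requirement, which is why the conversion makes the warning go away.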

To sum up, the following should work as intended:

using EvoTrees
using EvoTrees: sigmoid, logit
using MLJBase
using RDatasets

iris = dataset("datasets", "iris")
iris[!, :is_setosa] = iris[!, :Species] .== "setosa"
features = setdiff(names(iris), ["Species", "is_setosa"])

Y, X, _ = unpack(iris, ==(:is_setosa), in(Symbol.(features)), colname -> true)
Y_cat = categorical(Y)

train, test = partition(eachindex(Y), 0.7, shuffle=true); # 70:30 split

tree_model = EvoTreeClassifier(
    loss=:softmax, metric=:mlogloss,
    nrounds=100, nbins = 100,
    λ = 0.5, γ=0.1, η=0.1,
    max_depth = 6, min_weight = 1.0,
    rowsample=0.5, colsample=1.0, rng=123)

mach = machine(tree_model, X, Y_cat)
fit!(mach, rows=train, verbosity=1)

# predict on train data
pred_prob = predict(mach, X)
pred_mode = predict_mode(mach, X)

What's the difference between predict and predict_mode?

I think if you run @jeremiedb's code you will see the difference.

In MLJ, a model that can make probabilistic predictions does so by default. So, for EvoTreeClassifier, for example, predict will output a vector yhat of UnivariateFinite objects. To get actual point predictions (classes) you can call predict_mode(mach, Xnew), which is the same as mode.(predict(mach, Xnew)).
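To make the distinction concrete, here is a small, hypothetical sketch of what a single element of predict's output looks like, built by hand with MLJBase's UnivariateFinite (the class labels and probabilities are invented for illustration):

```julia
using MLJBase  # provides UnivariateFinite, mode, pdf

# One probabilistic prediction over two classes, as predict would return per row
d = UnivariateFinite(["not_setosa", "setosa"], [0.2, 0.8], pool=missing)

mode(d)           # the most probable class, i.e. what predict_mode returns
pdf(d, "setosa")  # the probability assigned to a given class: 0.8
```

So predict gives you the full per-class distribution (useful for thresholds, log-loss, calibration), while predict_mode collapses each distribution to its most probable class.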

(For future reference, this is probably a question for MLJ rather than EvoTrees.jl).