JuliaAI/ScientificTypes.jl

Either fix `scitype_union` or provide another method

Closed this issue · 7 comments

Context: JuliaAI/MLJBase.jl#111

Desired

X, y = @load_crabs; y = coerce(y, OrderedFactor);
give_the_element_scitype(y) == OrderedFactor{2}

Existing

scitype(y) == AbstractArray{OrderedFactor{2}, 1}
scitype_union(y) == Multiclass{2}

(is the second one a bug?)

Possible implementation

_get_elscitype(st::Type{AbstractArray{T,N}}) where {T,N} = T
elscitype(X::AbstractArray) = scitype(X) |> _get_elscitype

(can be extended easily to tables etc)

Result

julia> elscitype(randn(5,5))
Continuous

julia> elscitype([0, 1, missing, 1])
Union{Missing, Count}

julia> elscitype(y)
OrderedFactor{2}

The issue here is y is essentially corrupted. The elements are indeed Multiclass, because that's what the pool of each element says, but the wrapper type (CategoricalArray) - which has its own pool, is saying the type is OrderedFactor. The source of this bug is most likely the new "magic" introduced in CategoricalArrays to automatically construct a CategoricalArray wrapper when a broadcasting operation generates a vector of categorical elements.

I will investigate further.

julia> y[1].pool.ordered
false

julia> y.pool.ordered
true

This was never an issue before the performance improvements because we never looked at the wrapper type before then.

Okay, I would regard this as a bug in CategoricalArrays:

julia> v=categorical([:x, :y, :x])
3-element CategoricalArray{Symbol,1,UInt32}:
 :x
 :y
 :x

julia> vordered = categorical(v, ordered=true)
3-element CategoricalArray{Symbol,1,UInt32}:
 :x
 :y
 :x

julia> vordered[1].pool.ordered
false

Ok no so the merge does NOT fix the issue, example

using RDatasets
smarket = dataset("ISLR", "Smarket")
y = smarket.Direction
y = coerce(y, OrderedFactor)

then

julia> scitype(y)
AbstractArray{OrderedFactor{2},1}

julia> scitype_union(y)
Multiclass{2}

I should have tested this more extensively.

It looks like my request for elscitype still holds then...

Ongoing investigation

julia> y = smarket.Direction;

julia> yc = coerce(y, OrderedFactor);

julia> [scitype(el) for el in yc]
1250-element Array{DataType,1}:
 Multiclass{2}
 Multiclass{2}
 Multiclass{2}

Right... so it's an issue with scitype of an element...

julia> scitype(yc[1])
Multiclass{2}

julia> scitype(yc)
AbstractArray{OrderedFactor{2},1}

yeah....

Ok so it seems that it's the ordered! function that craps itself as noted in JuliaData/CategoricalArrays.jl#226

I've added an extra hack that forces the pool to be marked as ordered... let's hope for something better from CA soon.

With:

julia> y1 = smarket.Direction[1:10];

julia> y2 = coerce(y1, OrderedFactor);

julia> scitype_union(y2)
OrderedFactor{2}

As a side note, I still think a elscitype would be better than scitype_union which goes over all elements, while this may be warranted in some cases, in general we don't care, we just want the type that appears in scitype so the:

get_elscitype(st::Type{AbstractArray{T,N}}) where {T,N} = T
elscitype(X::AbstractArray) = scitype(X) |> _get_elscitype

that I suggested earlier, I'm opening another PR with that as I'd prefer to use that in confusion_matrix for instance, it's bound to be faster...

The "new" issue is resolved by CategoricalArrays 0.7.3:

using RDatasets
y = smarket.Direction
y = coerce(y, OrderedFactor)

julia> scitype(y)
AbstractArray{OrderedFactor{2},1}

julia> scitype_union(y)
OrderedFactor{2}

Closing as resolved.

Will comment on elscitype suggestion at #58.