Either fix `scitype_union` or provide another method
Closed this issue · 7 comments
Context: JuliaAI/MLJBase.jl#111
Desired
X, y = @load_crabs; y = coerce(y, OrderedFactor);
give_the_element_scitype(y) == OrderedFactor{2}
Existing
scitype(y) == AbstractArray{OrderedFactor{2}, 1}
scitype_union(y) == Multiclass{2}
(is the second one a bug?)
Possible implementation
_get_elscitype(st::Type{AbstractArray{T,N}}) where {T,N} = T
elscitype(X::AbstractArray) = scitype(X) |> _get_elscitype
(can be extended easily to tables etc)
Result
julia> elscitype(randn(5,5))
Continuous
julia> elscitype([0, 1, missing, 1])
Union{Missing, Count}
julia> elscitype(y)
OrderedFactor{2}
The issue here is y
is essentially corrupted. The elements are indeed Multiclass
, because that's what the pool of each element says, but the wrapper type (CategoricalArray) - which has its own pool, is saying the type is OrderedFactor
. The source of this bug is most likely the new "magic" introduced in CategoricalArrays to automatically construct a CategoricalArray wrapper when a broadcasting operation generates a vector of categorical elements.
I will investigate further.
julia> y[1].pool.ordered
false
julia> y.pool.ordered
true
This was never an issue before the performance improvements because we never looked at the wrapper type before then.
Okay, I would regard this as a bug in CategoricalArrays:
julia> v=categorical([:x, :y, :x])
3-element CategoricalArray{Symbol,1,UInt32}:
:x
:y
:x
julia> vordered = categorical(v, ordered=true)
3-element CategoricalArray{Symbol,1,UInt32}:
:x
:y
:x
julia> vordered[1].pool.ordered
false
Ok no so the merge does NOT fix the issue, example
using RDatasets
smarket = dataset("ISLR", "Smarket")
y = smarket.Direction
y = coerce(y, OrderedFactor)
then
julia> scitype(y)
AbstractArray{OrderedFactor{2},1}
julia> scitype_union(y)
Multiclass{2}
I should have tested this more extensively.
It looks like my request for elscitype
still holds then...
Ongoing investigation
julia> y = smarket.Direction;
julia> yc = coerce(y, OrderedFactor);
julia> [scitype(el) for el in yc]
1250-element Array{DataType,1}:
Multiclass{2}
Multiclass{2}
Multiclass{2}
Right... so it's an issue with scitype
of an element...
julia> scitype(yc[1])
Multiclass{2}
julia> scitype(yc)
AbstractArray{OrderedFactor{2},1}
yeah....
Ok so it seems that it's the ordered!
function that craps itself as noted in JuliaData/CategoricalArrays.jl#226
I've added an extra hack that forces the pool to be marked as ordered... let's hope for something better from CA soon.
With:
julia> y1 = smarket.Direction[1:10];
julia> y2 = coerce(y1, OrderedFactor);
julia> scitype_union(y2)
OrderedFactor{2}
As a side note, I still think a elscitype
would be better than scitype_union
which goes over all elements, while this may be warranted in some cases, in general we don't care, we just want the type that appears in scitype
so the:
get_elscitype(st::Type{AbstractArray{T,N}}) where {T,N} = T
elscitype(X::AbstractArray) = scitype(X) |> _get_elscitype
that I suggested earlier, I'm opening another PR with that as I'd prefer to use that in confusion_matrix
for instance, it's bound to be faster...
The "new" issue is resolved by CategoricalArrays 0.7.3:
using RDatasets
y = smarket.Direction
y = coerce(y, OrderedFactor)
julia> scitype(y)
AbstractArray{OrderedFactor{2},1}
julia> scitype_union(y)
OrderedFactor{2}