JuliaAI/ScientificTypes.jl

Suggestion to simplify implemenation of scitype

ablaom opened this issue · 5 comments

I have often lamented the fact that scitype cannot be a map of machine type to type, instead of object to type, because of the infamous CategoricalValue fly in the ointment. As a workaround to performance problems with arrays, we introduced Scitype, which is a map from type to type. Wouldn't if be simpler if implementing Scitype is the "fallback" responsibility of a convention, and that we only overload scitype for problematic cases like CategoricalValue? What am I missing here?

So, something like (ignoring convention distinctions):

# fallback:
scitype(X) = Scitype(typeof(X))

Scitype(::Type) = Unknown
Scitype(::Type{<:Integer}) = Count
# and so forth

# exceptions:
function scitype(X::CategoricalValue)
    N = length(pool(X))
    if isordered(X)
       return OrderedFactor{N}
    end
    return Multiclass{N}
end

To be clear, I'm not suggesting a change in the definition of scitype, only how it is implemented, although Scitype is something we may want to make part of the public interface.

What got me thinking about this is the case of parametric types like Sampleable{S} and a type I'd like to introduce, called Iterator{S} for lazy loaded data structures. Here S is the scitype of the objects sampled, or the scitype of the objects iterated, respectively. How to we implement scitypes for these? This is tricky because we may not have an object from which to extract the parameter S, only its machine type. So in this case we are limited to using Scitype.

Thoughts @OkonSamuel @tlienart

Of course, this comment does not address the other "fly in the ointment" which is tables, requiring trait dispatch.

I'm probably not getting this suggestion 100% in terms of how it is different from what is there now sorry, could you maybe show what it would give in the case of Sampleable{S} before and after your suggestion?

I have also been thinking down this line and I like this idea. It makes the codebase simpler.

So I guess this would imply that scitype(Any[1,2,3]) should return Array{Unknown, 1} since Scitype would operate on the type information.
This is would be more efficient than the current implementation which peeks into the type of each element in the array at runtime.

Well, no, I'm not suggesting we change the definition of scitype for arrays (see Property 3) - only the implementation. According to the definition, we will need to look inside to determine the scitype in the case of Any eltype, and any other eltype for which Scitype(T) returns Unknown.

We could change the definition, but this would pose some problems. For example, if scitype(Any[1.0, 2.0, 3.0]) = AbstractVector{Unknown}, then it forces users to tighten their element types when this is not strictly necessary to use the algorithms (an objection raised by @tlienart, if I remember correctly). Also, what should the new definition be that still allows us to recover scitype(categorical(1, 2, 3)) = AbstractVector{Multiclass{3}}?

Perhaps you have a simple replacement for Property 3 that allows us to get everything we want?