JuliaAI/ScientificTypes.jl

`scitype(X)` is slow for large tables

Closed this issue · 6 comments

julia> using ScientificTypes, Tables

julia> Xsmall = Tables.table(rand(100, 10));

julia> Xbig = Tables.table(rand(10000, 10));j
julia> @time scitype(Xsmall);

julia> scitype(Xsmall);
julia> @time scitype(Xsmall)
  0.004922 seconds (48 allocations: 9.594 KiB)
Table{AbstractArray{Continuous,1}}

julia> @time scitype(Xbig)
  0.279894 seconds (58 allocations: 782.875 KiB)
Table{AbstractArray{Continuous,1}}

Reason for slow down

By definition, scitype(::AbstractArray) = AbstractArray{T} where T is the union of element scitypes. Computing the individual element scitypes and taking the union is o(N) where N is the number of elements. In most cases the eltype of the array already determines what this union should be, but, under the mlj convention, the scitype of a CategoricalValue is not inferable from the machine type of the object, because the type does not say whether it is ordered or not (see JuliaData/CategoricalArrays.jl#184 ).

Suggestion

We could, in the mlj convention, overload scitype(::<:AbstractArray{T}) to infer the scitype from T for those particular T we know actually do determine the union scitype (basically, everything that is not a subtype of CategoricalValue or CategoricalString). Not sure we can do much for these troublesome types, however.

More radical, but motivated by unresolved issue JuliaData/CategoricalArrays.jl#199, is to abandon CategoricalArrays altogether and write our own categorical element package where order is a type parameter (and levels/pools are immutable).

Any other suggestions?

It should be easy to add an scitype(A::CategoricalArray, ::Val) method for efficiency, instead of calling reduce. (BTW, even for general AbstractArray{T}, you could imagine adding a fast path for when isconcrete(T) and the scientific type is determined by the type and doesn't depend on actual values.)

Even if ordered was a type parameter, wouldn't the scientific type depend on the number of levels anyway?

@nalimilan Yes, that's the way to go.

Even if ordered was a type parameter, wouldn't the scientific type depend on the number of levels anyway?

Yes. But in an "immutable" vs of CategoricalArrays I would make that a type parameter too :-)

Yes. But in an "immutable" vs of CategoricalArrays I would make that a type parameter too :-)

Yeah, but after very long discussions CategoricalArrays was designed not to hardcode the number of levels, as in most situations that's not a good idea for performance (e.g. need to recompile all functions for each number of levels). One should probably use an enum when levels are hardcoded like that.

Resolved by PR #23:

julia> Xsmall = Tables.table(rand(100, 10));

julia> Xbig = Tables.table(rand(10000, 10));

julia> 

julia> @time scitype(Xsmall);
  0.000117 seconds (18 allocations: 9.125 KiB)

julia> 

julia> scitype(Xsmall);

julia> @time scitype(Xsmall)
  0.000123 seconds (18 allocations: 9.125 KiB)
Table{AbstractArray{Continuous,1}}

julia> @time scitype(Xbig)
  0.000766 seconds (28 allocations: 782.406 KiB)
Table{AbstractArray{Continuous,1}}

Re #mlj slack discussion https://julialang.slack.com/archives/CC57ZE7EY/p1569005060012000 and this issue JuliaAI/MLJModels.jl#63:

julia> @time X = MLJBase.table(randn(5_000, 50));
  0.002304 seconds (370 allocations: 1.929 MiB)

julia> @time m = machine(Standardizer(), X)
  0.001994 seconds (145 allocations: 1.913 MiB)
Machine{Standardizer} @ 774


julia> @time fit!(m)
[ Info: Training Machine{Standardizer} @ 774.
  0.005042 seconds (425 allocations: 3.845 MiB)
Machine{Standardizer} @ 774


julia> @time transform(m, X);
  0.089879 seconds (5.87 k allocations: 97.647 MiB, 20.16% gc time)

So more sensible.

Fantastic!