`scitype(X)` is slow for large tables
Closed this issue · 6 comments
julia> using ScientificTypes, Tables
julia> Xsmall = Tables.table(rand(100, 10));
julia> Xbig = Tables.table(rand(10000, 10));j
julia> @time scitype(Xsmall);
julia> scitype(Xsmall);
julia> @time scitype(Xsmall)
0.004922 seconds (48 allocations: 9.594 KiB)
Table{AbstractArray{Continuous,1}}
julia> @time scitype(Xbig)
0.279894 seconds (58 allocations: 782.875 KiB)
Table{AbstractArray{Continuous,1}}
Reason for slow down
By definition, scitype(::AbstractArray) = AbstractArray{T}
where T
is the union of element scitypes. Computing the individual element scitypes and taking the union is o(N)
where N
is the number of elements. In most cases the eltype of the array already determines what this union should be, but, under the mlj convention, the scitype of a CategoricalValue
is not inferable from the machine type of the object, because the type does not say whether it is ordered or not (see JuliaData/CategoricalArrays.jl#184 ).
Suggestion
We could, in the mlj convention, overload scitype(::<:AbstractArray{T})
to infer the scitype from T
for those particular T
we know actually do determine the union scitype (basically, everything that is not a subtype of CategoricalValue
or CategoricalString
). Not sure we can do much for these troublesome types, however.
More radical, but motivated by unresolved issue JuliaData/CategoricalArrays.jl#199, is to abandon CategoricalArrays
altogether and write our own categorical element package where order
is a type parameter (and levels/pools are immutable).
Any other suggestions?
It should be easy to add an scitype(A::CategoricalArray, ::Val)
method for efficiency, instead of calling reduce
. (BTW, even for general AbstractArray{T}
, you could imagine adding a fast path for when isconcrete(T)
and the scientific type is determined by the type and doesn't depend on actual values.)
Even if ordered
was a type parameter, wouldn't the scientific type depend on the number of levels anyway?
@nalimilan Yes, that's the way to go.
Even if ordered was a type parameter, wouldn't the scientific type depend on the number of levels anyway?
Yes. But in an "immutable" vs of CategoricalArrays I would make that a type parameter too :-)
Yes. But in an "immutable" vs of CategoricalArrays I would make that a type parameter too :-)
Yeah, but after very long discussions CategoricalArrays was designed not to hardcode the number of levels, as in most situations that's not a good idea for performance (e.g. need to recompile all functions for each number of levels). One should probably use an enum when levels are hardcoded like that.
Resolved by PR #23:
julia> Xsmall = Tables.table(rand(100, 10));
julia> Xbig = Tables.table(rand(10000, 10));
julia>
julia> @time scitype(Xsmall);
0.000117 seconds (18 allocations: 9.125 KiB)
julia>
julia> scitype(Xsmall);
julia> @time scitype(Xsmall)
0.000123 seconds (18 allocations: 9.125 KiB)
Table{AbstractArray{Continuous,1}}
julia> @time scitype(Xbig)
0.000766 seconds (28 allocations: 782.406 KiB)
Table{AbstractArray{Continuous,1}}
Re #mlj slack discussion https://julialang.slack.com/archives/CC57ZE7EY/p1569005060012000 and this issue JuliaAI/MLJModels.jl#63:
julia> @time X = MLJBase.table(randn(5_000, 50));
0.002304 seconds (370 allocations: 1.929 MiB)
julia> @time m = machine(Standardizer(), X)
0.001994 seconds (145 allocations: 1.913 MiB)
Machine{Standardizer} @ 7…74
julia> @time fit!(m)
[ Info: Training Machine{Standardizer} @ 7…74.
0.005042 seconds (425 allocations: 3.845 MiB)
Machine{Standardizer} @ 7…74
julia> @time transform(m, X);
0.089879 seconds (5.87 k allocations: 97.647 MiB, 20.16% gc time)
So more sensible.
Fantastic!