Coercion/Scitype_union are too slow
When people report slowdowns with `machine` (see for instance JuliaAI/MLJ.jl#122, also recently on Slack), it's usually a problem with `scitype_union`. Usually the users don't call `coerce` beforehand, which can be the source of the issue. However, when `coerce` is suggested, that turns out to be slow as well.
Take for instance:
using ScientificTypes, DataFrames
function foo()
r = rand()
r < 0.2 && return "AAA"
r < 0.4 && return "BBB"
r < 0.6 && return "CCC"
r < 0.8 && return "DDD"
return "EEE"
end
v1 = [foo() for i in 1:1_000_000];
v2 = [foo() for i in 1:1_000_000];
df = DataFrame((x1=v1,x2=v2));
@time coerce(df, :x1=>Multiclass)
This takes around 30s on my machine. However,
@time categorical(v1);
takes half a second.
So there must be a way to speed things up.
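As a stopgap, the column can be converted directly with `categorical`, bypassing `coerce`. This is only a sketch: it assumes that a categorical array is an acceptable representation for a `Multiclass` column (my understanding, not confirmed in this thread) and that CategoricalArrays is loaded explicitly. The vector here is a smaller stand-in for `v1` above.

```julia
using DataFrames, CategoricalArrays

# Smaller stand-in for the string column in the example above (illustrative data)
v = rand(["AAA", "BBB", "CCC", "DDD", "EEE"], 10_000)
df = DataFrame(x1 = v)

# Replace the string column by a categorical one, without going through coerce
df.x1 = categorical(df.x1)
```

This does not fix the underlying slowness in `coerce`; it only avoids that code path for the column in question.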
Here's another MWE, kindly provided by @nilshg. Computing the schema of that dataframe takes minutes.
using DataFrames, Distributions, Random  # LogNormal is from Distributions, randstring from Random
n = 1_115_000
df = DataFrame(
rs = rand(["A", "B", "C", "D", "E", "F", "G"], n),
es = rand(["1","3","5","7","8","9"], n),
inc = rand([collect(5000:15000); [missing for _ ∈ 1:50]], n),
gr = rand([randstring(6) for _ in 1:400], n),
year = rand(2012:2017, n),
lt = rand([12, 24, 36, 48, 60, 72], n),
pb = Int.(round.(rand(LogNormal(8.07, 1.116), n))),
cb = rand(LogNormal(7.88, 1.288), n),
ci = rand(0.03:0.01:0.17, n),
db = rand([repeat(collect(1945:2000),100); missing], n),
ms = rand(["1", "2", "NA"], n),
ho = BitArray(rand([true, false], n)),
el = Int.(round.(rand(LogNormal(1.0, 0.72), n))),
fl = rand(n),
ft = rand(n),
ea = BitArray(rand([true, false], n))
)
Actually two problems here:
- No performant version of `scitype(::AbstractArray{<:AbstractString})`. @tlienart The fix for this is actually documented: look under "Performance note" at https://alan-turing-institute.github.io/ScientificTypes.jl/stable/#Detailed-usage-examples-1.
- An errant `scitype_union` in the definition of `schema`. We can just use `elscitype` here (as I believe you already suggested).
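To illustrate why the two points above matter, here is a toy sketch of the difference between folding a union over every element and dispatching on the array's element type. All `Toy*` types and `toy_*` functions are hypothetical stand-ins written for this comment; they are not the ScientificTypes.jl definitions.

```julia
# Hypothetical stand-ins for scientific types (NOT the real ScientificTypes.jl types)
abstract type ToyFound end
struct ToyTextual <: ToyFound end
struct ToyCount <: ToyFound end
struct ToyContinuous <: ToyFound end

# Scitype of a single object
toy_scitype(::AbstractString) = ToyTextual
toy_scitype(::Integer) = ToyCount
toy_scitype(::AbstractFloat) = ToyContinuous

# Slow path: visit every element and accumulate a type union, O(n) in the length
function toy_scitype_union(v)
    U = Union{}
    for x in v
        U = Union{U, toy_scitype(x)}
    end
    return U
end

# Fast path: decide from the element type alone, independent of the length
toy_elscitype(::AbstractArray{<:AbstractString}) = ToyTextual
toy_elscitype(::AbstractArray{<:Integer}) = ToyCount
toy_elscitype(::AbstractArray{<:AbstractFloat}) = ToyContinuous
# Fallback for element types with no dedicated method (e.g. Any)
toy_elscitype(v::AbstractArray) = toy_scitype_union(v)

v = rand(["A", "B", "C"], 100_000)
@assert toy_scitype_union(v) == toy_elscitype(v)
```

Both paths agree on the answer, but only the first one has to touch all 100,000 elements, which is the kind of cost that adds up across the columns of a large table.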
Old performance
julia> @time coerce(df, :x1=>Multiclass);
20.132844 seconds (3.00 M allocations: 108.735 MiB, 0.08% gc time)
New performance
julia> @time coerce(df, :x1=>Multiclass);
2.876999 seconds (1.00 M allocations: 47.700 MiB, 0.39% gc time)
For the example of @nilshg:
julia> @time schema(df);
0.104949 seconds (4.02 k allocations: 36.231 MiB, 3.85% gc time)
Brilliant, thanks a lot! This was a real pain point in model development.