JuliaAI/ScientificTypes.jl

Coercion/Scitype_union are too slow

Closed this issue · 4 comments

When people report slowdowns with machine (see for instance JuliaAI/MLJ.jl#122, also recently on Slack), the problem is usually scitype_union. Often the user has not called coerce beforehand, which can be the source of the issue. However, when we suggest coerce, that turns out to be slow as well.

Take for instance:

using ScientificTypes, DataFrames
function foo()
    r = rand()
    r < 0.2 && return "AAA"
    r < 0.4 && return "BBB"
    r < 0.6 && return "CCC"
    r < 0.8 && return "DDD"
    return "EEE"
end
v1 = [foo() for i in 1:1_000_000];
v2 =  [foo() for i in 1:1_000_000];
df = DataFrame((x1=v1,x2=v2));
@time coerce(df, :x1=>Multiclass)

This takes around 30 s on my machine. However,

@time categorical(v1);

takes about half a second.

So there must be a way to speed things up.

Here's another MWE, kindly provided by @nilshg. Computing the schema of this dataframe takes minutes.

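using DataFrames, Distributions, Random  # LogNormal comes from Distributions, randstring from Random

n   = 1_115_000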
df  = DataFrame(
        rs  = rand(["A", "B", "C", "D", "E", "F", "G"], n),
        es  = rand(["1","3","5","7","8","9"], n),
        inc = rand([collect(5000:15000); [missing for _ ∈ 1:50]], n),
        gr  = rand([randstring(6) for _ in 1:400], n),
        year = rand(2012:2017, n),
        lt = rand([12, 24, 36, 48, 60, 72], n),
        pb = Int.(round.(rand(LogNormal(8.07, 1.116), n))),
        cb = rand(LogNormal(7.88, 1.288), n),
        ci = rand(0.03:0.01:0.17, n),
        db = rand([repeat(collect(1945:2000),100); missing], n),
        ms = rand(["1", "2", "NA"], n),
        ho = BitArray(rand([true, false], n)),
        el = Int.(round.(rand(LogNormal(1.0, 0.72), n))),
        fl = rand(n),
        ft = rand(n),
        ea = BitArray(rand([true, false], n))
        )

Actually two problems here:

  1. There is no performant version of scitype(::AbstractArray{<:AbstractString}). @tlienart, the fix for this is actually documented: look under "Performance note" at https://alan-turing-institute.github.io/ScientificTypes.jl/stable/#Detailed-usage-examples-1.

  2. There is an errant scitype_union in the definition of schema. We can just use elscitype here (as I believe you already suggested).
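
To illustrate the second point, here is a minimal sketch in plain Julia of why an element-by-element union is so much more expensive than reading the answer off the container's element type. It uses ordinary Julia types rather than scitypes, and the helper names union_by_scan and type_from_eltype are hypothetical, made up for this example. It is not the ScientificTypes implementation, only the general shape of the difference between a scitype_union-style scan and an elscitype-style lookup.

```julia
# Hypothetical helpers for illustration only; these are NOT ScientificTypes functions.

# Slow path: visit every element and accumulate the union of the element types.
# The cost grows linearly with the length of the vector.
function union_by_scan(v::AbstractVector)
    T = Union{}
    for x in v
        T = Union{T, typeof(x)}
    end
    return T
end

# Fast path: read the answer off the container's element type.
# The cost does not depend on the length of the vector.
type_from_eltype(v::AbstractVector) = eltype(v)

v = rand(["AAA", "BBB", "CCC"], 1_000_000)

# Both approaches agree when the container is concretely typed.
@assert union_by_scan(v) == String
@assert type_from_eltype(v) == String

# Compare the timings (run each line twice to exclude compilation time).
@time union_by_scan(v)
@time type_from_eltype(v)
```

The two only agree when the element type of the container is tight. For a Vector{Any} the fast path returns Any, which is presumably why a full scan is still useful as a fallback.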

Old performance

julia> @time coerce(df, :x1=>Multiclass);
 20.132844 seconds (3.00 M allocations: 108.735 MiB, 0.08% gc time)

New performance

julia> @time coerce(df, :x1=>Multiclass);
  2.876999 seconds (1.00 M allocations: 47.700 MiB, 0.39% gc time)

For the example of @nilshg:

julia> @time schema(df);
  0.104949 seconds (4.02 k allocations: 36.231 MiB, 3.85% gc time)

Brilliant, thanks a lot! This was a real pain point in model development.