joshday/OnlineStatsBase.jl

GroupBy Group


I was hoping to be able to do a collection of stats on each group with one pass. For example, something like o = fit!(GroupBy(eltype(group_key), Group(Mean(), Variance(), Extrema())), zip(group_key, obs)).

However, this example errors out with what appears to be a recursive iteration that descends into the group keys. I can't see how else OnlineStatsBase would end up evaluating a Char, which only exists inside the group keys of my example.

I'm not certain whether this is a bug report, feature request, or misuse on my part. Any assistance would be much appreciated. Also, this issue seems to be similar to joshday/OnlineStats.jl#145.

Below is a complete mockup.

julia> using OnlineStats

# Mockup data.
julia> vals = [1 2 3 4; 5 6 7 8; 9 10 11 12] #Note: would come from an iterator.
3×4 Array{Int64,2}:
 1   2   3   4
 5   6   7   8
 9  10  11  12

julia> (nrows, ncols) = size(vals)
(3, 4)

julia> attr1 = repeat(["a", "s", "a"], outer=ncols) #Note: would come from an iterator.
12-element Array{String,1}:
 "a"
 "s"
 "a"
 "a"
 "s"
 "a"
 "a"
 "s"
 "a"
 "a"
 "s"
 "a"

julia> attr2 = repeat(["q", "w", "e", "r"], inner=nrows) #Note: would come from an iterator.
12-element Array{String,1}:
 "q"
 "q"
 "q"
 "w"
 "w"
 "w"
 "e"
 "e"
 "e"
 "r"
 "r"
 "r"

# Define group keys.
julia> group_key = zip(attr1, attr2)
Base.Iterators.Zip{Tuple{Array{String,1},Array{String,1}}}((["a", "s", "a", "a", "s", "a", "a", "s", "a", "a", "s", "a"], ["q", "q", "q", "w", "w", "w", "e", "e", "e", "r", "r", "r"]))

julia> eltype(group_key)
Tuple{String,String}

# Mockup iterator.
julia> iter = zip(group_key, vals)
Base.Iterators.Zip{Tuple{Base.Iterators.Zip{Tuple{Array{String,1},Array{String,1}}},Array{Int64,2}}}((Base.Iterators.Zip{Tuple{Array{String,1},Array{String,1}}}((["a", "s", "a", "a", "s", "a", "a", "s", "a", "a", "s", "a"], ["q", "q", "q", "w", "w", "w", "e", "e", "e", "r", "r", "r"])), [1 2 3 4; 5 6 7 8; 9 10 11 12]))

julia> eltype(iter)
Tuple{Tuple{String,String},Int64}

julia> (i1, state) = iterate(iter)
((("a", "q"), 1), ((2, 2), 2))

julia> (i2, state) = iterate(iter, state)
((("s", "q"), 5), ((3, 3), 3))

julia> (i3, state) = iterate(iter, state)
((("a", "q"), 9), ((4, 4), 4))

# Of the 12 observations, there are 8 unique groups.
julia> first.(collect(iter)) |> unique
8-element Array{Tuple{String,String},1}:
 ("a", "q")
 ("s", "q")
 ("a", "w")
 ("s", "w")
 ("a", "e")
 ("s", "e")
 ("a", "r")
 ("s", "r")

# Setup stat.
julia> o = fit!(GroupBy(eltype(group_key), Mean()), iter)
GroupBy: Tuple{String,String} => Mean{Float64,EqualWeight}
  ├── ("a", "q"): Mean: n=2 | value=5.0
  ├── ("s", "q"): Mean: n=1 | value=5.0
  ├── ("a", "w"): Mean: n=2 | value=6.0
  ├── ("s", "w"): Mean: n=1 | value=6.0
  ├── ("a", "e"): Mean: n=2 | value=7.0
  ├── ("s", "e"): Mean: n=1 | value=7.0
  ├── ("a", "r"): Mean: n=2 | value=8.0
  └── ("s", "r"): Mean: n=1 | value=8.0

julia> o_desired = fit!(GroupBy(eltype(group_key), Group(Mean(), Variance(), Extrema())), iter)
ERROR: The input for GroupBy is a Union{Pair{Tuple{String,String},Union{Tuple{Number,Number,Number}, AbstractArray{#s28,1} where #s28<:Number, NamedTuple{names,R} where R<:Tuple{Number,Number,Number}} where names}, NamedTuple{names,Tuple{Tuple{String,String},Union{Tuple{Number,Number,Number}, AbstractArray{#s28,1} where #s28<:Number, NamedTuple{names,R} where R<:Tuple{Number,Number,Number}} where names}}, Tuple{Tuple{String,String},Union{Tuple{Number,Number,Number}, AbstractArray{#s28,1} where #s28<:Number, NamedTuple{names,R} where R<:Tuple{Number,Number,Number}} where names}} where names.  Found Char.
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] fit!(::GroupBy{Tuple{String,String},Union{Tuple{Number,Number,Number}, AbstractArray{#s28,1} where #s28<:Number, NamedTuple{names,R} where R<:Tuple{Number,Number,Number}} where names,Group{Tuple{Mean{Float64,EqualWeight},Variance{Float64,EqualWeight},Extrema{Float64,Number}},Union{Tuple{Number,Number,Number}, AbstractArray{#s28,1} where #s28<:Number, NamedTuple{names,R} where R<:Tuple{Number,Number,Number}} where names}}, ::Char) at /Users/comara/.julia/packages/OnlineStatsBase/L6i9N/src/OnlineStatsBase.jl:108
 [3] fit! at /Users/comara/.julia/packages/OnlineStatsBase/L6i9N/src/OnlineStatsBase.jl:110 [inlined] (repeats 4 times)
 [4] top-level scope at REPL[29]:1

I'm aware that the results of multiple OnlineStats can easily be combined, since GroupBy uses an OrderedDict. While pragmatic, this approach is less than optimal: each stat keeps its own dictionary, and every observation requires a separate lookup in each one.

o_variance = GroupBy(eltype(group_key), Variance()) # Note: also provides mean and standard deviation.
o_counter = GroupBy(eltype(group_key), Counter()) # Note: n is provided by other OnlineStat.
o_sum = GroupBy(eltype(group_key), Sum())

# Each observation is fed to every GroupBy, so each one does its own key lookup.
for i in iter
    fit!(o_variance, i)
    fit!(o_counter, i)
    fit!(o_sum, i)
end
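
For comparison, here is a minimal sketch of the single-lookup idea in plain Julia, without OnlineStats. The names `RunningStats`, `update!`, `variance`, and `grouped_stats` are hypothetical helpers made up for this illustration and are not part of the OnlineStats API. The sketch assumes Welford's update for the mean and variance and a plain `Dict` rather than an OrderedDict.

```julia
# Hypothetical helper type (not part of OnlineStats): one accumulator per group.
mutable struct RunningStats
    n::Int          # number of observations seen
    mean::Float64   # running mean
    m2::Float64     # running sum of squared deviations from the mean
    lo::Float64     # running minimum
    hi::Float64     # running maximum
end
RunningStats() = RunningStats(0, 0.0, 0.0, Inf, -Inf)

# Welford's update: adjusts mean and m2 without storing past observations.
function update!(s::RunningStats, x::Real)
    s.n += 1
    delta = x - s.mean
    s.mean += delta / s.n
    s.m2 += delta * (x - s.mean)
    s.lo = min(s.lo, x)
    s.hi = max(s.hi, x)
    return s
end

# Sample variance; undefined (NaN) for fewer than two observations.
variance(s::RunningStats) = s.n > 1 ? s.m2 / (s.n - 1) : NaN

# One pass over (key, value) observations with one dictionary lookup each.
function grouped_stats(observations)
    groups = Dict{Any,RunningStats}()
    for (key, x) in observations
        s = get!(groups, key) do
            RunningStats()
        end
        update!(s, x)
    end
    return groups
end

# Same mockup data as above.
vals = [1 2 3 4; 5 6 7 8; 9 10 11 12]
(nrows, ncols) = size(vals)
attr1 = repeat(["a", "s", "a"], outer=ncols)
attr2 = repeat(["q", "w", "e", "r"], inner=nrows)

g = grouped_stats(zip(zip(attr1, attr2), vals))
s = g[("a", "q")]
println((s.n, s.mean, variance(s), s.lo, s.hi))
```

Every statistic for a group lives in one struct, so each observation costs a single dictionary lookup instead of one per stat.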

I realise the error was on my part. In my example, each group should feed into a Series, not a Group.

o_desired = fit!(GroupBy(eltype(group_key), Series(Mean(), Variance(), Extrema())), iter) # Corrected.

Glad you got it figured out!