GroupBy Group
Closed this issue · 3 comments
I was hoping to be able to do a collection of stats on each group with one pass. For example, something like o = fit!(GroupBy(eltype(group_key), Group(Mean(), Variance(), Extrema())), zip(group_key, obs)).
However, this example errors out with what appears to be a recursive iteration that arrives at the group keys. I can't see how else OnlineStatsBase would get to evaluate a Char, which only exists in the group keys of my example.
I'm not certain whether this is a bug report, feature request, or misuse on my part. Any assistance would be much appreciated. Also, this issue seems to be similar to joshday/OnlineStats.jl#145.
Below is a complete mockup.
julia> using OnlineStats
# Mockup data.
julia> vals = [1 2 3 4; 5 6 7 8; 9 10 11 12] #Note: would come from an iterator.
3×4 Array{Int64,2}:
1 2 3 4
5 6 7 8
9 10 11 12
julia> (nrows, ncols) = size(vals)
(3, 4)
julia> attr1 = repeat(["a", "s", "a"], outer=ncols) #Note: would come from an iterator.
12-element Array{String,1}:
"a"
"s"
"a"
"a"
"s"
"a"
"a"
"s"
"a"
"a"
"s"
"a"
julia> attr2 = repeat(["q", "w", "e", "r"], inner=nrows) #Note: would come from an iterator.
12-element Array{String,1}:
"q"
"q"
"q"
"w"
"w"
"w"
"e"
"e"
"e"
"r"
"r"
"r"
# Define group keys.
julia> group_key = zip(attr1, attr2)
Base.Iterators.Zip{Tuple{Array{String,1},Array{String,1}}}((["a", "s", "a", "a", "s", "a", "a", "s", "a", "a", "s", "a"], ["q", "q", "q", "w", "w", "w", "e", "e", "e", "r", "r", "r"]))
julia> eltype(group_key)
Tuple{String,String}
# Mockup iterator.
julia> iter = zip(group_key, vals)
Base.Iterators.Zip{Tuple{Base.Iterators.Zip{Tuple{Array{String,1},Array{String,1}}},Array{Int64,2}}}((Base.Iterators.Zip{Tuple{Array{String,1},Array{String,1}}}((["a", "s", "a", "a", "s", "a", "a", "s", "a", "a", "s", "a"], ["q", "q", "q", "w", "w", "w", "e", "e", "e", "r", "r", "r"])), [1 2 3 4; 5 6 7 8; 9 10 11 12]))
julia> eltype(iter)
Tuple{Tuple{String,String},Int64}
julia> (i1, state) = iterate(iter)
((("a", "q"), 1), ((2, 2), 2))
julia> (i2, state) = iterate(iter, state)
((("s", "q"), 5), ((3, 3), 3))
julia> (i3, state) = iterate(iter, state)
((("a", "q"), 9), ((4, 4), 4))
# Of the 12 observations, there are 8 unique groups.
julia> first.(collect(iter)) |> unique
8-element Array{Tuple{String,String},1}:
("a", "q")
("s", "q")
("a", "w")
("s", "w")
("a", "e")
("s", "e")
("a", "r")
("s", "r")
# Setup stat.
julia> o = fit!(GroupBy(eltype(group_key), Mean()), iter)
GroupBy: Tuple{String,String} => Mean{Float64,EqualWeight}
├── ("a", "q"): Mean: n=2 | value=5.0
├── ("s", "q"): Mean: n=1 | value=5.0
├── ("a", "w"): Mean: n=2 | value=6.0
├── ("s", "w"): Mean: n=1 | value=6.0
├── ("a", "e"): Mean: n=2 | value=7.0
├── ("s", "e"): Mean: n=1 | value=7.0
├── ("a", "r"): Mean: n=2 | value=8.0
└── ("s", "r"): Mean: n=1 | value=8.0
julia> o_desired = fit!(GroupBy(eltype(group_key), Group(Mean(), Variance(), Extrema())), iter)
ERROR: The input for GroupBy is a Union{Pair{Tuple{String,String},Union{Tuple{Number,Number,Number}, AbstractArray{#s28,1} where #s28<:Number, NamedTuple{names,R} where R<:Tuple{Number,Number,Number}} where names}, NamedTuple{names,Tuple{Tuple{String,String},Union{Tuple{Number,Number,Number}, AbstractArray{#s28,1} where #s28<:Number, NamedTuple{names,R} where R<:Tuple{Number,Number,Number}} where names}}, Tuple{Tuple{String,String},Union{Tuple{Number,Number,Number}, AbstractArray{#s28,1} where #s28<:Number, NamedTuple{names,R} where R<:Tuple{Number,Number,Number}} where names}} where names. Found Char.
Stacktrace:
[1] error(::String) at ./error.jl:33
[2] fit!(::GroupBy{Tuple{String,String},Union{Tuple{Number,Number,Number}, AbstractArray{#s28,1} where #s28<:Number, NamedTuple{names,R} where R<:Tuple{Number,Number,Number}} where names,Group{Tuple{Mean{Float64,EqualWeight},Variance{Float64,EqualWeight},Extrema{Float64,Number}},Union{Tuple{Number,Number,Number}, AbstractArray{#s28,1} where #s28<:Number, NamedTuple{names,R} where R<:Tuple{Number,Number,Number}} where names}}, ::Char) at /Users/comara/.julia/packages/OnlineStatsBase/L6i9N/src/OnlineStatsBase.jl:108
[3] fit! at /Users/comara/.julia/packages/OnlineStatsBase/L6i9N/src/OnlineStatsBase.jl:110 [inlined] (repeats 4 times)
[4] top-level scope at REPL[29]:1
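One reading of the trace, assuming fit! falls back to iterating over any input whose type does not match what the stat expects: the observation (("a", "q"), 1) carries a single Int where Group(Mean(), Variance(), Extrema()) expects three values, so it gets unpacked element by element until a Char is reached. The sketch below reproduces that unpacking in plain Julia, without OnlineStats; the variable names are illustrative only.

```julia
# One observation from the mockup iterator: (group key, value).
obs = (("a", "q"), 1)

key = first(obs)   # the group key, a tuple of two Strings
str = first(key)   # the first attribute, a String
chr = first(str)   # iterating a String yields Char values

# Print the type found at each level of unpacking.
println(typeof(obs))
println(typeof(key))
println(typeof(str))
println(typeof(chr))
```

The last level is a Char, which matches the "Found Char" at the end of the error message.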
I'm aware that the results of multiple OnlineStats can be easily combined, since GroupBy uses an OrderedDict. While pragmatic, this approach is less than optimal: each stat gets its own dictionary, and every observation requires a separate lookup in each one.
o_variance = GroupBy(eltype(group_key), Variance()) # Note: also provides mean and standard deviation.
o_counter = GroupBy(eltype(group_key), Counter()) # Note: n is provided by other OnlineStat.
o_sum = GroupBy(eltype(group_key), Sum())
for i in iter
    fit!(o_variance, i)
    fit!(o_counter, i)
    fit!(o_sum, i)
end
I realise the error on my part. In my example, each group should feed into Series, not Group.
o_desired = fit!(GroupBy(eltype(group_key), Series(Mean(), Variance(), Extrema())), iter) # Corrected.
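For anyone landing here with the same confusion, a minimal sketch of the difference, assuming a reasonably recent OnlineStats: Series passes each observation to every stat, while Group passes the i-th element of each observation to the i-th stat. The small data set and the variable names below are made up for illustration, and the field accesses (o.value, .stats) rely on package internals that may differ between versions.

```julia
using OnlineStats

# Series: every stat sees the same scalar observation.
s = fit!(Series(Mean(), Variance(), Extrema()), [1, 5, 9])

# Group: each observation must have one element per stat;
# Mean sees 1, 5, 9; Variance sees 2, 6, 10; Extrema sees 3, 7, 11.
g = fit!(Group(Mean(), Variance(), Extrema()), [(1, 2, 3), (5, 6, 7), (9, 10, 11)])

# Grouped version: one scalar per observation, so Series is the right wrapper.
keys_demo = [("a", "q"), ("s", "q"), ("a", "q")]   # illustrative keys
vals_demo = [1, 5, 9]                              # illustrative values
o = fit!(GroupBy(Tuple{String,String}, Series(Mean(), Variance(), Extrema())),
         zip(keys_demo, vals_demo))

# Look up the stats for a single group (o.value is the underlying dictionary).
println(o.value[("a", "q")])
```

With this layout there is a single dictionary lookup per observation, and all three stats for the group are updated together.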
Glad you got it figured out!