Notice: StreamStats.jl has been deprecated in favor of OnlineStats.jl. OnlineStats has a superset of the features available in StreamStats and development is active.
Compute statistics from a stream of data. Useful when:
- Interim statistics must be available before the stream is fully processed
- Analysis of data must use no more than O(1) memory
- Many streams of data must be processed in parallel and results later merged
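To make the O(1) memory and merge requirements concrete, here is a minimal sketch of a running variance that keeps only three numbers and can combine the summaries of two separately processed streams. It uses Welford's update and the pairwise merge formula of Chan et al. It is not StreamStats code: RunningVar, observe!, variance, and combine are hypothetical names chosen for this illustration.

```julia
# Hypothetical type for illustration; not part of StreamStats.
mutable struct RunningVar
    n::Int        # observations seen so far
    mu::Float64   # running mean
    m2::Float64   # sum of squared deviations from the running mean
end

RunningVar() = RunningVar(0, 0.0, 0.0)

# Welford's update: fold one observation into the summary in O(1) time and memory.
function observe!(s::RunningVar, x::Real)
    s.n += 1
    delta = x - s.mu
    s.mu += delta / s.n
    s.m2 += delta * (x - s.mu)
    return s
end

# Unbiased sample variance; NaN until at least two observations have arrived.
variance(s::RunningVar) = s.n > 1 ? s.m2 / (s.n - 1) : NaN

# Combine the summaries of two disjoint streams (pairwise formula of Chan et al.).
function combine(a::RunningVar, b::RunningVar)
    n = a.n + b.n
    n == 0 && return RunningVar()
    delta = b.mu - a.mu
    mu = a.mu + delta * b.n / n
    m2 = a.m2 + b.m2 + delta^2 * a.n * b.n / n
    return RunningVar(n, mu, m2)
end

# Process two halves independently, then merge and compare with a single pass.
left, right, whole = RunningVar(), RunningVar(), RunningVar()
for x in (1.0, 2.0, 3.0, 4.0)
    observe!(left, x)
    observe!(whole, x)
end
for x in (5.0, 6.0, 7.0, 8.0)
    observe!(right, x)
    observe!(whole, x)
end
merged = combine(left, right)
println((merged.mu, variance(merged)))
```

The merged summary should agree with the single-pass summary up to floating-point rounding, which is what allows parallel streams to be reduced at the end.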
Every statistic is constructed as a mutable object that updates state with each new observation:
using StreamStats
var_x = StreamStats.Var()
var_y = StreamStats.Var()
cov_xy = StreamStats.Cov()
xs = randn(10)
ys = 3.1 * xs + randn(10)
for (x, y) in zip(xs, ys)
    update!(var_x, x)
    update!(var_y, y)
    update!(cov_xy, x, y)
    @printf("Estimated covariance: %f\n", state(cov_xy))
end
state(var_x), var(var_x), std(var_x)
state(cov_xy), cov(cov_xy), cor(cov_xy)
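As you can see, you update a statistic with the update! function and extract the current estimate with the state function, or with a statistic-specific accessor such as var, std, cov, or cor.
The following statistics are available: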
- StreamStats.Mean
- StreamStats.Var
- StreamStats.Moments
- StreamStats.Min
- StreamStats.Max
- StreamStats.ApproxDistinct
- StreamStats.Cov
- StreamStats.Sample
- StreamStats.ApproxOLS
- StreamStats.ApproxLogit
It is also possible to estimate confidence intervals for online statistics using online bootstrap methods:
using StreamStats
stat = StreamStats.Cov()
ci1 = StreamStats.BootstrapBernoulli(stat, 1_000, 0.05)
ci2 = StreamStats.BootstrapPoisson(stat, 1_000, 0.05)
xs = randn(100)
ys = randn(100)
for (x, y) in zip(xs, ys)
    update!(stat, x, y)
    update!(ci1, x, y)
    update!(ci2, x, y)
end
state(stat), state(ci1), state(ci2)
Given any other statistic object, you can use the BootstrapBernoulli or BootstrapPoisson types to estimate a confidence interval. These types require that you specify the number of bootstrap replicates (e.g. 1_000) and the error rate for nominal coverage of the confidence interval (e.g. 0.05, for a nominal 95% interval).
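As a rough illustration of how an online Poisson bootstrap can work, the sketch below keeps a set of replicate estimates of a mean. Each replicate counts every incoming observation a random Poisson(1) number of times, and the interval is read off the quantiles of the replicate estimates. This is an assumption about the general technique, not a description of how StreamStats implements BootstrapPoisson. The names poisson1, PoissonBootstrapMean, observe!, and interval are hypothetical, and the example handles only the mean.

```julia
using Random, Statistics

# Hypothetical helper: draw from Poisson(1) with Knuth's multiplication method.
function poisson1(rng::AbstractRNG)
    limit = exp(-1.0)
    k = 0
    p = rand(rng)
    while p > limit
        k += 1
        p *= rand(rng)
    end
    return k
end

# Hypothetical type for illustration; not part of StreamStats.
struct PoissonBootstrapMean
    weights::Vector{Float64}  # per-replicate sum of resampling weights
    totals::Vector{Float64}   # per-replicate weighted sum of observations
    alpha::Float64            # error rate, e.g. 0.05 for a nominal 95% interval
    rng::MersenneTwister
end

PoissonBootstrapMean(replicates::Integer, alpha::Real; seed::Integer = 1) =
    PoissonBootstrapMean(zeros(replicates), zeros(replicates), alpha, MersenneTwister(seed))

# Each replicate sees the new observation w times, with w drawn from Poisson(1).
function observe!(b::PoissonBootstrapMean, x::Real)
    for r in eachindex(b.weights)
        w = poisson1(b.rng)
        b.weights[r] += w
        b.totals[r] += w * x
    end
    return b
end

# Percentile interval over the replicate means (replicates with zero weight are skipped).
function interval(b::PoissonBootstrapMean)
    estimates = [t / w for (t, w) in zip(b.totals, b.weights) if w > 0]
    return quantile(estimates, b.alpha / 2), quantile(estimates, 1 - b.alpha / 2)
end

data = randn(MersenneTwister(42), 500) .+ 3.0
boot = PoissonBootstrapMean(200, 0.05)
for x in data
    observe!(boot, x)
end
lo, hi = interval(boot)
println((lo, hi))
```

Memory use depends on the number of replicates rather than the length of the stream, which is why this kind of resampling suits streaming data.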
The code for computing moments from a stream is derived from John D. Cook's code for computing the skewness and kurtosis of a data stream online.
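For reference, the single-pass update for the first four central moments, in the form commonly attributed to that code, can be sketched as follows. This is a standalone illustration rather than the StreamStats.Moments source: RunningMoments, observe!, skewness, and excess_kurtosis are hypothetical names.

```julia
# Hypothetical type for illustration; the field names are not StreamStats's.
mutable struct RunningMoments
    n::Int
    m1::Float64  # running mean
    m2::Float64  # sum of (x - mean)^2
    m3::Float64  # sum of (x - mean)^3
    m4::Float64  # sum of (x - mean)^4
end

RunningMoments() = RunningMoments(0, 0.0, 0.0, 0.0, 0.0)

function observe!(s::RunningMoments, x::Real)
    n1 = s.n
    s.n += 1
    n = s.n
    delta = x - s.m1
    delta_n = delta / n
    delta_n2 = delta_n^2
    term1 = delta * delta_n * n1
    s.m1 += delta_n
    # m4 and m3 are updated before m2 because they use the previous values.
    s.m4 += term1 * delta_n2 * (n^2 - 3 * n + 3) + 6 * delta_n2 * s.m2 - 4 * delta_n * s.m3
    s.m3 += term1 * delta_n * (n - 2) - 3 * delta_n * s.m2
    s.m2 += term1
    return s
end

# Population skewness and excess kurtosis from the accumulated sums.
skewness(s::RunningMoments) = sqrt(s.n) * s.m3 / s.m2^1.5
excess_kurtosis(s::RunningMoments) = s.n * s.m4 / s.m2^2 - 3

sym = RunningMoments()
for x in (1.0, 2.0, 3.0, 4.0, 5.0)
    observe!(sym, x)
end

skewed = RunningMoments()
for x in (1.0, 1.0, 1.0, 5.0)
    observe!(skewed, x)
end
println((skewness(sym), excess_kurtosis(sym), skewness(skewed)))
```

The order of the updates matters: each higher moment is computed from the lower moments as they stood before the new observation was folded in.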