Online, but what about streaming?
BoltzmannBrain opened this issue · 6 comments
The algorithm runs online, but with the assumption that we have the length of the dataset a priori. What about streaming scenarios where we have a continuous stream of data? Is there an (efficient) way to run the online algorithm without knowing the length of the dataset?
I'll have a look into it. I think the actual algorithm is capable of that, but the implementation is not engineered for it. I guess it will need only some refactoring and no actual fiddling around with the math.
Ok. I implemented a class with the ability to add_data(self, data). Can you have a look at 0f9ff01 within branch real_online? I haven't tested it yet, would you be so kind to give it a try?
FWIW I tried an approach where I started the run-lengths matrix as a list [[]] and appended rows and columns for each subsequent timestep. This got very inefficient quickly, even with <1000 data records. The key here is to figure out a way to not keep old data in the run-lengths matrix R. This can be done column-wise, keeping only the columns for the current timestep and timestep+1, but the rows still grow indefinitely.
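For concreteness, here is a minimal hypothetical sketch of the column-wise idea, reusing the observation_likelihood / hazard_func interface from this repo but keeping only a single run-length column r instead of the full R matrix (the function name and structure are illustrative, not what's in the repo):
import numpy as np

def step(r, x, observation_likelihood, hazard_func):
    predprobs = observation_likelihood.pdf(x)       # one value per run length
    H = hazard_func(np.arange(len(r)))              # hazard per run length
    r_next = np.empty(len(r) + 1)
    r_next[1:] = r * predprobs * (1 - H)            # growth probabilities
    r_next[0] = np.sum(r * predprobs * H)           # changepoint mass at r = 0
    r_next /= np.sum(r_next)                        # renormalize
    observation_likelihood.update_theta(x)
    return r_next

# Starting from r = np.array([1.0]) and calling step() on each new datum keeps
# only one column in memory, but len(r) still grows by one per timestep, which
# is exactly the row-growth problem described above.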
I see. My first guess for a more efficient approximation would be that we could ignore many R values in the sums, as they might be relatively small. I.e. if R[23, 123] < eps, then R[24:, 123] can be ignored. Would that help?
A sketch:
def find_first_t_smaller_eps(R, eps):
    # Return the index of the first run-length probability below eps;
    # if none is below eps, keep the whole column.
    for t, r in enumerate(R):
        if r < eps:
            return t
    return len(R)

# Inside the usual loop over timesteps t and data points x:

# Evaluate the predictive distribution for the new datum under each of
# the parameters. This is the standard thing from Bayesian inference.
predprobs = observation_likelihood.pdf(x)
# Find the first occurrence of a run-length probability smaller than a
# defined epsilon; everything beyond it is ignored.
tm = find_first_t_smaller_eps(R[0:t+1, t], eps)
# Evaluate the hazard function for this (truncated) interval.
H = hazard_func(np.array(range(tm)))
# Evaluate the growth probabilities - shift the probabilities down and to
# the right, scaled by the hazard function and the predictive
# probabilities (only the first tm predictive probabilities are needed).
R[1:tm+1, t+1] = R[0:tm, t] * predprobs[0:tm] * (1-H)
# Evaluate the probability that there *was* a changepoint and we're
# accumulating the mass back down at r = 0.
R[0, t+1] = np.sum(R[0:tm, t] * predprobs[0:tm] * H)
# Renormalize the run length probabilities for improved numerical
# stability.
R[:tm+1, t+1] = R[:tm+1, t+1] / np.sum(R[:tm+1, t+1])
# Update the parameter sets for each possible run length.
observation_likelihood.update_theta(x)
maxes[t] = R[:, t].argmax()
Thus we ignore all probabilities smaller than eps and can compute the sums more efficiently. To be more space efficient, we could even drop everything above the biggest tm.
This may work well; it's not clear to me without running it on some datasets.
I've implemented a solution as an anomaly detection algorithm in NAB, and will link it here once I clean it up a bit and push it. The main idea is that because we're only interested in the points at which the data stream changes (i.e. a changepoint), we can keep only the data that would reveal a changepoint at the current timestep. It actually works slightly better than using the entire preallocated R matrix 😉
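A rough sketch of that idea, assuming a fixed window max_run_length beyond which run-length mass is lumped together (the class name, add_data signature, and window handling here are illustrative assumptions, not the actual NAB code):
import numpy as np

class StreamingBocpd(object):
    """Hypothetical sketch, not the NAB detector: bound the run-length
    distribution to max_run_length entries so memory stays constant."""
    def __init__(self, observation_likelihood, hazard_func, max_run_length=500):
        self.observation_likelihood = observation_likelihood
        self.hazard_func = hazard_func
        self.max_run_length = max_run_length
        self.r = np.array([1.0])   # run-length distribution, starts at r = 0

    def add_data(self, x):
        # Slice the predictive probabilities to the run lengths we still track.
        predprobs = self.observation_likelihood.pdf(x)[:len(self.r)]
        H = self.hazard_func(np.arange(len(self.r)))
        r_next = np.empty(len(self.r) + 1)
        r_next[1:] = self.r * predprobs * (1 - H)    # runs grow by one
        r_next[0] = np.sum(self.r * predprobs * H)   # changepoint mass
        r_next /= np.sum(r_next)
        self.observation_likelihood.update_theta(x)
        # Approximation: lump everything past the window into the last entry,
        # so the vector never exceeds max_run_length entries.  (A complete
        # implementation would also need to truncate the parameter arrays kept
        # inside observation_likelihood, which otherwise still grow.)
        if len(r_next) > self.max_run_length:
            r_next[self.max_run_length - 1] = np.sum(r_next[self.max_run_length - 1:])
            r_next = r_next[:self.max_run_length]
        self.r = r_next
        return self.r[0]   # probability that a changepoint just occurred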
The online algorithm is implemented here as an anomaly detector in NAB.