Online, but what about streaming?
BoltzmannBrain opened this issue · 6 comments
The algorithm runs online, but with the assumption that we have the length of the dataset a priori. What about streaming scenarios where we have a continuous stream of data? Is there an (efficient) way to run the online algorithm without knowing the length of the dataset?
I'll have a look into it. I think the actual algorithm is capable of that, but the implementation is not engineered for it. I guess it will need only some refactoring and no actual fiddling around with the math.
Ok. I implemented a class with the ability to add_data(self, data). Can you have a look at 0f9ff01 within branch real_online? I haven't tested it yet, would you be so kind to give it a try?
FWIW I tried an approach where I started the run-lengths matrix as a list [[]] and appended rows and columns for each subsequent timestep. This got very inefficient quickly, even with <1000 data records. The key here is to figure out a way to not keep old data in the run-lengths matrix R. This can be done column-wise, keeping only the columns for the current timestep and timestep+1, but the rows still grow indefinitely.
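For concreteness, here is a minimal hypothetical sketch of the column-wise idea, reusing the observation_likelihood / hazard_func interface from this repo but keeping only a single run-length column r instead of the full R matrix (the function name and structure are illustrative, not what's in the repo):
import numpy as np

def step(r, x, observation_likelihood, hazard_func):
    predprobs = observation_likelihood.pdf(x)       # one value per run length
    H = hazard_func(np.arange(len(r)))              # hazard per run length
    r_next = np.empty(len(r) + 1)
    r_next[1:] = r * predprobs * (1 - H)            # growth probabilities
    r_next[0] = np.sum(r * predprobs * H)           # changepoint mass at r = 0
    r_next /= np.sum(r_next)                        # renormalize
    observation_likelihood.update_theta(x)
    return r_next

# Starting from r = np.array([1.0]) and calling step() on each new datum keeps
# only one column in memory, but len(r) still grows by one per timestep, which
# is exactly the row-growth problem described above.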
I see. My first guess for a more efficient approximation would be that we could ignore many R values in the sums, as they might be relatively small. I.e. if R[23, 123] < eps, then R[24:, 123] can be ignored. Would that help?
A sketch:
def find_first_t_smaller_eps(R, eps):
    # Return the index of the first run-length probability below eps;
    # if none is below eps, keep the whole column.
    for t, r in enumerate(R):
        if r < eps:
            return t
    return len(R)

# Inside the usual loop over timesteps t and data points x:

# Evaluate the predictive distribution for the new datum under each of
# the parameters. This is the standard thing from Bayesian inference.
predprobs = observation_likelihood.pdf(x)
# Find the first occurrence of a run-length probability smaller than a
# defined epsilon; everything beyond it is ignored.
tm = find_first_t_smaller_eps(R[0:t+1, t], eps)
# Evaluate the hazard function for this (truncated) interval.
H = hazard_func(np.array(range(tm)))
# Evaluate the growth probabilities - shift the probabilities down and to
# the right, scaled by the hazard function and the predictive
# probabilities (only the first tm predictive probabilities are needed).
R[1:tm+1, t+1] = R[0:tm, t] * predprobs[0:tm] * (1-H)
# Evaluate the probability that there *was* a changepoint and we're
# accumulating the mass back down at r = 0.
R[0, t+1] = np.sum(R[0:tm, t] * predprobs[0:tm] * H)
# Renormalize the run length probabilities for improved numerical
# stability.
R[:tm+1, t+1] = R[:tm+1, t+1] / np.sum(R[:tm+1, t+1])
# Update the parameter sets for each possible run length.
observation_likelihood.update_theta(x)
maxes[t] = R[:, t].argmax()
Thus we ignore all probabilities smaller than eps and can compute the sums more efficiently. To be more space efficient, we could even drop everything above the biggest tm.
This may work well; it's not clear to me without running it on some datasets.
I've implemented a solution as an anomaly detection algorithm in NAB, and will link it here once I clean it up a bit and push it. The main idea is that because we're only interested in the points at which the data stream changes (i.e. a changepoint), we can keep only the data that would reveal a changepoint at the current timestep. It actually works slightly better than using the entire preallocated R matrix 😉
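A rough sketch of that idea, assuming a fixed window max_run_length beyond which run-length mass is lumped together (the class name, add_data signature, and window handling here are illustrative assumptions, not the actual NAB code):
import numpy as np

class StreamingBocpd(object):
    """Hypothetical sketch, not the NAB detector: bound the run-length
    distribution to max_run_length entries so memory stays constant."""
    def __init__(self, observation_likelihood, hazard_func, max_run_length=500):
        self.observation_likelihood = observation_likelihood
        self.hazard_func = hazard_func
        self.max_run_length = max_run_length
        self.r = np.array([1.0])   # run-length distribution, starts at r = 0

    def add_data(self, x):
        # Slice the predictive probabilities to the run lengths we still track.
        predprobs = self.observation_likelihood.pdf(x)[:len(self.r)]
        H = self.hazard_func(np.arange(len(self.r)))
        r_next = np.empty(len(self.r) + 1)
        r_next[1:] = self.r * predprobs * (1 - H)    # runs grow by one
        r_next[0] = np.sum(self.r * predprobs * H)   # changepoint mass
        r_next /= np.sum(r_next)
        self.observation_likelihood.update_theta(x)
        # Approximation: lump everything past the window into the last entry,
        # so the vector never exceeds max_run_length entries.  (A complete
        # implementation would also need to truncate the parameter arrays kept
        # inside observation_likelihood, which otherwise still grow.)
        if len(r_next) > self.max_run_length:
            r_next[self.max_run_length - 1] = np.sum(r_next[self.max_run_length - 1:])
            r_next = r_next[:self.max_run_length]
        self.r = r_next
        return self.r[0]   # probability that a changepoint just occurred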
The online algorithm is implemented here as an anomaly detector in NAB.