hildensia/bayesian_changepoint_detection

Online, but what about streaming?

BoltzmannBrain opened this issue · 6 comments

The algorithm runs online, but with the assumption that we know the length of the dataset a priori. What about streaming scenarios where we have a continuous stream of data? Is there an (efficient) way to run the online algorithm without knowing the length of the dataset?

I'll have a look into it. I think the actual algorithm is capable of that, but the implementation is not engineered for it. I guess it will need only some refactoring and no actual fiddling around with the math.

Ok. I implemented a class with the ability to add_data(self, data). Can you have a look at 0f9ff01 within branch real_online? I haven't tested it yet; would you be so kind as to give it a try?

FWIW I tried an approach where I started the run-length matrix as a list [[]] and appended rows and columns for each subsequent timestep. This became very inefficient quickly, even with fewer than 1000 data records. The key here is to figure out a way to not keep old data in the run-length matrix R. This can be done column-wise, keeping only the columns for the current timestep and timestep+1, but the rows still grow indefinitely.
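For concreteness, here's a rough sketch of that column-wise variant (untested; hazard_func and observation_likelihood are stand-ins with the same interface as the existing online code, and the returned vector still grows by one entry per step, which is the remaining problem):

import numpy as np

def streaming_step(r, x, hazard_func, observation_likelihood):
    # One online changepoint update that keeps only the current run-length
    # distribution r (a 1-D array) instead of the full R matrix.
    t = len(r)
    predprobs = observation_likelihood.pdf(x)   # predictive prob. per run length
    H = hazard_func(np.arange(t))               # hazard per run length

    r_new = np.empty(t + 1)
    r_new[1:] = r * predprobs * (1 - H)         # growth probabilities
    r_new[0] = np.sum(r * predprobs * H)        # changepoint probability
    r_new /= np.sum(r_new)                      # renormalise for numerical stability

    observation_likelihood.update_theta(x)      # update sufficient statistics
    return r_new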

I see. My first guess for a more efficient approximation would be that we could ignore many R values in the sums, as they might be relatively small. I.e. if R[23, 123] < eps then R[24:, 123] can be ignored. Would that help?

A sketch:

def find_first_t_smaller_eps(R, eps):
    # Return the index of the first probability below eps
    # (or the full length if none is below eps).
    for t, r in enumerate(R):
        if r < eps:
            return t
    return len(R)

# Evaluate the predictive distribution for the new datum under each of
# the parameters.  This is the standard thing from Bayesian inference.
predprobs = observation_likelihood.pdf(x)

# Find the first occurrence of a probability smaller than a defined epsilon
tm = find_first_t_smaller_eps(R[0:t+1, t], eps)

# Evaluate the hazard function for this interval
H = hazard_func(np.array(range(tm)))

# Evaluate the growth probabilities - shift the probabilities down and to
# the right, scaled by the hazard function and the predictive
# probabilities.
R[1:tm+1, t+1] = R[0:tm, t] * predprobs[0:tm] * (1-H)

# Evaluate the probability that there *was* a changepoint and we're
# accumulating the mass back down at r = 0.
R[0, t+1] = np.sum(R[0:tm, t] * predprobs[0:tm] * H)

# Renormalize the run length probabilities for improved numerical
# stability.
R[:tm+1, t+1] = R[:tm+1, t+1] / np.sum(R[:tm+1, t+1])

# Update the parameter sets for each possible run length.
observation_likelihood.update_theta(x)

maxes[t] = R[:, t].argmax()

Thus we ignore all probabilities smaller than eps and can compute the sums more efficiently. To be more space efficient, we could even discard everything above the biggest tm.
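As a rough sketch of that pruning step (untested; r is a single run-length column and eps is the threshold from above):

def prune_run_lengths(r, eps):
    # Truncate the run-length distribution at the first probability below eps
    # and renormalise the kept mass.
    below = np.flatnonzero(r < eps)
    if below.size == 0:
        return r
    tm = max(below[0], 1)   # always keep at least the r = 0 entry
    return r[:tm] / np.sum(r[:tm])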

This may work well; it's hard to say without running it on some datasets.

I've implemented a solution as an anomaly detection algorithm in NAB, and will link it here once I clean it up a bit and push it. The main idea is that, since we're only interested in the points at which the data stream changes (i.e. changepoints), we can keep only the data that would show us a changepoint for the current timestep. It actually works slightly better than using the entire preallocated R matrix 😉
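Roughly, the detection loop looks like this (a sketch of the idea, not the actual NAB code; streaming_step and prune_run_lengths are the hypothetical helpers sketched earlier in this thread):

r = np.array([1.0])        # run-length distribution, all mass at r = 0
for x in data_stream:      # data_stream: any iterable of observations
    r = streaming_step(r, x, hazard_func, observation_likelihood)
    r = prune_run_lengths(r, eps)
    score = r[0]           # mass at run length 0 = probability of a changepoint now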

The online algorithm is implemented here as an anomaly detector in NAB.