SteffenMoritz/imputeTS

na_kalman is slow for long time series

Opened this issue · 8 comments

Hi @SteffenMoritz

Thanks for the amazing package imputeTS. However, I found it to be slow when imputing long time series (~3000 daily observations) with na_kalman(x, model = "StructTS").

Below is a reproducible example:
library(imputeTS)
series <- ts(rnorm(3000), start = c(2000, 1), frequency = 365.25)
na_idx <- sample(1:3000, 900)  # positions to set to NA (renamed to avoid shadowing sample())
series[na_idx] <- NA
na_kalman(series)

Related Stack Overflow question:
https://stackoverflow.com/questions/52841828/why-is-imputets-hanging-taking-so-long-to-na-kalman-this-data-set

Is there any planned development to solve these performance issues when encountering long time series?

Hey @jonekeat, sorry for the late answer, I have been quite busy.

I definitely know about these performance issues of na_kalman and I am also not very satisfied with them.

Compared to the other methods, na_kalman is of course simply more complex and will probably always be much slower than, e.g., na_interpolation.

What I would have to do is switch to another implementation or library for the Kalman filter parts (this is the current bottleneck). A nice side benefit: this would also solve another issue, an error that occurs in a certain edge case.

I have this in mind and I think an improvement will come, but don't expect it in the short term.
The next package version will be mainly a plot update to have better looking plots.

Maybe this package could help with performance?

This looks interesting @trafficonese, it could solve our problems. Any idea why it isn't on CRAN?

I'd say that, since rcppkalman is not on CRAN, the easiest way to integrate these functions would be to copy the needed source files (and only these) into the imputeTS package.

On the rcppkalman Github it says:
GNU General Public License, Version 2 or later. EKF/UKF itself (which is included) is released under the GNU General Public License, Versions 2 and 3.

I'd guess that if we mark the origin of the copied files in the source code, this should be alright, since imputeTS is also GNU GPL 3. What do you think about this, @trafficonese?

It might be worth a try to test the functions and benchmark them with the current solution.

For now, I was testing the given example and found that it's actually the function stats::StructTS which is slow for time-series with high frequencies.

When you do the original example with frequency = 1 it works quite fast, but as soon as it goes above ~25, it freezes my R session.
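
A minimal sketch of how this frequency effect could be checked with base R only. According to the StructTS documentation, when `type` is omitted a basic structural model (BSM) is used for series with `frequency(x) > 1` and a local trend model otherwise, so the state vector grows with the frequency. The series length, the chosen frequencies, and the helper name `time_structts` are illustrative assumptions, kept small so the example finishes quickly.

```r
# Time stats::StructTS on the same values under different ts frequencies.
set.seed(1)
values <- rnorm(300)
values[sample(2:300, 90)] <- NA  # about 30% missing, as in the original report

# Hypothetical helper: fit StructTS at one frequency and record the cost.
time_structts <- function(freq) {
  x <- ts(values, frequency = freq)
  # Convergence warnings are suppressed; only the elapsed time matters here.
  elapsed <- system.time(
    fit <- suppressWarnings(stats::StructTS(x))
  )[["elapsed"]]
  data.frame(
    frequency = freq,
    states = length(fit$model0$a),  # size of the state vector
    elapsed = elapsed
  )
}

timings <- do.call(rbind, lapply(c(1, 4, 12), time_structts))
print(timings)
```

If the explanation above holds, the `states` column should grow with the frequency and the elapsed time should tend to follow it. Actual timings depend on the machine and the data.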

rcppkalman still isn't on CRAN, but it seems they are actively developing it.

I also found the package ‘FKF’ (Fast Kalman Filter). I'll have a look at whether it can be used.

Line 200 of file na_kalman.R:
mod <- stats::StructTS(data, ...)$model0
Here model0 is the initial state of the filter, not the final state fitted by maximum likelihood. Fitting the latter consumes a lot of time, but any call to stats::StructTS has to perform the maximum likelihood estimation.
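
A small sketch of the point above, using base R only. It shows that a single `StructTS` call returns both `model` and `model0`, so the optimisation cost is paid even when only `model0` is used. The smoothing step afterwards is an assumption about how such a model could then be used to fill gaps; it is not quoted from na_kalman.R. The data and variable names are illustrative.

```r
set.seed(1)
x <- ts(rnorm(200), frequency = 4)
x[sample(2:200, 40)] <- NA

# One call performs the likelihood optimisation and returns both models.
fit <- suppressWarnings(stats::StructTS(x))
print(intersect(c("model", "model0"), names(fit)))

# Use model0 as in the quoted line.
mod <- fit$model0

# Assumed follow-up step: Kalman smoothing, then mapping the smoothed
# states back to the observation scale through the Z vector of the model.
smoothed <- stats::KalmanSmooth(x, mod, nit = -1)
missing_idx <- which(is.na(x))
filled <- x
filled[missing_idx] <- smoothed$smooth[missing_idx, , drop = FALSE] %*% as.matrix(mod$Z)
```

Under these assumptions, the expensive part is the `StructTS` call itself; the smoothing pass over the data is a single run of the filter.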

Ah, nice find. I just realized you first wrote here and then in a new thread.

I see your point. Looking at the StructTS source code, there does indeed seem to be one unnecessary call to KalmanRun, which performs Kalman filtering to find the (Gaussian) log-likelihood. That is not needed for model0, only for model.

I think I could edit the source code of StructTS, remove the unnecessary parts, and add it to the imputeTS package, but that would mean I have to maintain it on my own.

It also seems to me that this part does not really contribute that much to the computing time.
I will have to spend a little more time on this, but my current attempts at profiling the function seemed to show that the

optim(init[mask], getLike, method = "L-BFGS-B", lower = rep(0, 
        np + 1L), upper = rep(Inf, np + 1L), control = optim.control)

part in StructTS consumes nearly ALL of the computing time ...with everything else basically being irrelevant.

But I have to check this again...the profiling was only a quick try.
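
For anyone who wants to repeat the check, a quick profiling run could look roughly like the sketch below. It uses the base R sampling profiler (`Rprof` and `summaryRprof` from utils) on a direct `StructTS` call. The series length, frequency, and sampling interval are assumptions chosen so the run is long enough to collect samples; this is not the exact profiling that was done above.

```r
set.seed(1)
x <- ts(rnorm(2000), frequency = 12)
x[sample(2:2000, 600)] <- NA

# Sample the call stack while StructTS runs.
prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.005)
fit <- suppressWarnings(stats::StructTS(x))
Rprof(NULL)

# by.total lists, per function, the time spent in it including its callees.
prof <- summaryRprof(prof_file)
print(head(prof$by.total, 10))
```

If `optim` really dominates, it should appear near the top of `by.total` with a total share close to that of `StructTS` itself.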

Hi @SteffenMoritz

I hope this message finds you well.

I am trying to impute missing values using:

na_kalman(forex_ts, model = "auto.arima", smooth = TRUE)

and

forex_kalman1 <- na_kalman(forex_ts, model = "StructTS", smooth = TRUE)

but the run takes a long time (more than 3 hours), even though my laptop is new (MacBook Pro 2020).

Are there any suggestions for cutting the runtime (e.g. reducing the number of iterations or something like that)?

Regards,

Ahmad
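
Based on the discussion above, a few things might be worth trying for the StructTS case; this is a hedged sketch, not a confirmed fix. Since na_kalman.R calls `stats::StructTS(data, ...)`, extra arguments such as `type` or `optim.control` should be forwarded to `StructTS`. Both arguments are part of the `StructTS` signature, but each option changes the fitted model, so the imputed values may differ from the default run. The data below is synthetic and the variable names are illustrative. The `auto.arima` case is not covered here.

```r
set.seed(1)
x <- ts(rnorm(400), frequency = 12)
x[sample(2:400, 120)] <- NA

# Option 1: cap the number of optimiser iterations.
# The fit may then stop before convergence.
t_capped <- system.time(
  fit_capped <- suppressWarnings(
    stats::StructTS(x, optim.control = list(maxit = 10))
  )
)[["elapsed"]]

# Option 2: request a local level model instead of the default BSM.
# This ignores seasonality but keeps the state vector at length 1.
t_level <- system.time(
  fit_level <- suppressWarnings(stats::StructTS(x, type = "level"))
)[["elapsed"]]

# Option 3: drop the ts frequency so a non-seasonal model is chosen by default.
x_plain <- as.numeric(x)

print(c(capped = t_capped, level = t_level))

# The same arguments passed through na_kalman, only if imputeTS is installed.
if (requireNamespace("imputeTS", quietly = TRUE)) {
  filled <- suppressWarnings(
    imputeTS::na_kalman(x, model = "StructTS", type = "level")
  )
}
```

Whether any of these is acceptable depends on how much the seasonal structure matters for the series being imputed.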