SteffenMoritz/imputeTS

na.ma performance issue with trailing NA's

trafficonese opened this issue · 4 comments

I am currently facing a special case for the na.ma algorithm, where all NA values are located at the end of the vector.

This causes the algorithm to perform much slower, since it mostly spends the time in this while loop to increase ktemp and then ends up filling the next ~1000 trailing NA's with the same value and then with 0.

Here is a small example of that problem:

data <- c(sample(1:3000, 200, T), rep(NA, 2000))
res <- imputeTS::na.ma(data)
plot(res)

Could/Should we maybe check if all remaining values are NA and then just assign them the weighted mean of the k last non-NA values?

Or do you maybe have a better idea how this problem can be bypassed?

Hello trafficonese,
thanks for your input! Always great to get some feedback. Your suggestion sounds like a good idea.
(also sounds like it could be implemented quite fast with one of the next updates)

In general, the na.ma() is not too much optimized for speed in comparison to na.locf / na.interpolation.
Unfortunately loops are really slow in R (as you mentioned). While you don't recognize this for small time series (due to overhead over other things e.g. initialisations,...) - for large series this is a huge factor.

For na.locf the computation is made via c++ code - which is way faster.

See a performance comparison:
microbenchmark(imputeTS::na.locf(tsNH4), imputeTS::na.interpolation(tsNH4) , imputeTS::na.ma(tsNH4), 100)

Unit: nanoseconds
expr mean
imputeTS::na.locf(tsNH4) 201615.89
imputeTS::na.interpolation(tsNH4) 839082.87
imputeTS::na.ma(tsNH4) 240465157.81

So in the long term it might be a good idea, to also implement this function via C++ Code.
(still won't be as fast as locf since more computation is involved but I'd expect it to be factor 10-100 faster)
But unfortunately this will take a little bit longer (since I am no C++ pro).

Another mores sophisticated option could be to replace the trailing NAs with a certain pattern that is found in the previous values, but this would probably go against the idea of a moving average imputation.
If the data would include a timestamp aswell, the trailing NAs could also be replaced by the mean of a certain timestamp of previous values, but this could probably be another function on its own.

I ll try to work on a PR next week with the initial idea and in the long run, I might be able to help with translating this function to C++ (also no expert ;) )

Well, yeah you are right, would rather be a algorithm of it's own.
I think a lot of people like the na.ma and the na.mean, na.interpolation algorithms, because they understand directly and (more or less) easily how they work. Probably would rather be confusing, if I would add special features there.

But adding a more pattern is also something, I still have on my TODO List (there is even an issue):

Add pattern based imputation algorithm ToDo List enhancement
#3 opened on 4 Jun 2016 by SteffenMoritz
(#3)

But it hasn't come very far yet - beyond the idea - I didn't think deeper, how exactly it could look like.

A PR would be really great :-)
And you would be a hero, if you can help translating this function to C++!
(but be warned, the R implementation gave me some headaches ... it seemed so easy, but I had to do at least 5 bugfixes, because there was always some border case I didn't consider beforehand)

This is how the C++ looks like for the na.locf function:
https://github.com/SteffenMoritz/imputeTS/blob/master/src/locf.cpp

on CRAN!
THX