SteffenMoritz/imputeTS

How to choose the best algorithm?

MorganePhilipp opened this issue · 2 comments

Dear @SteffenMoritz ,
I am an AgroParisTech student, and together with four of my classmates I am working on a project involving time series with many missing values.
We were thinking of using the package you developed to impute the missing values (it is really very useful and easy to handle :) ), but we don't know which method to use. Do you know whether a comparative study has already been conducted on the different imputation algorithms the package offers?

Kind regards,
Morgane Philipp

Hello @MorganePhilipp ,

Choosing the right algorithm depends (as in other machine learning tasks) heavily on the data you have. In general, na_kalman is quite a good choice, but it is also the most computationally intensive one, so it will not be practical for every dataset.

If you go through the list of papers citing imputeTS (https://scholar.google.com/scholar?um=1&ie=UTF-8&lr&cites=16876364094503492919) you will find some works that compare these algorithms. Unfortunately, none of them is a general algorithm comparison across different kinds of scenarios and data. Most of these comparisons come from authors introducing and benchmarking their own algorithms. These papers will therefore (understandably) mostly highlight the specific kind of data/scenario where their new algorithm might be an improvement.

In general, you have to benchmark different algorithms on your own data to find the one that suits it best.
The missing data are ultimately lost, so you will never have a ground truth for them. You therefore have to find out from your existing data which imputation method works best for your dataset.

The following procedure can be applied:

  1. Artificially create NAs in your existing data (for which you then know the ground truth)
    Here it is important to simulate the occurrence of NAs in a way that resembles their real occurrence.
    If you always have long NA gaps with multiple consecutive missing values, simulate long gaps.

  2. Apply different imputation methods on the time series with the artificially missing values.
    Since you have the ground truth for your artificially introduced missing values, you can calculate an error/performance metric such as RMSE or MAE.

  3. Do multiple simulation runs, so that the artificially introduced missing values are placed at several different locations. Each time calculate error metrics for the imputation algorithms you want to compare.

  4. Create an overall results table
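
As a rough illustration of these four steps, here is a minimal sketch in plain Python (standard library only). It is not imputeTS code: the three imputers (mean, last observation carried forward, linear interpolation) are simple stand-ins for whatever methods you want to compare, such as na_kalman in R. The names `make_gap`, `benchmark` and `METHODS`, the synthetic series, and the choice of one interior gap per run are all assumptions made for the example.

```python
import math
import random


def make_gap(n, gap_len, rng):
    """Step 1: pick one contiguous run of indices to blank out.

    The first and last points are always kept, so every gap is interior.
    """
    start = rng.randrange(1, n - gap_len)
    return list(range(start, start + gap_len))


def impute_mean(x):
    """Replace every missing value with the mean of the observed values."""
    observed = [v for v in x if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in x]


def impute_locf(x):
    """Carry the last observed value forward into each gap."""
    out, last = [], None
    for v in x:
        last = v if v is not None else last
        out.append(last)
    return out


def impute_linear(x):
    """Interpolate linearly between the observed neighbours of each gap."""
    out = list(x)
    known = [i for i, v in enumerate(x) if v is not None]
    for a, b in zip(known, known[1:]):
        for i in range(a + 1, b):
            w = (i - a) / (b - a)
            out[i] = x[a] + w * (x[b] - x[a])
    return out


def benchmark(series, methods, gap_len, runs, seed=0):
    """Steps 2 and 3: impute artificial gaps repeatedly and average the errors."""
    rng = random.Random(seed)
    errors = {name: {"rmse": [], "mae": []} for name in methods}
    for _ in range(runs):
        gap = make_gap(len(series), gap_len, rng)
        masked = list(series)
        for i in gap:
            masked[i] = None  # artificial NA with known ground truth
        for name, impute in methods.items():
            filled = impute(masked)
            diffs = [filled[i] - series[i] for i in gap]
            rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
            mae = sum(abs(d) for d in diffs) / len(diffs)
            errors[name]["rmse"].append(rmse)
            errors[name]["mae"].append(mae)
    return {
        name: {metric: sum(v) / len(v) for metric, v in e.items()}
        for name, e in errors.items()
    }


METHODS = {"mean": impute_mean, "locf": impute_locf, "linear": impute_linear}

# Synthetic series (trend plus seasonality), purely illustrative.
series = [0.5 * t + 5.0 * math.sin(2 * math.pi * t / 12) for t in range(120)]
results = benchmark(series, METHODS, gap_len=6, runs=200)

# Step 4: overall results table, best RMSE first.
print(f"{'method':<8}{'RMSE':>10}{'MAE':>10}")
for name, r in sorted(results.items(), key=lambda kv: kv[1]["rmse"]):
    print(f"{name:<8}{r['rmse']:>10.3f}{r['mae']:>10.3f}")
```

For real data you would replace `series` with a complete stretch of your own time series and set `gap_len` (or a more elaborate gap generator) so that the artificial NAs resemble the gaps you actually observe.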

Dear @SteffenMoritz,

Thank you very much for your detailed answer, which confirms our initial intuition, and for the link to the list of publications.

We were indeed thinking of using a method of this type. Thank you very much for giving us all the steps. We will implement this with our data. :)

Kind regards,
Morgane PHILIPP