iskandr/fancyimpute

NAs of one feature are replaced with the same value

Make42 opened this issue · 10 comments

When using the imputers on a 2D dataset, all NAs of one feature get replaced by the same value. I doubt this is correct and remember that version 0.1.0 imputed missing values individually. What is going on?

Try the following:

import pandas as pd
import numpy as np
import fancyimpute as impute
mydata = np.array([[1.5       ,        np.nan],
       [0.        ,        np.nan],
       [       np.nan, 5.        ],
       [       np.nan, 8.        ],
       [8.        ,        np.nan],
       [       np.nan, 8.        ],
       [       np.nan, 6.        ],
       [2.5       ,        np.nan],
       [0.78175529,        np.nan],
       [9.61898081,        np.nan],
       [8.17303221,        np.nan],
       [3.99782649,        np.nan],
       [4.31413827,        np.nan],
       [2.63802917,        np.nan],
       [       np.nan, 5.79704587],
       [1.44954798,        np.nan],
       [3.50952381,        np.nan],
       [0.75966692,        np.nan],
       [       np.nan, 2.39952526],
       [       np.nan, 9.0271611 ],
       [4.90864092,        np.nan],
       [       np.nan, 3.69246781],
       [7.80252068,        np.nan],
       [       np.nan, 0.96454525],
       [9.42050591,        np.nan],
       [0.59779543,        np.nan],
       [       np.nan, 0.15403438],
       [       np.nan, 6.49115475],
       [6.47745963,        np.nan],
       [2.96320806,        np.nan],
       [6.86775433,        np.nan],
       [6.25618561,        np.nan],
       [       np.nan, 7.75712679],
       [4.35858589,        np.nan],
       [5.08508655,        np.nan],
       [7.94831417,        np.nan],
       [8.11580458,        np.nan],
       [9.39001562,        np.nan],
       [       np.nan, 5.87044705],
       [3.0124633 ,        np.nan],
       [       np.nan, 1.9476429 ],
       [1.70708047,        np.nan],
       [       np.nan, 9.23379642],
       [       np.nan, 9.04880969],
       [4.38869973,        np.nan],
       [4.08719846,        np.nan],
       [       np.nan, 7.1121578 ],
       [       np.nan, 2.96675873],
       [       np.nan, 5.07858285],
       [       np.nan, 8.01014623],
       [9.28854139,        np.nan],
       [       np.nan, 2.3728358 ],
       [       np.nan, 5.46805719],
       [2.31594387,        np.nan],
       [6.79135541,        np.nan],
       [9.87982003,        np.nan],
       [9.13286828,        np.nan],
       [       np.nan, 3.3535684 ],
       [       np.nan, 7.21227499],
       [6.53757349,        np.nan],
       [7.15037078,        np.nan],
       [3.34163053,        np.nan],
       [0.30540946,        np.nan],
       [4.79922141,        np.nan],
       [6.1766639 ,        np.nan],
       [5.76721516,        np.nan],
       [       np.nan, 0.28674152],
       [       np.nan, 9.7868065 ],
       [       np.nan, 4.71088375],
       [6.81971904,        np.nan],
       [       np.nan, 0.96730026],
       [8.17547092,        np.nan],
       [       np.nan, 5.18594943],
       [       np.nan, 8.00330575],
       [4.32391504,        np.nan],
       [       np.nan, 1.73388613],
       [8.31379743,        np.nan],
       [       np.nan, 5.26875831],
       [6.56859891,        np.nan],
       [4.3165117 ,        np.nan],
       [       np.nan, 1.06216345],
       [1.98118403,        np.nan],
       [       np.nan, 9.2033204 ],
       [7.37858096,        np.nan],
       [5.47870901,        np.nan],
       [9.83052466,        np.nan],
       [       np.nan, 5.39126465],
       [       np.nan, 1.78132454],
       [9.99080395,        np.nan],
       [5.61199793,        np.nan],
       [       np.nan, 3.68916546],
       [9.81637951,        np.nan],
       [       np.nan, 3.7627221 ],
       [4.28252993,        np.nan],
       [       np.nan, 2.2618768 ],
       [5.82986383,        np.nan],
       [       np.nan, 2.6528091 ],
       [       np.nan, 7.30248792],
       [       np.nan, 1.07769015],
       [       np.nan, 8.17760559]])
data_completed = impute.IterativeImputer().fit_transform(mydata)

Ah, I think I know what's going on. IterativeImputer tries to impute missing values in each column/feature from known values in all the other columns. What you have there is a particular example where that's impossible: there is not a single row in which both columns are non-NaN, so there is no way to learn a model from X1_known to X2_missing and vice versa. If you, for example, change the first row to [1.5, 1.5] it will suddenly start doing something. As is, it's just giving you the initial imputation and then failing to iteratively impute further.
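
To illustrate that (a sketch based on the suggestion above, reusing the mydata array from the example): giving the two columns a single overlapping row lets the imputer learn a relationship between them, and the imputed values should start to vary per row.

import numpy as np
import fancyimpute as impute

mydata_overlap = mydata.copy()
mydata_overlap[0] = [1.5, 1.5]  # one row in which both features are observed

completed = impute.IterativeImputer().fit_transform(mydata_overlap)
print(np.unique(np.round(completed[:, 1], 3)))  # the imputed values should no longer all be identical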

So this isn't a bug. I'll close the issue, but feel free to chat further if you'd like.

Does that mean that iterative imputation is not able to handle datasets in which, in every row, all values but one are missing?

The same limitation seems to be the case for KNN and SoftImpute as well.

In fancyimpute version 0.1.0, these kinds of datasets could be handled by all of these - why, what changed? (This does not seem like an improvement...)

I am not sure why an earlier version was able to handle it. It doesn't make sense that it could, given how IterativeImputer works. I probably won't make time to investigate this oddity. By the way, there is now a more developed, less buggy version of IterativeImputer in scikit-learn master. I suspect it will behave the same as fancyimpute's, but you should give it a try.

Does that mean that iterative imputation is not able to handle datasets in which, in every row, all values but one are missing? I think that's true. It needs some examples (rows) where more than one value is NOT missing, so it can learn a function between them.

The kind of dataset you describe will probably be very hard to impute in any case. I can't think of any reasonable way to do it at all.

I think I understand what you mean.

After looking into it some more, I have doubts about your answer after all:

In the video https://www.youtube.com/watch?v=zX-pacwVyvU I noticed that the missing values are first filled with an initial imputation, and the subsequent iterative linear models are fit on those filled values. Thus it should not matter that all but one value is missing in each row of my dataset.

Then I noticed that in both the old and the new version it is possible to specify what this initial imputation should be (parameter initial_strategy in the current version, init_fill_method in the old one).
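
For reference, in the scikit-learn-style API that looks like this (just a sketch; "median" is one of the supported strategies besides the default "mean"):

from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(initial_strategy="median")  # how NaNs are filled before the iterations start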

Ah, is it because all the initially filled values of a feature are the same (the mean)? Then the regression models have no slope and only an intercept, and thus it stays this way?
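
That collapse is easy to see in isolation with a minimal sketch (self-contained toy data; BayesianRidge is the default regressor of scikit-learn's IterativeImputer): fitting on a predictor column that was filled with a single constant gives essentially zero slope, so every prediction equals the intercept.

import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
y_observed = rng.uniform(0, 10, size=50)  # the known values of one feature
x_filled = np.full((50, 1), 5.0)          # the other feature, every entry filled with its column mean

model = BayesianRidge().fit(x_filled, y_observed)
print(model.coef_, model.intercept_)      # coef_ is ~0, intercept_ is ~mean(y_observed)
print(model.predict(np.array([[1.0], [7.5], [9.0]])))  # three identical predictions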

Shouldn't we be able to rectify this by initializing with random values instead of the mean? For example, values drawn from a normal distribution whose mean and variance are the mean and variance of the respective variable.
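
A hypothetical sketch of that suggested initialization (random_initial_fill is not a library function, just an illustration of the idea):

import numpy as np

def random_initial_fill(X, seed=None):
    """Replace each NaN with a draw from N(column mean, column variance)."""
    rng = np.random.default_rng(seed)
    X_filled = X.copy()
    for j in range(X.shape[1]):
        missing = np.isnan(X[:, j])
        mu = np.nanmean(X[:, j])
        sigma = np.nanstd(X[:, j])  # standard deviation, so the draws have the column's variance
        X_filled[missing, j] = rng.normal(mu, sigma, size=missing.sum())
    return X_filled

initial_fill = random_initial_fill(mydata, seed=0)  # mydata as defined in the example above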


In fancyimpute 0.1.0's MICE there is a mechanism that chooses a random subset of predictor variables, instead of using all the other variables, to fit the Bayesian ridge regression model; but that is not the default I have been using, and it would not make sense for my 2D dataset anyway. However, there is also the default mechanism of not using the prediction of the linear regression model directly as the imputation, but instead using the mean and variance of the Bayesian ridge regression as the mean and variance of a normal distribution from which a value is drawn. That drawn value is the imputation.
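
A rough, hand-rolled sketch of that posterior-draw step using scikit-learn's BayesianRidge (the training data here is invented purely for illustration; return_std=True returns the predictive mean and standard deviation):

import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X_observed = rng.uniform(0, 10, size=(40, 1))               # predictor values where the target is known
y_observed = 0.8 * X_observed[:, 0] + rng.normal(0, 1, 40)  # known target values
X_missing = rng.uniform(0, 10, size=(5, 1))                 # predictor values where the target is missing

model = BayesianRidge().fit(X_observed, y_observed)
mean, std = model.predict(X_missing, return_std=True)       # predictive mean and standard deviation
imputed = rng.normal(mean, std)                             # draw the imputations instead of using the mean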

Consecutive models (in the iterative imputation scheme) are built on the respective previously imputed values instead of the initial mean fill.

Finally, the imputed values from all these models are averaged to produce the final imputed values.

The new fancyimpute seems to work the same way, except that instead of averaging at the end, simply the last imputation is used. Another important difference is that, by default, the mean (instead of a draw from the normal distribution) is used as the imputation value. To get the behavior of fancyimpute 0.1.0, we need to set the parameter sample_posterior=True.
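
For example, with the scikit-learn IterativeImputer mentioned earlier in the thread (a sketch, assuming the mydata array from the example; the number of rounds m and the seeds are arbitrary), one could draw several completed matrices and average them by hand to mimic the old pooling:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

m = 10  # number of imputation rounds to pool
draws = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(mydata)
    for seed in range(m)
]
pooled = np.mean(draws, axis=0)  # element-wise average of the m completed matrices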

With sample_posterior=True, I indeed no longer get the same value (the mean) for all imputed entries of a feature, as described in the original question.

This sample_posterior=True behavior is similar to what I suggested above, only that the random draw is done during the iterations, not for the initialization.