iskandr/fancyimpute

NAs of one feature are replaced with the same value

Make42 opened this issue · 10 comments

When using the imputers on a 2D dataset, all NAs of one feature get replaced by the same value. I doubt this is correct and remember that version 0.1.0 imputed missing values individually. What is going on?

Try the following:

import pandas as pd
import numpy as np
import fancyimpute as impute
mydata = np.array([[1.5       ,        np.nan],
       [0.        ,        np.nan],
       [       np.nan, 5.        ],
       [       np.nan, 8.        ],
       [8.        ,        np.nan],
       [       np.nan, 8.        ],
       [       np.nan, 6.        ],
       [2.5       ,        np.nan],
       [0.78175529,        np.nan],
       [9.61898081,        np.nan],
       [8.17303221,        np.nan],
       [3.99782649,        np.nan],
       [4.31413827,        np.nan],
       [2.63802917,        np.nan],
       [       np.nan, 5.79704587],
       [1.44954798,        np.nan],
       [3.50952381,        np.nan],
       [0.75966692,        np.nan],
       [       np.nan, 2.39952526],
       [       np.nan, 9.0271611 ],
       [4.90864092,        np.nan],
       [       np.nan, 3.69246781],
       [7.80252068,        np.nan],
       [       np.nan, 0.96454525],
       [9.42050591,        np.nan],
       [0.59779543,        np.nan],
       [       np.nan, 0.15403438],
       [       np.nan, 6.49115475],
       [6.47745963,        np.nan],
       [2.96320806,        np.nan],
       [6.86775433,        np.nan],
       [6.25618561,        np.nan],
       [       np.nan, 7.75712679],
       [4.35858589,        np.nan],
       [5.08508655,        np.nan],
       [7.94831417,        np.nan],
       [8.11580458,        np.nan],
       [9.39001562,        np.nan],
       [       np.nan, 5.87044705],
       [3.0124633 ,        np.nan],
       [       np.nan, 1.9476429 ],
       [1.70708047,        np.nan],
       [       np.nan, 9.23379642],
       [       np.nan, 9.04880969],
       [4.38869973,        np.nan],
       [4.08719846,        np.nan],
       [       np.nan, 7.1121578 ],
       [       np.nan, 2.96675873],
       [       np.nan, 5.07858285],
       [       np.nan, 8.01014623],
       [9.28854139,        np.nan],
       [       np.nan, 2.3728358 ],
       [       np.nan, 5.46805719],
       [2.31594387,        np.nan],
       [6.79135541,        np.nan],
       [9.87982003,        np.nan],
       [9.13286828,        np.nan],
       [       np.nan, 3.3535684 ],
       [       np.nan, 7.21227499],
       [6.53757349,        np.nan],
       [7.15037078,        np.nan],
       [3.34163053,        np.nan],
       [0.30540946,        np.nan],
       [4.79922141,        np.nan],
       [6.1766639 ,        np.nan],
       [5.76721516,        np.nan],
       [       np.nan, 0.28674152],
       [       np.nan, 9.7868065 ],
       [       np.nan, 4.71088375],
       [6.81971904,        np.nan],
       [       np.nan, 0.96730026],
       [8.17547092,        np.nan],
       [       np.nan, 5.18594943],
       [       np.nan, 8.00330575],
       [4.32391504,        np.nan],
       [       np.nan, 1.73388613],
       [8.31379743,        np.nan],
       [       np.nan, 5.26875831],
       [6.56859891,        np.nan],
       [4.3165117 ,        np.nan],
       [       np.nan, 1.06216345],
       [1.98118403,        np.nan],
       [       np.nan, 9.2033204 ],
       [7.37858096,        np.nan],
       [5.47870901,        np.nan],
       [9.83052466,        np.nan],
       [       np.nan, 5.39126465],
       [       np.nan, 1.78132454],
       [9.99080395,        np.nan],
       [5.61199793,        np.nan],
       [       np.nan, 3.68916546],
       [9.81637951,        np.nan],
       [       np.nan, 3.7627221 ],
       [4.28252993,        np.nan],
       [       np.nan, 2.2618768 ],
       [5.82986383,        np.nan],
       [       np.nan, 2.6528091 ],
       [       np.nan, 7.30248792],
       [       np.nan, 1.07769015],
       [       np.nan, 8.17760559]])
data_completed = impute.IterativeImputer().fit_transform(mydata)

Ah, I think I know what's going on. IterativeImputer tries to impute missing values in each column/feature from known values in all the other columns. What you have there is a particular example where that's impossible: there is not a single row in which both columns are non-NaN, so there is no way to learn a model from X1_known to X2_missing and vice versa. If you, for example, change the first row to [1.5, 1.5] it will suddenly start doing something. As is, it's just giving you the initial imputation and then failing to iteratively impute further.
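
To illustrate that (a sketch based on the suggestion above, reusing the mydata array from the example): giving the two columns a single overlapping row lets the imputer learn a relationship between them, and the imputed values should start to vary per row.

import numpy as np
import fancyimpute as impute

mydata_overlap = mydata.copy()
mydata_overlap[0] = [1.5, 1.5]  # one row in which both features are observed

completed = impute.IterativeImputer().fit_transform(mydata_overlap)
print(np.unique(np.round(completed[:, 1], 3)))  # the imputed values should no longer all be identical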

So this isn't a bug. I'll close the issue, but feel free to chat further if you'd like.

Does that mean that iterative imputation is not able to handle datasets in which, in every row, all values but one are missing?

The same limitation seems to be the case for KNN and SoftImpute as well.

In fancyimpute version 0.1.0, these kinds of datasets could be handled by all of these - why, what changed? (This does not seem like an improvement...)

I am not sure why an earlier version was able to handle it. It doesn't make sense that it could, given how IterativeImputer works. I probably won't make time to investigate this oddity. By the way, there is now a more developed, less buggy version of IterativeImputer in scikit-learn master. I suspect it will behave the same as fancyimpute's, but you should give it a try.

Does that mean that iterative imputation is not able to handle datasets in which, in every row, all values but one are missing? I think that's true. It needs some examples (rows) where more than one value is NOT missing, so it can learn a function between them.

The kind of dataset you describe will probably be very hard to impute in any case. I can't think of any reasonable way to do it at all.

I think I understand what you mean.

After looking into it some more, I have doubts about your answer after all:

In the video https://www.youtube.com/watch?v=zX-pacwVyvU I noticed that the missing values are first filled with an initial imputation, and the subsequent iterative linear models are fit on those filled values. Thus it should not matter that all but one value is missing in each row of my dataset.

Then I noticed that in both the old and the new version it is possible to specify what this initial imputation should be (parameter initial_strategy in the current version, init_fill_method in the old one).
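
For reference, in the scikit-learn-style API that looks like this (just a sketch; "median" is one of the supported strategies besides the default "mean"):

from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(initial_strategy="median")  # how NaNs are filled before the iterations start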

Ah, is it because all the initially filled values of a feature are the same (the mean)? Then the regression models have no slope and only an intercept, and thus it stays this way?
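
That collapse is easy to see in isolation with a minimal sketch (self-contained toy data; BayesianRidge is the default regressor of scikit-learn's IterativeImputer): fitting on a predictor column that was filled with a single constant gives essentially zero slope, so every prediction equals the intercept.

import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
y_observed = rng.uniform(0, 10, size=50)  # the known values of one feature
x_filled = np.full((50, 1), 5.0)          # the other feature, every entry filled with its column mean

model = BayesianRidge().fit(x_filled, y_observed)
print(model.coef_, model.intercept_)      # coef_ is ~0, intercept_ is ~mean(y_observed)
print(model.predict(np.array([[1.0], [7.5], [9.0]])))  # three identical predictions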

Shouldn't we be able to rectify this by initializing with random values instead of the mean? For example, values drawn from a normal distribution whose mean and variance are the mean and variance of the respective variable.
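
A hypothetical sketch of that suggested initialization (random_initial_fill is not a library function, just an illustration of the idea):

import numpy as np

def random_initial_fill(X, seed=None):
    """Replace each NaN with a draw from N(column mean, column variance)."""
    rng = np.random.default_rng(seed)
    X_filled = X.copy()
    for j in range(X.shape[1]):
        missing = np.isnan(X[:, j])
        mu = np.nanmean(X[:, j])
        sigma = np.nanstd(X[:, j])  # standard deviation, so the draws have the column's variance
        X_filled[missing, j] = rng.normal(mu, sigma, size=missing.sum())
    return X_filled

initial_fill = random_initial_fill(mydata, seed=0)  # mydata as defined in the example above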


In fancyimpute 0.1.0's MICE there is a mechanism that chooses a random subset of predictor variables, instead of using all the other variables, to fit the Bayesian ridge regression model; but that is not the default I have been using, and it would not make sense for my 2D dataset anyway. However, there is also the default mechanism of not using the prediction of the linear regression model directly as the imputation, but instead using the mean and variance of the Bayesian ridge regression as the mean and variance of a normal distribution from which a value is drawn. That drawn value is the imputation.
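
A rough, hand-rolled sketch of that posterior-draw step using scikit-learn's BayesianRidge (the training data here is invented purely for illustration; return_std=True returns the predictive mean and standard deviation):

import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X_observed = rng.uniform(0, 10, size=(40, 1))               # predictor values where the target is known
y_observed = 0.8 * X_observed[:, 0] + rng.normal(0, 1, 40)  # known target values
X_missing = rng.uniform(0, 10, size=(5, 1))                 # predictor values where the target is missing

model = BayesianRidge().fit(X_observed, y_observed)
mean, std = model.predict(X_missing, return_std=True)       # predictive mean and standard deviation
imputed = rng.normal(mean, std)                             # draw the imputations instead of using the mean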

Consecutive models (in the iterative imputation scheme) are built on the respective previously imputed values instead of the initial mean fill.

Finally, the imputed values from all these models are averaged to produce the final imputed values.

The new fancyimpute seems to work the same way, except that instead of averaging at the end, simply the last imputation is used. Another important difference is that, by default, the mean (instead of a draw from the normal distribution) is used as the imputation value. To get the behavior of fancyimpute 0.1.0, we need to set the parameter sample_posterior=True.
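
For example, with the scikit-learn IterativeImputer mentioned earlier in the thread (a sketch, assuming the mydata array from the example; the number of rounds m and the seeds are arbitrary), one could draw several completed matrices and average them by hand to mimic the old pooling:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

m = 10  # number of imputation rounds to pool
draws = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(mydata)
    for seed in range(m)
]
pooled = np.mean(draws, axis=0)  # element-wise average of the m completed matrices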

With sample_posterior=True, I indeed no longer get the same value (the mean) for all imputed entries of a feature, as described in the original question.

This sample_posterior=True behavior is similar to what I suggested above, only that the random draw is done during the iterations, not for the initialization.