iskandr/fancyimpute

Impute error

hugo-pires opened this issue · 19 comments

Hello and congratulations

Is there any way to measure impute error, using for instance, some kind of train/test split?

Thank you

We don't have any built-in tools. The way you'd do it yourself is to NaN-out some percentage of the known values, run an imputation, and then see how accurate it is for the values you removed. It would probably be smart to add some of that as dedicated functionality, but we're not really adding anything any more. Just basic life-support.

I have already done that. Does the imputation keep the initial known values? I ask because I get an MSE of 0 using the tutorial shown on the homepage.

Erm, hard to tell without a code example. Feel free to paste something self-contained and I'll take a gander.

missing_mask = np.isnan(df.values)
masked_filled = np.ma.masked_where(missing_mask, X_filled_nnm)
masked_original = np.ma.masked_where(missing_mask, df.values)
nnm_mse = ((masked_filled - masked_original) ** 2).mean()
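
A note on why this can come out as exactly 0: np.ma.masked_where(missing_mask, ...) masks the entries where missing_mask is True, so the comparison only covers values that were known all along. If the imputer returns known values unchanged (an assumption here, not confirmed in this thread), the difference on those entries is zero. A minimal sketch, using a hypothetical column-mean fill as a stand-in for the real imputer:

```python
import numpy as np

# Small matrix with a few genuinely missing entries.
X = np.arange(24, dtype=float).reshape(6, 4)
X[[0, 2, 5], [1, 3, 0]] = np.nan

missing_mask = np.isnan(X)

# Stand-in imputer (hypothetical, not fancyimpute): fill NaNs with column
# means and leave every known value untouched.
X_filled = np.where(missing_mask, np.nanmean(X, axis=0), X)

# Same computation as in the post: masked_where hides the entries where
# missing_mask is True, so only the known entries are compared.
masked_filled = np.ma.masked_where(missing_mask, X_filled)
masked_original = np.ma.masked_where(missing_mask, X)
mse_known = ((masked_filled - masked_original) ** 2).mean()
```

Here mse_known is 0 regardless of how good the filled-in values are, because none of them enter the mean.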

Huh, I don't really use the np.ma stuff. Here is what I would do to be sure:

X = df.values  # has no missing values
missing_mask = np.random.randint([0, 1], shape=X.shape) # 50% are going to be missing
X_missing = X.copy()
X_missing[missing_mask.astype(bool)] = np.nan  # index with a boolean mask; an integer 0/1 array would select rows 0 and 1 instead

X_imputed = whatever_imputer_you_want().fit_transform(X_missing)
nnm_mse = ((X - X_imputed ) ** 2) / np.sum(missing_mask) # normalized mean by number of missing values

Hello again

I've changed the second line to

missing_mask = np.random.randint(0, high=2, size=X.shape) # 50% are going to be missing

due to a type error. I guess the main goal was to generate a numpy array of 0's and 1's.

Is nnm_mse supposed to be a scalar, the MSE of the imputed matrix versus the original one?

Thank you

"Generate a numpy array with 0s and 1s" - yes.

And, yes, nnm_mse is a scalar value of the MSE of the imputed values (50% of original).
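
One caveat with a 0/1 integer array: NumPy treats an integer array used as an index as row selection, not as a mask, so the array has to be converted to boolean before it is used to blank out entries. A small sketch of the difference (the array names are illustrative only):

```python
import numpy as np

X = np.arange(12, dtype=float).reshape(4, 3)
int_mask = np.array([[0, 1, 0],
                     [1, 0, 0],
                     [0, 0, 1],
                     [1, 1, 0]])

# Integer indexing: the 0s and 1s are read as row numbers, so rows 0 and 1
# are wiped entirely and rows 2 and 3 are never touched.
X_int = X.copy()
X_int[int_mask] = np.nan

# Boolean indexing: exactly the positions holding a 1 are set to NaN.
X_bool = X.copy()
X_bool[int_mask.astype(bool)] = np.nan
```

Drawing the mask as boolean in the first place, for example by comparing uniform random numbers against a threshold, avoids the conversion.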

Thank you very much for your attention. The problem is that nnm_mse is a numpy array. Should I take another sum?

By the way, is there any MSE value that you consider as "good"?

Oh yeah, sorry, that's a bug on my part. It should be:

nnm_mse = np.sum((X - X_imputed ) ** 2) / np.sum(missing_mask)

MSE is problem dependent. There's no universal good or bad unfortunately.
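
Putting the pieces so far together, a self-contained sketch of the hold-out check might look like the following. It assumes X is fully observed, and mean_impute is a hypothetical stand-in rather than a fancyimpute class; any imputer with the same NaN-in, filled-matrix-out behaviour could be swapped in.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))              # fully observed ground truth

# Boolean mask: True marks an entry to hide (roughly 50% of them).
missing_mask = rng.random(X.shape) < 0.5
X_missing = X.copy()
X_missing[missing_mask] = np.nan

def mean_impute(A):
    """Stand-in imputer: fill NaNs with column means, keep known values."""
    return np.where(np.isnan(A), np.nanmean(A, axis=0), A)

X_imputed = mean_impute(X_missing)

# Known entries are unchanged, so they contribute 0 to the sum; dividing by
# the number of hidden entries gives the MSE over the hidden entries only.
nnm_mse = np.sum((X - X_imputed) ** 2) / np.sum(missing_mask)
```

Since no MSE is good in absolute terms, the score of a trivial fill like this one can serve as a reference point when judging another imputer on the same mask.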

Sorry to disturb again

Since X has NaN values, np.sum((X - X_imputed) ** 2) has them too, so nnm_mse also comes out as NaN. Am I doing anything wrong?

PS - Why do you say X has no missing values, on code line 1? X is the matrix that I want to fill with imputed values.

Thank you

I guess I have to mask some of the initial known values in order to use them as test values.

I am using

nnm_mse = np.nansum((X - X_imputed) ** 2) / np.sum(~np.isnan(X))

Could this be a solution? I am now getting a real value.

Like this?

nnm_mse = np.nansum((X - X_imputed) ** 2) / np.sum(~np.isnan(X_missing))

No, you need just the parts of X_missing for which you have ground truth. I think it's

np.sum(missing_mask & ~np.isnan(X))

So, the places where missing_mask was set to 1 but the value wasn't originally missing.
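
For a matrix that already contains NaNs, the same idea can be sketched as below. The column-mean fill is again a hypothetical stand-in for the real imputer, and held_out is a name introduced here for the entries that were hidden and have a known true value.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
X[rng.random(X.shape) < 0.1] = np.nan        # genuinely missing, no ground truth

missing_mask = rng.random(X.shape) < 0.3     # candidate entries to hide
held_out = missing_mask & ~np.isnan(X)       # hidden AND originally known

X_missing = X.copy()
X_missing[held_out] = np.nan

# Stand-in imputer: column means for every NaN, known values kept as-is.
X_imputed = np.where(np.isnan(X_missing),
                     np.nanmean(X_missing, axis=0),
                     X_missing)

# Score only the entries whose true value is known.
nnm_mse = np.sum((X[held_out] - X_imputed[held_out]) ** 2) / np.sum(held_out)
```

With an imputer that leaves known values unchanged, this matches np.nansum((X - X_imputed) ** 2) / np.sum(held_out): the originally missing entries are skipped by nansum and the untouched known entries contribute zero.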

ok. thank you

When I was using SoftImpute, I noticed that the iteration output messages included an MAE value. Is it possible to get the same for KNN, for instance?
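
Whether KNN can report this itself is not settled in this thread. As a workaround, a hold-out MAE can be computed by hand for any imputer, in the same way as the MSE. A minimal sketch, where holdout_mae is a hypothetical helper and not a fancyimpute function:

```python
import numpy as np

def holdout_mae(X_true, X_imputed, held_out):
    """Mean absolute error over the entries marked True in held_out."""
    return np.mean(np.abs(X_true[held_out] - X_imputed[held_out]))

# Tiny worked example: two hidden entries, with errors of 0.5 and 1.0.
X_true = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
X_imputed = np.array([[1.0, 2.5],
                      [2.0, 4.0]])
held_out = np.array([[False, True],
                     [True, False]])

mae = holdout_mae(X_true, X_imputed, held_out)  # (0.5 + 1.0) / 2 = 0.75
```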