Impute error

Question

hugo-pires opened this issue 5 years ago · 19 comments

Hello and congratulations

Is there any way to measure imputation error using, for instance, some kind of train/test split?

Thank you

Answer 1 · 2019-09-17T17:18:27.000Z

We don't have any built-in tools. The way you'd do it yourself is to NaN-out some percentage of the known values, run an imputation, and then see how accurate it is for the values you removed. It would probably be smart to add some of that as dedicated functionality, but we're not really adding anything any more. Just basic life-support.
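
A minimal sketch of that procedure, assuming a fully observed NumPy matrix. The `column_mean_impute` function is a hypothetical stand-in for whichever imputer you want to evaluate, and the 20% hold-out rate is arbitrary.

```python
import numpy as np

def column_mean_impute(X_missing):
    # Hypothetical stand-in imputer: fill each NaN with its column's mean.
    col_means = np.nanmean(X_missing, axis=0)
    X_filled = X_missing.copy()
    rows, cols = np.where(np.isnan(X_filled))
    X_filled[rows, cols] = col_means[cols]
    return X_filled

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # toy data with no missing values

# Hide roughly 20% of the known values.
holdout_mask = rng.random(X.shape) < 0.2
X_missing = X.copy()
X_missing[holdout_mask] = np.nan

X_imputed = column_mean_impute(X_missing)

# Score only the entries that were deliberately removed.
holdout_mse = np.mean((X[holdout_mask] - X_imputed[holdout_mask]) ** 2)
print(holdout_mse)
```

Swapping `column_mean_impute` for a real imputer's `fit_transform` should leave the rest unchanged.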

Answer 2 · 2019-09-17T17:35:42.000Z

I have already done that. Does the imputation keep the initial known values? If so, that would explain why I get an MSE of 0 when following the tutorial shown on the homepage.

Answer 3 · 2019-09-17T17:37:41.000Z

Erm hard to tell without a code example. Feel free to paste something self-contained and I'll take a gander.

Answer 4 · 2019-09-17T18:15:52.000Z

missing_mask = np.isnan(df.values)
masked_filled = np.ma.masked_where(missing_mask, X_filled_nnm)
masked_original = np.ma.masked_where(missing_mask, df.values)
nnm_mse = ((masked_filled - masked_original) ** 2).mean()
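
One possible reading of the zero MSE, offered as an assumption since `X_filled_nnm` isn't shown: `np.ma.masked_where(condition, a)` hides the entries where the condition is True, so the snippet above excludes the originally missing cells and compares only the known ones, which most imputers leave unchanged. A toy illustration with made-up numbers:

```python
import numpy as np

original = np.array([[1.0, np.nan], [3.0, 4.0]])
filled = np.array([[1.0, 2.5], [3.0, 4.0]])   # made-up imputation result

missing_mask = np.isnan(original)

# masked_where hides entries where the mask is True, i.e. the imputed cell.
masked_filled = np.ma.masked_where(missing_mask, filled)
masked_original = np.ma.masked_where(missing_mask, original)

# Only the three known cells are compared, and they are identical.
mse_known_cells = ((masked_filled - masked_original) ** 2).mean()
print(mse_known_cells)
```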

Answer 5 · 2019-09-17T19:45:09.000Z

Huh, I don't really use the np.ma stuff. Here is what I would do to be sure:

X = df.values  # has no missing values
missing_mask = np.random.randint([0, 1], shape=X.shape) # 50% are going to be missing
X_missing = X.copy()
X_missing[missing_mask] = np.nan

X_imputed = whatever_imputer_you_want().fit_transform(X_missing)
nnm_mse = ((X - X_imputed ) ** 2) / np.sum(missing_mask) # normalized mean by number of missing values

Answer 6 · 2019-09-19T15:55:47.000Z

Hello again

I've changed the second line to

missing_mask = np.random.randint(0, high=2, size=X.shape) # 50% are going to be missing

due to a type error. I guess the main goal was to generate a NumPy array of 0s and 1s.

Is nnm_mse supposed to be a scalar value, the MSE of the imputed matrix vs. the original one?

Thank you

Answer 7 · 2019-09-19T16:30:51.000Z

"Generate a numpy array with 0s and 1s" - yes.

And, yes, nnm_mse is a scalar value of the MSE of the imputed values (50% of original).
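
One caveat worth checking, sketched here on a toy array: an integer array of 0s and 1s used as an index triggers NumPy integer (fancy) indexing rather than boolean masking, so the 0s and 1s are read as row numbers. Converting to a boolean mask first avoids that.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(12, dtype=float).reshape(4, 3)

int_mask = rng.integers(0, 2, size=X.shape)   # 0s and 1s, integer dtype

# Integer indexing: the values 0 and 1 are treated as row numbers.
X_wrong = X.copy()
X_wrong[int_mask] = np.nan

# Boolean masking: only the cells flagged with 1 are replaced.
bool_mask = int_mask.astype(bool)
X_right = X.copy()
X_right[bool_mask] = np.nan

print(np.isnan(X_wrong).sum(), np.isnan(X_right).sum())
```

An alternative is to build the mask as boolean from the start, for example `rng.random(X.shape) < 0.5`.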

Answer 8 · 2019-09-19T16:37:27.000Z

Thank you very much for your attention. The issue is that nnm_mse is a NumPy array, not a scalar. Should I apply another sum?

Answer 9 · 2019-09-19T16:38:01.000Z

By the way, is there any MSE value that you consider as "good"?

Answer 10 · 2019-09-19T16:39:23.000Z

Oh yeah, sorry, that's a bug on my part. It should be:

nnm_mse = np.sum((X - X_imputed ) ** 2) / np.sum(missing_mask)

MSE is problem dependent. There's no universal good or bad unfortunately.

Answer 11 · 2019-09-20T08:13:21.000Z

Sorry to disturb again

Since X has NaN values, np.sum((X - X_imputed) ** 2) does too, so nnm_mse also comes out as NaN. Am I doing anything wrong?

PS - Why do you say X has no missing values in the first line of the code? X is the matrix that I want to fill with imputed values.

Thank you

Answer 12 · 2019-09-20T09:05:56.000Z

I guess I have to mask some of the initial known values in order to use them as test values.

Answer 13 · 2019-09-20T09:14:55.000Z

I am using

nnm_mse = np.nansum((X - X_imputed) ** 2) / np.sum(~np.isnan(X))

Could this be a solution? I am now getting a real value.

Answer 14 · 2019-09-20T14:04:11.000Z

Yeah, that's almost right. I think the denominator has to be different though. It's not just where X is not NaN, it's ALSO where the mask was 1.

Answer 15 · 2019-09-20T14:16:36.000Z

Like this?

nnm_mse = np.nansum((X - X_imputed) ** 2) / np.sum(~np.isnan(X_missing))

Answer 16 · 2019-09-20T14:30:51.000Z

No, you need just the parts of X_missing for which you have truth. I think it's

np.sum(missing_mask & ~np.isnan(X)). That is, the places where missing_mask was set to 1 and the value wasn't originally missing.
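
Putting the thread together, a hedged sketch for data that already contains NaNs: only cells that were known originally and then deliberately hidden can be scored. The column-mean fill is a stand-in for the real imputer (an assumption), and the 10% and 30% rates are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
X[rng.random(X.shape) < 0.1] = np.nan       # data that is already incomplete

missing_mask = rng.random(X.shape) < 0.3    # cells to hide for evaluation
scored = missing_mask & ~np.isnan(X)        # hidden AND truth is available

X_missing = X.copy()
X_missing[missing_mask] = np.nan

# Stand-in imputer (assumption): replace NaNs with column means.
col_means = np.nanmean(X_missing, axis=0)
X_imputed = np.where(np.isnan(X_missing), col_means, X_missing)

nnm_mse = np.sum((X[scored] - X_imputed[scored]) ** 2) / np.sum(scored)
print(nnm_mse)
```

Indexing with `scored` in the numerator keeps NaNs out of the sum, so `np.nansum` isn't needed; the two forms agree as long as the imputer leaves known values untouched.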

Answer 17 · 2019-09-20T14:35:37.000Z

ok. thank you

Answer 18 · 2019-09-21T16:17:15.000Z

While using SoftImpute, I noticed a MAE value in the per-iteration output messages. Is it possible to have that for KNN, for instance?

Answer 19 · 2019-09-21T18:20:23.000Z

We are not adding more functionality to the package. It's only in basic support mode.
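
In the meantime, one option is to compute the MAE yourself on held-out cells. `holdout_mae` below is a hypothetical helper, not part of the package; it only needs the true matrix, the output of any imputer, and a boolean array marking the cells to score.

```python
import numpy as np

def holdout_mae(X_true, X_imputed, scored):
    """Mean absolute error over the boolean `scored` cells only."""
    scored = np.asarray(scored, dtype=bool)
    if not scored.any():
        raise ValueError("no held-out cells to score")
    return float(np.mean(np.abs(X_true[scored] - X_imputed[scored])))

# Toy check with made-up numbers.
X_true = np.array([[1.0, 2.0], [3.0, 4.0]])
X_imputed = np.array([[1.0, 2.5], [2.0, 4.0]])
scored = np.array([[False, True], [True, False]])
print(holdout_mae(X_true, X_imputed, scored))
```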
