IterativeImputer: Input contains NaN, infinity or a value too large for dtype('float64').

Question

joshuakoh1 opened this issue 5 years ago · 13 comments

I don't get this error when imputing the same dataset with KNN.

ValueError Traceback (most recent call last)
in
1 #KNN(k=3).fit_transform(training_nan)
----> 2 IterativeImputer().fit_transform(training_nan)

~\Anaconda3\lib\site-packages\fancyimpute\iterative_imputer.py in fit_transform(self, X, y)
936 Xt, predictor = self._impute_one_feature(
937 Xt, mask_missing_values, feat_idx, neighbor_feat_idx,
--> 938 predictor=None, fit_mode=True)
939 predictor_triplet = ImputerTriplet(feat_idx,
940 neighbor_feat_idx,

~\Anaconda3\lib\site-packages\fancyimpute\iterative_imputer.py in _impute_one_feature(self, X_filled, mask_missing_values, feat_idx, neighbor_feat_idx, predictor, fit_mode)
674 y_train = safe_indexing(X_filled[:, feat_idx],
675 ~missing_row_mask)
--> 676 predictor.fit(X_train, y_train)
677
678 # get posterior samples

~\Anaconda3\lib\site-packages\sklearn\linear_model\ridge.py in fit(self, X, y, sample_weight)
1146 gcv_mode=self.gcv_mode,
1147 store_cv_values=self.store_cv_values)
-> 1148 estimator.fit(X, y, sample_weight=sample_weight)
1149 self.alpha_ = estimator.alpha_
1150 if self.store_cv_values:

~\Anaconda3\lib\site-packages\sklearn\linear_model\ridge.py in fit(self, X, y, sample_weight)
1016 """
1017 X, y = check_X_y(X, y, ['csr', 'csc', 'coo'], dtype=np.float64,
-> 1018 multi_output=True, y_numeric=True)
1019 if sample_weight is not None and not isinstance(sample_weight, float):
1020 sample_weight = check_array(sample_weight, ensure_2d=False)

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
754 ensure_min_features=ensure_min_features,
755 warn_on_dtype=warn_on_dtype,
--> 756 estimator=estimator)
757 if multi_output:
758 y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
571 if force_all_finite:
572 _assert_all_finite(array,
--> 573 allow_nan=force_all_finite == 'allow-nan')
574
575 shape_repr = _shape_repr(array.shape)

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in _assert_all_finite(X, allow_nan)
54 not allow_nan and not np.isfinite(X).all()):
55 type_err = 'infinity' if allow_nan else 'NaN, infinity'
---> 56 raise ValueError(msg_err.format(type_err, X.dtype))
57
58

Answer 1 · 2019-04-09T17:01:28.000Z

Please provide a self-contained example that triggers this bug, including a minimal example dataset defined inline.

Answer 2 · 2019-04-09T17:14:22.000Z

I've done some investigation on my end. Nothing seems to be wrong with any specific rows of the dataset, so I have no idea which subset of it I should provide. My dataset has shape (4879, 67).

Here's what I'm running so far
No error:
IterativeImputer().fit_transform(training_nan[0:410])
IterativeImputer().fit_transform(training_nan[400:411])
IterativeImputer().fit_transform(training_nan[400:450])
IterativeImputer().fit_transform(training_nan[405:500])
IterativeImputer().fit_transform(training_nan[400:600])

Error:
IterativeImputer().fit_transform(training_nan[0:411])
IterativeImputer().fit_transform(training_nan[300:411])
IterativeImputer().fit_transform(training_nan[400:500])
IterativeImputer().fit_transform(training_nan[400:700])
IterativeImputer().fit_transform(training_nan[400:800])
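
Since no single row looks bad, it may help to look at per-column magnitudes instead of slicing rows. The following is only a diagnostic sketch using NumPy; `summarize_columns` and `X_demo` are hypothetical names introduced for illustration, and the demo values are made up rather than taken from the dataset in this thread.

```python
import numpy as np

def summarize_columns(X):
    """Collect per-column facts that often explain this ValueError."""
    X = np.asarray(X, dtype=np.float64)
    # Explicit infinities in the input would trigger the error directly.
    has_inf = np.isinf(X).any(axis=0)
    # NaN counts show how much each column relies on imputation.
    n_missing = np.isnan(X).sum(axis=0)
    # Mask non-finite entries so nanmax reports the largest real magnitude.
    finite = np.where(np.isfinite(X), X, np.nan)
    max_abs = np.nanmax(np.abs(finite), axis=0)
    return {"has_inf": has_inf, "n_missing": n_missing, "max_abs": max_abs}

# Hypothetical data: one column is several orders of magnitude larger.
X_demo = np.array([
    [1.0, 2.59e6, np.nan],
    [np.nan, 1.0e3, 0.5],
    [3.0, np.nan, 0.25],
])
report = summarize_columns(X_demo)
print(report["max_abs"])
```

Columns whose `max_abs` differs from the others by many orders of magnitude are candidates for rescaling before imputation.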

Answer 3 · 2019-04-09T17:19:44.000Z

Further experimentation shows that passing the dataset through StandardScaler before imputing removes the error, so it might be a problem with the range of the variables?
i.e. this works:
IterativeImputer().fit_transform(StandardScaler().fit_transform(training_nan))
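
The one-liner above returns values in standardized units. A minimal sketch of the same workaround that maps the result back to the original units is shown below. It assumes scikit-learn's own `IterativeImputer` (which needs the `enable_iterative_imputer` experimental import) rather than the fancyimpute class used in this thread, and it uses synthetic data in place of `training_nan`.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for training_nan: columns on very different scales.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4)) * np.array([1.0, 1e3, 1e6, 1e-2])
mask = rng.rand(*X.shape) < 0.1
X_nan = X.copy()
X_nan[mask] = np.nan

# StandardScaler ignores NaN when fitting and keeps NaN when transforming.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_nan)

# Impute in the standardized space, where all columns have similar ranges.
X_imputed_scaled = IterativeImputer(random_state=0).fit_transform(X_scaled)

# Undo the scaling so imputed values are in the original units.
X_imputed = scaler.inverse_transform(X_imputed_scaled)
```

Keeping the fitted `scaler` around is what makes the round trip possible; the observed entries come back unchanged up to floating-point rounding.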

Answer 4 · 2019-04-09T17:24:41.000Z

Weird that IterativeImputer().fit_transform(training_nan[0:410]) is fine but IterativeImputer().fit_transform(training_nan[0:411]) isn't.

Can you find a 10 row snippet that errors out and paste the actual data so I can try to reproduce on my end?

Answer 5 · 2019-04-09T17:29:56.000Z

The smallest snippet I've found that errors out is about 150 rows.

Answer 6 · 2019-04-09T17:30:55.000Z

Maybe make a gist with that pasted in?

Answer 7 · 2019-04-09T17:37:09.000Z

Here's the shortest subset I managed to find. If I drop the first column, I can add more rows before it errors out, so it seems to be a calculation size issue, since I can impute the entire dataset if I scale the values first.

https://pastebin.com/yK3vLHCD

Answer 8 · 2019-04-09T17:40:25.000Z

Thanks, I'll take a look later this week at some point.


Answer 9 · 2019-04-09T17:44:38.000Z

Is it normal practice to scale before imputing? Also, I've seen many articles recommending MICE but it seems like it's been replaced with IterativeImputer. What is the equivalent for this code?

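import pandas as pd
from fancyimpute import MICE  # requires a fancyimpute release that still ships MICE

# `train` is an existing DataFrame with missing values
train_cols = list(train)
train = pd.DataFrame(MICE(verbose=False).complete(train))
train.columns = train_cols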

Answer 10 · 2019-04-10T00:38:44.000Z

I downloaded your data from pastebin and executed this code:

import pandas as pd
from fancyimpute import IterativeImputer

df = pd.read_csv('yK3vLHCD.txt', header=None) # name of the pastebin file
X = df.values
Xt = IterativeImputer().fit_transform(X)

It worked without issue. Are you sure you have the latest version?

Answer 11 · 2019-04-10T00:42:54.000Z

As to your other questions:

(1) MICE can be easily built out of IterativeImputer instances. For an example, see this in-progress sklearn example: https://github.com/scikit-learn/scikit-learn/blob/2a95bd40006f15df9ba6537678067cd294f5832e/examples/impute/plot_multiple_imputation.py#L295

(2) Yes, it's fairly standard to preprocess data if the columns/features are very different. But it's not necessary.
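
As a rough illustration of point (1), several imputed copies of a DataFrame can be produced by running `IterativeImputer` with `sample_posterior=True` and a different `random_state` each time. This is a sketch under assumptions, not the linked sklearn example: it uses scikit-learn's `IterativeImputer` with its default estimator, the helper name `multiple_imputations` is hypothetical, and it assumes every column is numeric and has at least one observed value.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def multiple_imputations(df, n_imputations=5):
    """Return a list of imputed copies of df, one per random seed."""
    frames = []
    for seed in range(n_imputations):
        # sample_posterior=True draws imputed values from the predictive
        # distribution, so each seed gives a different completed dataset.
        imputer = IterativeImputer(sample_posterior=True, random_state=seed)
        filled = imputer.fit_transform(df.values)
        # Restore the original column names and index.
        frames.append(pd.DataFrame(filled, columns=df.columns, index=df.index))
    return frames

# Synthetic stand-in for the `train` DataFrame.
rng = np.random.RandomState(0)
data = rng.normal(size=(100, 3))
data[rng.rand(100, 3) < 0.1] = np.nan
train = pd.DataFrame(data, columns=["a", "b", "c"])

frames = multiple_imputations(train, n_imputations=5)
```

Each frame can then be analyzed separately and the results pooled, which is the usual way multiple imputation is used; averaging the frames directly is a simpler option if only a single completed dataset is needed.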

Answer 12 · 2019-04-10T08:21:43.000Z

Re: "It worked without issue. Are you sure you have the latest version?"

I'm running into problems with the exact code you posted:
lib\site-packages\sklearn\linear_model\ridge.py:971: RuntimeWarning: overflow encountered in square
v = s ** 2

pip show fancyimpute
Name: fancyimpute
Version: 0.4.2

Answer 13 · 2019-04-10T15:11:35.000Z

OK, I think I know what's going on. The maximum value of your data is 2590000 (about 2.59e6). When I impute, the maximum imputed value is 115454960165484 (about 1.15e14), which is massive. So you're just overflowing. That's why scaling before imputing solves the problem entirely. I am not sure why it's overflowing for you but not for me, but I think it's safe to just divide every column by its maximum absolute value, and you shouldn't have any further issues.
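
The suggestion to divide every column by its maximum absolute value could be sketched as follows. The helper name `impute_with_max_abs_scaling` and the demo data are hypothetical, scikit-learn's `IterativeImputer` stands in for the fancyimpute class, and the sketch assumes no column is entirely missing. scikit-learn's `MaxAbsScaler` offers the same scaling as a reusable transformer.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def impute_with_max_abs_scaling(X, **imputer_kwargs):
    """Scale columns into [-1, 1], impute, then restore original units."""
    X = np.asarray(X, dtype=np.float64)
    # Largest observed magnitude per column, ignoring missing entries.
    col_scale = np.nanmax(np.abs(X), axis=0)
    # Leave all-zero columns unscaled to avoid dividing by zero.
    col_scale[col_scale == 0] = 1.0
    X_filled = IterativeImputer(**imputer_kwargs).fit_transform(X / col_scale)
    # Multiply back so imputed values are in the original units.
    return X_filled * col_scale

# Synthetic data with one column reaching the magnitude seen in this thread.
rng = np.random.RandomState(0)
X_full = rng.uniform(size=(150, 3)) * np.array([2.59e6, 10.0, 0.1])
missing = rng.rand(150, 3) < 0.1
X_missing = X_full.copy()
X_missing[missing] = np.nan

X_done = impute_with_max_abs_scaling(X_missing, random_state=0)
```

Because the scaling is a plain per-column division, observed entries are recovered unchanged apart from floating-point rounding.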
