iskandr/fancyimpute

IterativeImputer: Input contains NaN, infinity or a value too large for dtype('float64').

joshuakoh1 opened this issue · 13 comments

Not having this issue trying to impute the same dataset with KNN.

ValueError Traceback (most recent call last)
in
1 #KNN(k=3).fit_transform(training_nan)
----> 2 IterativeImputer().fit_transform(training_nan)

~\Anaconda3\lib\site-packages\fancyimpute\iterative_imputer.py in fit_transform(self, X, y)
936 Xt, predictor = self._impute_one_feature(
937 Xt, mask_missing_values, feat_idx, neighbor_feat_idx,
--> 938 predictor=None, fit_mode=True)
939 predictor_triplet = ImputerTriplet(feat_idx,
940 neighbor_feat_idx,

~\Anaconda3\lib\site-packages\fancyimpute\iterative_imputer.py in _impute_one_feature(self, X_filled, mask_missing_values, feat_idx, neighbor_feat_idx, predictor, fit_mode)
674 y_train = safe_indexing(X_filled[:, feat_idx],
675 ~missing_row_mask)
--> 676 predictor.fit(X_train, y_train)
677
678 # get posterior samples

~\Anaconda3\lib\site-packages\sklearn\linear_model\ridge.py in fit(self, X, y, sample_weight)
1146 gcv_mode=self.gcv_mode,
1147 store_cv_values=self.store_cv_values)
-> 1148 estimator.fit(X, y, sample_weight=sample_weight)
1149 self.alpha_ = estimator.alpha_
1150 if self.store_cv_values:

~\Anaconda3\lib\site-packages\sklearn\linear_model\ridge.py in fit(self, X, y, sample_weight)
1016 """
1017 X, y = check_X_y(X, y, ['csr', 'csc', 'coo'], dtype=np.float64,
-> 1018 multi_output=True, y_numeric=True)
1019 if sample_weight is not None and not isinstance(sample_weight, float):
1020 sample_weight = check_array(sample_weight, ensure_2d=False)

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
754 ensure_min_features=ensure_min_features,
755 warn_on_dtype=warn_on_dtype,
--> 756 estimator=estimator)
757 if multi_output:
758 y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
571 if force_all_finite:
572 _assert_all_finite(array,
--> 573 allow_nan=force_all_finite == 'allow-nan')
574
575 shape_repr = _shape_repr(array.shape)

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in _assert_all_finite(X, allow_nan)
54 not allow_nan and not np.isfinite(X).all()):
55 type_err = 'infinity' if allow_nan else 'NaN, infinity'
---> 56 raise ValueError(msg_err.format(type_err, X.dtype))
57
58

Please provide a self-contained example that triggers this bug, including a minimal example dataset defined inline.

I've done some investigation on my end. No specific rows of the dataset seem to be at fault, so I have no idea which subset of the dataset I should provide. My dataset has shape (4879, 67).
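When no single row stands out, a per-column summary can be a quicker first check than slicing rows. The following is only a sketch, assuming NumPy is available and the data is numeric; `column_report` is a hypothetical helper, not part of fancyimpute or scikit-learn.

```python
import numpy as np

def column_report(X):
    """Summarize each column: finite min/max plus counts of NaN and inf entries."""
    X = np.asarray(X, dtype=np.float64)
    # Mask NaN and +/-inf so min and max describe the finite values only.
    finite_only = np.where(np.isfinite(X), X, np.nan)
    return {
        "min": np.nanmin(finite_only, axis=0),
        "max": np.nanmax(finite_only, axis=0),
        "n_nan": np.isnan(X).sum(axis=0),
        "n_inf": np.isinf(X).sum(axis=0),
    }
```

Columns with a nonzero `n_inf`, or with a min/max range many orders of magnitude wider than the rest, would be the first candidates to look at.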

Here's what I'm running so far
No error:
IterativeImputer().fit_transform(training_nan[0:410])
IterativeImputer().fit_transform(training_nan[400:411])
IterativeImputer().fit_transform(training_nan[400:450])
IterativeImputer().fit_transform(training_nan[405:500])
IterativeImputer().fit_transform(training_nan[400:600])

Error:
IterativeImputer().fit_transform(training_nan[0:411])
IterativeImputer().fit_transform(training_nan[300:411])
IterativeImputer().fit_transform(training_nan[400:500])
IterativeImputer().fit_transform(training_nan[400:700])
IterativeImputer().fit_transform(training_nan[400:800])

Further experimentation shows that passing the dataset through StandardScaler before imputing removes the error, so might the problem be the range of the variables?
i.e. this works:
IterativeImputer().fit_transform(StandardScaler().fit_transform(training_nan))

Weird that IterativeImputer().fit_transform(training_nan[0:410]) is fine but IterativeImputer().fit_transform(training_nan[0:411]) isn't.
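If the imputed values are needed in the original units, the scaling workaround can be undone after imputing. This is a minimal sketch, assuming NumPy is available. `impute_scaled` is a hypothetical helper, and `impute_fn` stands in for something like `IterativeImputer().fit_transform`. The standardization is done by hand with NaN-aware statistics instead of `StandardScaler`, so the sketch does not depend on how a given scikit-learn version treats missing values.

```python
import numpy as np

def impute_scaled(X, impute_fn):
    """Standardize columns (ignoring NaN), impute, then map back to original units."""
    X = np.asarray(X, dtype=np.float64)
    mean = np.nanmean(X, axis=0)
    std = np.nanstd(X, axis=0)
    std[std == 0] = 1.0              # constant columns: avoid dividing by zero
    X_scaled = (X - mean) / std      # missing entries stay NaN
    X_filled = impute_fn(X_scaled)   # e.g. IterativeImputer().fit_transform
    return np.asarray(X_filled) * std + mean
```

Observed entries come back unchanged up to floating-point rounding; only the entries that were NaN are replaced.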

Can you find a 10 row snippet that errors out and paste the actual data so I can try to reproduce on my end?

The smallest snippet I've found that errors out is about 150 rows
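Shrinking a failing subset by hand is tedious, so the search can be automated. Below is a rough sketch of a greedy reduction: it repeatedly drops chunks of rows and keeps the drop whenever the failure still reproduces. `shrink_failing_rows` and `fails` are hypothetical names. In this thread, `fails` would wrap the imputer call in a `try`/`except ValueError`, and every check means one full imputer run, so it can be slow.

```python
def shrink_failing_rows(rows, fails):
    """Greedily drop chunks of rows while `fails(rows)` stays True."""
    rows = list(rows)
    chunk = max(len(rows) // 2, 1)
    while chunk >= 1:
        start = 0
        while start < len(rows):
            candidate = rows[:start] + rows[start + chunk:]
            if candidate and fails(candidate):
                rows = candidate   # failure persists without this chunk: drop it
            else:
                start += chunk     # chunk is needed to reproduce: keep it
        chunk //= 2                # retry with smaller chunks, down to single rows
    return rows

# Hypothetical usage with row indices (training_nan assumed to be a NumPy array):
# def fails(idx):
#     try:
#         IterativeImputer().fit_transform(training_nan[idx])
#         return False
#     except ValueError:
#         return True
# small = shrink_failing_rows(range(len(training_nan)), fails)
```

The result is not guaranteed to be the smallest possible subset, only one from which no single row can be removed without losing the failure.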

Maybe make a gist with that pasted in?

Here's the shortest subset I managed to find. If I drop the first column, I can add more rows before it errors out. Since I can also impute the entire dataset if I scale the values first, it seems to be a calculation size issue.

https://pastebin.com/yK3vLHCD

Is it normal practice to scale before imputing? Also, I've seen many articles recommending MICE but it seems like it's been replaced with IterativeImputer. What is the equivalent for this code?

from fancyimpute import MICE
train_cols = list(train)
train = pd.DataFrame(MICE(verbose=False).complete(train))
train.columns = train_cols

I downloaded your data from pastebin and executed this code:

import pandas as pd
from fancyimpute import IterativeImputer

df = pd.read_csv('yK3vLHCD.txt', header=None) # name of the pastebin file
X = df.values
Xt = IterativeImputer().fit_transform(X)

It worked without issue. Are you sure you have the latest version?

As to your other questions:

(1) MICE can be easily built out of IterativeImputer instances. For an example, see this in-progress sklearn example: https://github.com/scikit-learn/scikit-learn/blob/2a95bd40006f15df9ba6537678067cd294f5832e/examples/impute/plot_multiple_imputation.py#L295

(2) Yes, it's fairly standard to preprocess data if the columns/features are very different. But it's not necessary.
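For point (1), the usual recipe is to run the imputer several times with posterior sampling switched on and a different seed each time, then analyze each completed dataset. The sketch below keeps the imputer behind a factory so it makes no hard claims about any one library. `multiple_impute` and `make_imputer` are hypothetical names, and the `sample_posterior` and `random_state` arguments in the commented usage are assumptions about the installed `IterativeImputer`.

```python
def multiple_impute(X, make_imputer, n_imputations=5):
    """Return a list of completed datasets, one per independently seeded imputer."""
    return [make_imputer(seed).fit_transform(X) for seed in range(n_imputations)]

# Hypothetical usage, assuming IterativeImputer accepts these arguments:
# from fancyimpute import IterativeImputer
# completed = multiple_impute(
#     train.values,
#     lambda seed: IterativeImputer(sample_posterior=True, random_state=seed),
# )
```

For a single completed DataFrame like the one in the old `MICE(...).complete(train)` snippet, the closest replacement is likely `pd.DataFrame(IterativeImputer().fit_transform(train), columns=train_cols)`.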

I'm running into problems with the exact code you posted:
lib\site-packages\sklearn\linear_model\ridge.py:971: RuntimeWarning: overflow encountered in square
v = s ** 2
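The warning points at the squaring step `v = s ** 2`. In float64, any magnitude above roughly 1.34e154 becomes `inf` when squared, and an `inf` reaching the later finiteness check would produce the `ValueError` from the first traceback. The sketch below only illustrates that threshold using the standard library. What `s` holds in this run, and how values that large would arise, is not established in the thread. `square_overflows` is a hypothetical helper.

```python
import math
import sys

# Largest magnitude whose square still fits in a float64 (about 1.34e154).
limit = math.sqrt(sys.float_info.max)

def square_overflows(value):
    """True if value * value is no longer finite in double precision."""
    value = float(value)
    # Plain multiplication yields inf on overflow instead of raising.
    return math.isinf(value * value)
```

This fits the observation that scaling the data first avoids the error, since standardized columns stay far from this limit.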

pip show fancyimpute
Name: fancyimpute
Version: 0.4.2