Unpredictable results when some NaNs included in input
AlexeyPechnikov opened this issue · 0 comments
NaN values can't be used to fit a model and should normally be excluded beforehand. But dask_ml silently accepts NaNs and returns a wrong result:
```python
import numpy as np
from sklearn.pipeline import make_pipeline
from dask_ml.preprocessing import StandardScaler
from dask_ml.linear_model import LinearRegression

X = 1. * np.array([[1, 1], [1, 2], [2, 1], [2, 2]])
y = np.array([6., 8., 9., np.nan])

# fit with a NaN in y: no error is raised, but the output is wrong
reg = make_pipeline(StandardScaler(), LinearRegression())
reg.fit(X, y)
print(reg.predict(np.array([[3., 5.]])))
# [10.9140625]
```
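For comparison, plain scikit-learn refuses to fit at all when the target contains a NaN; its input validation raises a `ValueError` instead of silently producing a number (a minimal check on the same toy data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = 1. * np.array([[1, 1], [1, 2], [2, 1], [2, 2]])
y = np.array([6., 8., 9., np.nan])

# scikit-learn's check_array validation rejects NaN input
try:
    LinearRegression().fit(X, y)
    print("fit succeeded")
except ValueError as e:
    print("fit rejected:", e)
```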
```python
import dask.array

# excluding the NaNs first gives the correct result
mask = ~dask.array.isnan(y)
reg = make_pipeline(StandardScaler(), LinearRegression())
reg.fit(X[mask], y[mask])
print(reg.predict(np.array([[3., 5.]])))
# [15.54899511]
```
Sure, `reg.fit(X[~dask.array.isnan(y)], y[~dask.array.isnan(y)])` is the correct way, but it is extremely slow on big datasets because the masked indexing `y[~dask.array.isnan(y)]` (or `y[~np.isnan(y)]`) is expensive to compute. It would be nice if dask_ml could ignore NaNs itself, but as shown above the result is simply wrong. dask_ml does provide `SimpleImputer(strategy='mean')`, but filling gaps in multidimensional data with a 1D mean value is a terrible idea.
Is there any scalable way to make dask_ml functions simply exclude NaNs?
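One direction that seems to scale (a sketch, not an existing dask_ml feature): keep `X` and `y` as dask arrays and apply the boolean mask lazily, so the NaN filtering stays in the task graph instead of being computed eagerly up front; the filtered arrays then have unknown chunk sizes until computed. The chunk sizes below are illustrative.

```python
import numpy as np
import dask.array as da

# toy data as dask arrays; chunks=2 is just for illustration
X = da.from_array(1. * np.array([[1, 1], [1, 2], [2, 1], [2, 2]]), chunks=2)
y = da.from_array(np.array([6., 8., 9., np.nan]), chunks=2)

mask = ~da.isnan(y)   # lazy boolean mask, built once and reused
X_clean = X[mask]     # rows with a NaN target dropped lazily
y_clean = y[mask]     # chunk sizes stay unknown until computed

print(y_clean.compute())
```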