Input matrix X containing a constant column for `LinearRegression` is more complicated than that in `scikit-learn`
belzheng opened this issue · 9 comments
When the input matrix X contains a constant column, the `LinearRegression()` class in the abess package returns nan predictions instead of estimated values, whereas the scikit-learn class `LassoCV()` returns estimates. One way to avoid this is to set the parameter `is_normal=False`; however, this is not what users expect, nor how scikit-learn works. Since I have run into this many times, I wonder whether you could improve this API. The following code reproduces the case:
@belzheng, would you please paste your code here? Thanks!
Here is the code:
import numpy
from pyearth import Earth
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from abess import LinearRegression

# Simulate data: y depends only on column 6 of X
numpy.random.seed(0)
m = 1000
n = 10
X = 80*numpy.random.uniform(size=(m,n)) - 40
y = numpy.abs(X[:,6] - 4.0) + 1*numpy.random.normal(size=m)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Build basis functions with Earth; the transformed matrix contains a constant column
model = Earth(enable_pruning=False)
model.fit(X_train, y_train)
X_test_new = model.transform(X_test)
X_train_new = model.transform(X_train)
print(X_train_new)

# abess with default settings: predictions are nan
rega = LinearRegression()
rega.fit(X_train_new, y_train)
ya_pred = rega.predict(X_test_new)
print(ya_pred)

# abess with normalization disabled: predictions are estimated values
rega = LinearRegression()
rega.fit(X_train_new, y_train, is_normal=False)
ya_pred = rega.predict(X_test_new)
print(ya_pred)

# lasso: predictions are estimated values
from sklearn.linear_model import LassoCV
reglasso = LassoCV()
reglasso.fit(X_train_new, y_train)
ylasso_pred = reglasso.predict(X_test_new)
print(ylasso_pred)
I think the main difference is that `LassoCV` does not normalize the data. Users can normalize the data with `sklearn.preprocessing` and drop constant columns in advance.
But nan is certainly annoying... Maybe we should disable normalization when there is a constant column and give a warning? (If not, simply disable normalization by default?)
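For illustration, here is a minimal sketch of why a constant column can turn into nan under normalization. It assumes normalization means centering each column and dividing by its standard deviation; that is an assumption for this example, not a confirmed description of what abess does internally. The `standardize` helper and its `guard` flag are hypothetical names.

```python
import numpy as np

def standardize(X, guard=False):
    """Center and scale each column (hypothetical helper, not the abess code)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    if guard:
        # Leave zero-variance columns unscaled instead of dividing by zero.
        std = np.where(std == 0, 1.0, std)
    # Silence the 0/0 runtime warning so the nan result is easy to inspect.
    with np.errstate(divide="ignore", invalid="ignore"):
        return (X - mean) / std

# Column 0 is constant, column 1 varies.
X = np.column_stack([np.ones(5), np.arange(5.0)])

naive = standardize(X)              # constant column becomes 0/0 = nan
safe = standardize(X, guard=True)   # constant column becomes all zeros

print(naive[:, 0])
print(safe[:, 0])
```

With the guard, the constant column simply carries no information after centering, which may be a gentler default than propagating nan into the predictions.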
@oooo26, yup... but if we issue a warning to users, we have to check for constant columns in advance.
But if that is not expensive in time, I think it is OK.
Does `is_normal` speed up abess?
I have tested on linear and logistic regression, and there seems to be no obvious difference in speed. Besides, the main algorithm is the same whether or not we normalize.
If we want to check for constant columns, I think pandas' `nunique` (e.g. `DataFrame.nunique`) can help (on the Python side).
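A rough sketch of such a check on the Python side might look like the following. The function names `constant_columns` and `warn_if_constant` are hypothetical and not part of abess; the second variant, which compares column minima and maxima, is an alternative that avoids the pandas dependency.

```python
import warnings

import numpy as np
import pandas as pd

def constant_columns(X):
    """Return indices of columns with at most one distinct value (via pandas nunique)."""
    counts = pd.DataFrame(np.asarray(X)).nunique().to_numpy()
    return np.flatnonzero(counts <= 1)

def constant_columns_numpy(X):
    """Same check without pandas: a column is constant if its min equals its max."""
    X = np.asarray(X)
    return np.flatnonzero(X.max(axis=0) == X.min(axis=0))

def warn_if_constant(X):
    """Hypothetical helper: warn and report whether normalization should be skipped."""
    idx = constant_columns(X)
    if idx.size > 0:
        warnings.warn(f"Constant columns found at indices {idx.tolist()}; "
                      "consider disabling normalization.")
    return idx.size > 0

X = np.column_stack([np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])])
print(constant_columns(X))
print(constant_columns_numpy(X))
```

Both variants make one pass over the data, so the check is probably cheap relative to fitting, though that has not been measured here.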
So why do we have `is_normal` in our API? Was it designed by @Jiang-Kangkang?
Yes, I think so. And scikit-learn actually provided a `normalize` parameter at first, but it was deprecated in version 1.0 and removed in 1.2.
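For reference, on the scikit-learn side the usual way to normalize without a `normalize` parameter is to put the preprocessing into a pipeline. The sketch below uses synthetic data made up for this example; it drops zero-variance columns with `VarianceThreshold`, scales the rest with `StandardScaler`, and then fits `LassoCV`. It only illustrates the "normalize and drop constants in advance" approach mentioned earlier in the thread and says nothing about how abess should behave.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only: column 0 is constant.
rng = np.random.default_rng(0)
X = rng.uniform(-40, 40, size=(200, 5))
X[:, 0] = 1.0
y = 2.0 * X[:, 3] + rng.normal(size=200)

# Drop zero-variance columns, standardize the rest, then fit the lasso.
model = make_pipeline(VarianceThreshold(), StandardScaler(), LassoCV())
model.fit(X, y)
pred = model.predict(X)

# Boolean mask of the columns kept by VarianceThreshold.
kept = model.named_steps["variancethreshold"].get_support()
print(kept)
print(np.isfinite(pred).all())
```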