abess-team/abess

Handling an input matrix X with a constant column in `LinearRegression` is more complicated than in `scikit-learn`

belzheng opened this issue · 9 comments

When the input matrix X contains a constant column, the LinearRegression() class in the abess package predicts nan instead of estimated values, which is not the case with scikit-learn's LassoCV() class. One way to avoid this is to set the parameter is_normal=False; however, this is neither what users expect nor how scikit-learn works. Since I have run into this kind of thing many times, I wonder whether you could improve this API. The following code describes the case concisely:

@belzheng, would you please paste your code here? thx!

Here is the code:

import numpy
from pyearth import Earth
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoCV
from abess import LinearRegression

# Simulate data: y depends only on the 7th feature, plus Gaussian noise.
numpy.random.seed(0)
m = 1000
n = 10
X = 80 * numpy.random.uniform(size=(m, n)) - 40
y = numpy.abs(X[:, 6] - 4.0) + 1 * numpy.random.normal(size=m)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Expand the features with MARS basis functions; the transformed design
# matrix contains a constant (intercept) column.
model = Earth(enable_pruning=False)
model.fit(X_train, y_train)
X_test_new = model.transform(X_test)
X_train_new = model.transform(X_train)
print(X_train_new)

# abess with default settings: the predictions are all nan.
rega = LinearRegression()
rega.fit(X_train_new, y_train)
ya_pred = rega.predict(X_test_new)
print(ya_pred)

# abess with normalization disabled: the predictions are finite.
rega = LinearRegression()
rega.fit(X_train_new, y_train, is_normal=False)
ya_pred = rega.predict(X_test_new)
print(ya_pred)

# scikit-learn's LassoCV handles the same design matrix without issue.
reglasso = LassoCV()
reglasso.fit(X_train_new, y_train)
ylasso_pred = reglasso.predict(X_test_new)
print(ylasso_pred)

I think the main difference is that LassoCV does not handle normalization itself. Users can normalize the data with sklearn.preprocessing and drop the constant column in advance, as in the sketch below.
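For reference, a minimal sketch of that user-side workaround (the helper name fit_without_constant_cols is just illustrative, and it assumes the Earth-transformed matrices from the snippet above):

from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from abess import LinearRegression

def fit_without_constant_cols(X_train, y_train, X_test):
    # Drop zero-variance columns (e.g. the intercept column added by Earth).
    vt = VarianceThreshold(threshold=0.0)
    X_train = vt.fit_transform(X_train)
    X_test = vt.transform(X_test)
    # Standardize the remaining columns.
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    # With the constant column removed, the default settings work.
    model = LinearRegression()
    model.fit(X_train, y_train)
    return model.predict(X_test)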

But nan is surely annoying... Maybe we should disable normalization when there is a constant column and give a warning? (If not, just disable normalization by default?)

@oooo26, yup... but if we raise a warning for users, we have to check for constant columns in advance.

But if it is not computationally expensive, I think it is OK.

Does is_normal speed up abess?

I have tested on linear/logistic regression and there seems to be no obvious difference in speed. Besides, the main algorithm is the same whether we normalize or not.
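For what it is worth, a rough timing sketch on simulated data (illustrative only; the sizes and seed are arbitrary assumptions, and results will vary by machine):

import time
import numpy as np
from abess import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 100))
y = X @ rng.normal(size=100) + rng.normal(size=5000)

for is_normal in (True, False):
    start = time.perf_counter()
    LinearRegression().fit(X, y, is_normal=is_normal)
    print(f"is_normal={is_normal}: {time.perf_counter() - start:.3f}s")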

If we want to check for constant columns, I think pd.nunique can help (on the Python side), for instance:
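A sketch of what such a check could look like (just an illustration of the idea; has_constant_column is a hypothetical helper):

import warnings
import numpy as np
import pandas as pd

def has_constant_column(X):
    # nunique() counts distinct values per column; a constant column has exactly one.
    return (pd.DataFrame(X).nunique() == 1).any()

X = np.column_stack([np.ones(100), np.random.normal(size=(100, 3))])
if has_constant_column(X):
    warnings.warn("X contains a constant column; disabling normalization (is_normal=False).")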

So, why do we provide is_normal in our API? Was it designed by @Jiang-Kangkang?

Yes, I think so. Actually, scikit-learn provided normalize at first, but it was deprecated in version 1.0 (and removed in 1.2).
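(For context, the pattern scikit-learn now recommends in place of normalize is to pipeline a scaler with the estimator, roughly like this:)

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV

# Scaling happens inside the pipeline instead of via the removed normalize option.
model = make_pipeline(StandardScaler(), LassoCV())
# model.fit(X_train_new, y_train); model.predict(X_test_new)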