Test `GLM` with scikit-learn `check_estimator` (setting attributes during init)
titipata opened this issue · 6 comments
According to openjournals/joss-reviews#1959 (comment), we cannot use scikit-learn's check_estimator
with the GLM
class. We should probably set the initial values (here and here) somewhere outside of the init.
from pyglmnet import GLM
from sklearn.utils.estimator_checks import check_estimator
check_estimator(GLM)
Output:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-10-96431abb45f5> in <module>
----> 1 check_estimator(GLM)
~/anaconda3/lib/python3.7/site-packages/sklearn/utils/estimator_checks.py in check_estimator(Estimator)
292 estimator = Estimator()
293 check_parameters_default_constructible(name, Estimator)
--> 294 check_no_attributes_set_in_init(name, estimator)
295 else:
296 # got an instance
~/anaconda3/lib/python3.7/site-packages/sklearn/utils/estimator_checks.py in check_no_attributes_set_in_init(name, estimator)
2067 "Estimator %s should not set any attribute apart"
2068 " from parameters during init. Found attributes %s."
-> 2069 % (name, sorted(invalid_attr)))
2070 # Ensure that each parameter is set in init
2071 invalid_attr = set(init_params) - set(vars(estimator)) - {"self"}
AssertionError: Estimator GLM should not set any attribute apart from parameters during init. Found attributes ['beta0_', 'beta_', 'n_iter_', 'rng', 'ynull_'].
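For context, the convention that check_no_attributes_set_in_init enforces can be sketched as below. This is a minimal illustration (SketchGLM is a hypothetical class, not pyglmnet's actual GLM): __init__ only stores the constructor parameters verbatim, while fitted attributes (trailing underscore) and derived state such as the RNG are created in fit.

```python
import numpy as np
from sklearn.base import BaseEstimator


class SketchGLM(BaseEstimator):
    """Hypothetical sketch of the convention check_estimator enforces."""

    def __init__(self, alpha=0.5, random_state=None):
        # __init__ may only store the constructor parameters, unmodified.
        self.alpha = alpha
        self.random_state = random_state

    def fit(self, X, y):
        # Fitted attributes (trailing underscore) and derived state such
        # as the RNG belong here, not in __init__.
        rng = np.random.RandomState(self.random_state)
        self.beta_ = rng.randn(X.shape[1])
        self.beta0_ = 0.0
        self.n_iter_ = 0
        return self
```

With this layout, `vars(SketchGLM())` contains only the constructor parameters, so the check above passes.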
I fear this will go down a rabbit hole, but maybe good for the long term. I'll give it a try later today and see how far I can go ...
We could also fix this in the writing (paper.md
) first, specifying that GLM can be used with some crucial scikit-learn functionality but not all of it.
@jasmainak I did try moving
self.beta0_ = None
self.beta_ = None
self.ynull_ = None
self.n_iter_ = 0
self.rng = np.random.RandomState(self.random_state)
into fit(...)
. However, I'm now getting the following error from check_estimator
instead:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-2-96431abb45f5> in <module>
----> 1 check_estimator(GLM)
~/anaconda3/lib/python3.7/site-packages/sklearn/utils/estimator_checks.py in check_estimator(Estimator)
300 for check in _yield_all_checks(name, estimator):
301 try:
--> 302 check(name, estimator)
303 except SkipTest as exception:
304 # the only SkipTest thrown currently results from not
~/anaconda3/lib/python3.7/site-packages/sklearn/utils/testing.py in wrapper(*args, **kwargs)
353 with warnings.catch_warnings():
354 warnings.simplefilter("ignore", self.category)
--> 355 return fn(*args, **kwargs)
356
357 return wrapper
~/anaconda3/lib/python3.7/site-packages/sklearn/utils/estimator_checks.py in check_fit_score_takes_y(name, estimator_orig)
1101 func = getattr(estimator, func_name, None)
1102 if func is not None:
-> 1103 func(X, y)
1104 args = [p.name for p in signature(func).parameters.values()]
1105 if args[0] == "self":
~/Documents/github/pyglmnet/pyglmnet/pyglmnet.py in fit_predict(self, X, y)
957 The predicted targets of shape (n_samples,).
958 """
--> 959 return self.fit(X, y).predict(X)
960
961 def score(self, X, y):
~/Documents/github/pyglmnet/pyglmnet/pyglmnet.py in fit(self, X, y)
786
787 if hasattr(self, '_allow_refit') and self._allow_refit is False:
--> 788 raise ValueError("This glm object has already been fit before."
789 "A refit is not allowed")
790
ValueError: This glm object has already been fit before.A refit is not allowed
Note: the error comes from fit_predict()
.
I think this is because of #330. The reasoning at the time, as far as I remember, was that refitting is dangerous because some estimated parameters (those ending with an underscore) already exist and we don't flush them out before a second fit. I think scikit-learn handles this better by using clone etc., but the guard seemed like an easy fix at the time. Can you try to undo the changes in that PR and see how far you can get? Feel free to open a WIP pull request.
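To illustrate how scikit-learn sidesteps the stale-state problem without a refit guard (using LinearRegression here only as a stand-in estimator):

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression

X = np.arange(6.0).reshape(3, 2)
y = np.array([0.0, 1.0, 2.0])
est = LinearRegression().fit(X, y)

# clone() rebuilds the estimator from its constructor parameters alone,
# discarding fitted attributes such as coef_, so stale state cannot leak
# into a second fit.
fresh = clone(est)

# Refitting the same instance simply overwrites the fitted attributes,
# which is the behavior check_estimator expects; no refit guard needed.
est.fit(X, y)
```

This is why the `_allow_refit` guard trips check_estimator: the checks assume calling fit a second time is always allowed.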
@jasmainak so the last point is about scipy's expit
. After I commented out fit_predict
, I got the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-96431abb45f5> in <module>
----> 1 check_estimator(GLM)
~/anaconda3/lib/python3.7/site-packages/sklearn/utils/estimator_checks.py in check_estimator(Estimator)
300 for check in _yield_all_checks(name, estimator):
301 try:
--> 302 check(name, estimator)
303 except SkipTest as exception:
304 # the only SkipTest thrown currently results from not
~/anaconda3/lib/python3.7/site-packages/sklearn/utils/estimator_checks.py in check_complex_data(name, estimator_orig)
671 estimator = clone(estimator_orig)
672 assert_raises_regex(ValueError, "Complex data not supported",
--> 673 estimator.fit, X, y)
674
675
~/anaconda3/lib/python3.7/site-packages/sklearn/utils/_unittest_backport.py in assertRaisesRegex(self, expected_exception, expected_regex, *args, **kwargs)
222 context = _AssertRaisesContext(expected_exception,
223 self, expected_regex)
--> 224 return context.handle('assertRaisesRegex', args, kwargs)
~/anaconda3/lib/python3.7/site-packages/sklearn/utils/_unittest_backport.py in handle(self, name, args, kwargs)
111 self.obj_name = str(callable_obj)
112 with self:
--> 113 callable_obj(*args, **kwargs)
114 finally:
115 # bpo-23890: manually break a reference cycle
~/Documents/github/pyglmnet/pyglmnet/pyglmnet.py in fit(self, X, y)
832 alpha, self.Tau,
833 reg_lambda, X, y, self.eta,
--> 834 beta, self.fit_intercept)
835 # Update
836 beta = beta - self.learning_rate * grad
~/Documents/github/pyglmnet/pyglmnet/pyglmnet.py in _grad_L2loss(distr, alpha, Tau, reg_lambda, X, y, eta, beta, fit_intercept)
239
240 z = _z(beta[0], beta[1:], X, fit_intercept)
--> 241 mu = _mu(distr, z, eta, fit_intercept)
242 grad_mu = _grad_mu(distr, z, eta)
243
~/Documents/github/pyglmnet/pyglmnet/pyglmnet.py in _mu(distr, z, eta, fit_intercept)
100 mu = z
101 elif distr == 'binomial':
--> 102 mu = expit(z)
103 elif distr == 'probit':
104 mu = norm.cdf(z)
TypeError: ufunc 'expit' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Not sure how to fix this properly.
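One possible fix (a sketch of the idea, not necessarily what pyglmnet ended up doing) is to validate X at the top of fit with scikit-learn's check_array, which rejects complex input with exactly the ValueError message that check_complex_data asserts on, instead of letting the data reach expit and raise a cryptic TypeError:

```python
import numpy as np
from sklearn.utils import check_array

# check_array raises ValueError("Complex data not supported ...") for
# complex-valued input, which is the message check_complex_data expects.
X_complex = np.array([[1 + 1j, 2 + 0j], [3 + 0j, 4 + 0j]])
try:
    check_array(X_complex)
    message = None
except ValueError as err:
    message = str(err)
```

Calling check_array(X) (and check_X_y(X, y)) at the start of fit would therefore convert this failure into the expected one.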
As @jasmainak suggested, using scikit-learn's utilities makes GLM
work with check_estimator
(see #364). We can disregard the errors above.