glm-tools/pyglmnet

Test `GLM` with scikit-learn `check_estimator` (setting attributes during init)

titipata opened this issue · 6 comments

According to openjournals/joss-reviews#1959 (comment), we cannot use scikit-learn's `check_estimator` with the `GLM` class. We might have to set the initial values here and here somewhere outside of `__init__`.

from pyglmnet import GLM
from sklearn.utils.estimator_checks import check_estimator

check_estimator(GLM)

Output:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-10-96431abb45f5> in <module>
----> 1 check_estimator(GLM)

~/anaconda3/lib/python3.7/site-packages/sklearn/utils/estimator_checks.py in check_estimator(Estimator)
    292         estimator = Estimator()
    293         check_parameters_default_constructible(name, Estimator)
--> 294         check_no_attributes_set_in_init(name, estimator)
    295     else:
    296         # got an instance

~/anaconda3/lib/python3.7/site-packages/sklearn/utils/estimator_checks.py in check_no_attributes_set_in_init(name, estimator)
   2067             "Estimator %s should not set any attribute apart"
   2068             " from parameters during init. Found attributes %s."
-> 2069             % (name, sorted(invalid_attr)))
   2070     # Ensure that each parameter is set in init
   2071     invalid_attr = set(init_params) - set(vars(estimator)) - {"self"}

AssertionError: Estimator GLM should not set any attribute apart from parameters during init. Found attributes ['beta0_', 'beta_', 'n_iter_', 'rng', 'ynull_'].
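The scikit-learn convention this check enforces is that `__init__` only stores constructor parameters verbatim, while anything derived (fitted attributes with a trailing underscore, RNG state) is created inside `fit`. A minimal sketch of that pattern, using a hypothetical `SketchGLM` rather than pyglmnet's actual class:

```python
import numpy as np
from sklearn.base import BaseEstimator


class SketchGLM(BaseEstimator):
    """Hypothetical estimator illustrating the sklearn init convention."""

    def __init__(self, reg_lambda=0.1, random_state=None):
        # store parameters verbatim; no validation, no derived attributes
        self.reg_lambda = reg_lambda
        self.random_state = random_state

    def fit(self, X, y):
        # fitted attributes and the RNG are created here, not in __init__
        rng = np.random.RandomState(self.random_state)
        n_features = X.shape[1]
        self.beta0_ = 0.0
        self.beta_ = rng.normal(0.0, 1.0, n_features)
        self.n_iter_ = 0
        return self
```

With this layout, `check_no_attributes_set_in_init` passes because `vars(SketchGLM())` contains only the constructor parameters.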

I fear this will go down a rabbit hole, but maybe good for the long term. I'll give it a try later today and see how far I can go ...

We can also fix in the writing (paper.md) first where we specify some functionality that can be used with some crucial scikit-learn functionality but not all.

@jasmainak I did try moving

self.beta0_ = None
self.beta_ = None
self.ynull_ = None
self.n_iter_ = 0
self.rng = np.random.RandomState(self.random_state)

into fit(...). However, I'm getting the following error from check_estimator instead:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-96431abb45f5> in <module>
----> 1 check_estimator(GLM)

~/anaconda3/lib/python3.7/site-packages/sklearn/utils/estimator_checks.py in check_estimator(Estimator)
    300     for check in _yield_all_checks(name, estimator):
    301         try:
--> 302             check(name, estimator)
    303         except SkipTest as exception:
    304             # the only SkipTest thrown currently results from not

~/anaconda3/lib/python3.7/site-packages/sklearn/utils/testing.py in wrapper(*args, **kwargs)
    353             with warnings.catch_warnings():
    354                 warnings.simplefilter("ignore", self.category)
--> 355                 return fn(*args, **kwargs)
    356 
    357         return wrapper

~/anaconda3/lib/python3.7/site-packages/sklearn/utils/estimator_checks.py in check_fit_score_takes_y(name, estimator_orig)
   1101         func = getattr(estimator, func_name, None)
   1102         if func is not None:
-> 1103             func(X, y)
   1104             args = [p.name for p in signature(func).parameters.values()]
   1105             if args[0] == "self":

~/Documents/github/pyglmnet/pyglmnet/pyglmnet.py in fit_predict(self, X, y)
    957             The predicted targets of shape (n_samples,).
    958         """
--> 959         return self.fit(X, y).predict(X)
    960 
    961     def score(self, X, y):

~/Documents/github/pyglmnet/pyglmnet/pyglmnet.py in fit(self, X, y)
    786 
    787         if hasattr(self, '_allow_refit') and self._allow_refit is False:
--> 788             raise ValueError("This glm object has already been fit before."
    789                              "A refit is not allowed")
    790 

ValueError: This glm object has already been fit before.A refit is not allowed

Note: the error comes from `fit_predict()`.

I think this is because of #330 . The reasoning at the time, as far as I can remember, was that refitting is dangerous because some estimated parameters (ending with an underscore) already exist and we don't flush them out before a second fit. scikit-learn handles this better by using `clone` etc., but disallowing refit seemed like an easy fix at the time. Can you try to undo the changes in that PR and see how far you can go? Feel free to open a WIP pull request.
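For reference, this is how scikit-learn sidesteps the refit problem: its checks call `clone()` to get a fresh, unfitted copy with the same parameters instead of refitting an object in place. A small demonstration with a stand-in estimator:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression  # stand-in estimator

X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.arange(10, dtype=float)

est = LinearRegression().fit(X, y)

# clone() copies the constructor parameters but drops all fitted state,
# so the copy behaves like a never-fitted estimator
fresh = clone(est)
```

Here `est` has a `coef_` attribute while `fresh` does not, so a subsequent `fresh.fit(...)` starts from a clean slate.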

@jasmainak so the last point is about scipy's `expit`. After I commented out `fit_predict`, I got the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-96431abb45f5> in <module>
----> 1 check_estimator(GLM)

~/anaconda3/lib/python3.7/site-packages/sklearn/utils/estimator_checks.py in check_estimator(Estimator)
    300     for check in _yield_all_checks(name, estimator):
    301         try:
--> 302             check(name, estimator)
    303         except SkipTest as exception:
    304             # the only SkipTest thrown currently results from not

~/anaconda3/lib/python3.7/site-packages/sklearn/utils/estimator_checks.py in check_complex_data(name, estimator_orig)
    671     estimator = clone(estimator_orig)
    672     assert_raises_regex(ValueError, "Complex data not supported",
--> 673                         estimator.fit, X, y)
    674 
    675 

~/anaconda3/lib/python3.7/site-packages/sklearn/utils/_unittest_backport.py in assertRaisesRegex(self, expected_exception, expected_regex, *args, **kwargs)
    222         context = _AssertRaisesContext(expected_exception,
    223                                        self, expected_regex)
--> 224         return context.handle('assertRaisesRegex', args, kwargs)

~/anaconda3/lib/python3.7/site-packages/sklearn/utils/_unittest_backport.py in handle(self, name, args, kwargs)
    111                 self.obj_name = str(callable_obj)
    112             with self:
--> 113                 callable_obj(*args, **kwargs)
    114         finally:
    115             # bpo-23890: manually break a reference cycle

~/Documents/github/pyglmnet/pyglmnet/pyglmnet.py in fit(self, X, y)
    832                                     alpha, self.Tau,
    833                                     reg_lambda, X, y, self.eta,
--> 834                                     beta, self.fit_intercept)
    835                 # Update
    836                 beta = beta - self.learning_rate * grad

~/Documents/github/pyglmnet/pyglmnet/pyglmnet.py in _grad_L2loss(distr, alpha, Tau, reg_lambda, X, y, eta, beta, fit_intercept)
    239 
    240     z = _z(beta[0], beta[1:], X, fit_intercept)
--> 241     mu = _mu(distr, z, eta, fit_intercept)
    242     grad_mu = _grad_mu(distr, z, eta)
    243 

~/Documents/github/pyglmnet/pyglmnet/pyglmnet.py in _mu(distr, z, eta, fit_intercept)
    100         mu = z
    101     elif distr == 'binomial':
--> 102         mu = expit(z)
    103     elif distr == 'probit':
    104         mu = norm.cdf(z)

TypeError: ufunc 'expit' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Not sure how to fix this properly.
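One plausible fix (an assumption, not pyglmnet's current code): `check_complex_data` expects `fit` to raise a `ValueError` mentioning "Complex data not supported" before the data ever reaches `expit`. scikit-learn's own input validation does exactly that, so calling `check_X_y` at the top of `fit` should satisfy the check:

```python
import numpy as np
from sklearn.utils import check_X_y

# complex-valued inputs like the ones check_complex_data feeds to fit()
X = np.array([[1 + 2j, 3 + 4j], [5 + 6j, 7 + 8j]])
y = np.array([1 + 0j, 2 + 0j])

try:
    # check_X_y rejects complex arrays with the ValueError the test expects
    check_X_y(X, y)
    message = ""
except ValueError as err:
    message = str(err)
```

The raised message contains "Complex data not supported", which matches the regex asserted by `check_complex_data`.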

As @jasmainak suggested, using scikit-learn's utilities makes `GLM` work with `check_estimator` (see #364). We can disregard the errors above.