mathurinm/celer

RuntimeError: Cannot clone object LassoCV(...), as the constructor either does not set or modifies parameter precompute

chrism2671 opened this issue · 6 comments

When setting n_jobs=-1 on MultiOutputRegressor:

from sklearn.datasets import make_regression  # placeholder data; any X, y reproduce it
from sklearn.multioutput import MultiOutputRegressor
from celer import LassoCV

X, y = make_regression(n_samples=30, n_features=300, n_targets=5)
m = MultiOutputRegressor(LassoCV(fit_intercept=False, cv=10, tol=0.1), n_jobs=-1)
m.fit(X, y)

Results in:

/usr/local/lib/python3.10/dist-packages/joblib/parallel.py in _return_or_raise(self)
    761         try:
    762             if self.status == TASK_ERROR:
--> 763                 raise self._result
    764             return self._result
    765         finally:

RuntimeError: Cannot clone object LassoCV(cv=10, fit_intercept=False, tol=0.1), as the constructor either does not set or modifies parameter precompute

I have tried explicitly setting precompute, but this doesn't help. It works as expected with n_jobs=1 on MultiOutputRegressor.
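Until a fix lands, a minimal workaround sketch, assuming celer's LassoCV exposes sklearn's usual n_jobs parameter for the cross-validation loop: keep MultiOutputRegressor sequential (which avoids the clone) and parallelize across CV folds instead.

# Sketch: sequential over targets, parallel over CV folds.
# X, y as in the snippet above.
m = MultiOutputRegressor(
    LassoCV(fit_intercept=False, cv=10, tol=0.1, n_jobs=-1),
    n_jobs=1,
)
m.fit(X, y)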

Thanks for raising this @chrism2671; it should be fixed by #295, which I'll merge as soon as the CI is green.

If you use celer in research work, we kindly ask that you cite our papers (see the README); if it's in an industrial context, we'd love to hear more about your use case. You can also have a look at our skglm package, which implements many more models and penalties, in particular non-convex ones that have better sparsifying properties.

Wow, that was fast! And here I was struggling to install the deps to do it myself! Thank you so much! :D

@chrism2671 this would really help us:

If you use celer in research work, we kindly ask that you cite our papers (see the README); if it's in an industrial context, we'd love to hear more about your use case. You can also have a look at our skglm package, which implements many more models and penalties, in particular non-convex ones that have better sparsifying properties.
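For illustration, a minimal sketch of a non-convex penalty with skglm, assuming its MCPRegression estimator; the hyperparameter values are placeholders, not tuned:

# Sketch: MCP is a non-convex penalty with stronger sparsifying
# properties than the L1 norm; alpha and gamma are illustrative.
# X, y as in the snippet above.
from skglm import MCPRegression

clf = MCPRegression(alpha=0.1, gamma=3.0, fit_intercept=False)
clf.fit(X, y[:, 0])  # one target; wrap in MultiOutputRegressor for several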

I'm attempting to reimplement this paper:

https://www.researchgate.net/publication/314533287_Sparse_Signals_in_the_Cross-Section_of_Returns

On my significantly reduced dataset (1 year of data, 100 columns), this takes approximately 2-3 days on my laptop using sklearn, and about 10 hours using Celer. The paper uses R/glmnet and thanks a supercomputer center in its notes.

I've worked hard to accelerate this (even writing my own lasso in KDB/q) and have experimented with CUDA RAPIDS, but Celer is the fastest by far. In my case, because of the MultiOutputRegressor, I'm doing many millions of small regressions. I do wonder if Python's multiprocessing overhead is part of the bottleneck.
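If process overhead is the suspect, one sketch worth trying is joblib's threading backend, so the many small fits avoid pickling and process-spawn costs; this only pays off if the underlying solver releases the GIL, as compiled extensions often do:

# Sketch: run the per-target fits in threads rather than processes.
# m, X, y as in the snippet above.
from joblib import parallel_backend

with parallel_backend("threading", n_jobs=-1):
    m.fit(X, y)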

How many rows do you have in your dataset?
For such "large n_samples, small n_features" datasets, you should have a look at our GramCD solver in skglm, which is super fast: scikit-learn-contrib/skglm#229
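For reference, a sketch of what that could look like, assuming skglm's modular datafit/penalty/solver API and the GramCD solver from the linked issue; the alpha value is a placeholder:

# Sketch: Lasso solved by coordinate descent on the p x p Gram matrix
# X.T @ X, which is cheap when n_samples >> n_features.
# X, y as in the snippet above.
from skglm import GeneralizedLinearEstimator
from skglm.datafits import Quadratic
from skglm.penalties import L1
from skglm.solvers import GramCD

est = GeneralizedLinearEstimator(
    datafit=Quadratic(), penalty=L1(alpha=0.1), solver=GramCD()
).fit(X, y[:, 0])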

It's the other way around, unfortunately: (n_samples=30, n_features=300).