RuntimeError: Cannot clone object LassoCV(...), as the constructor either does not set or modifies parameter precompute
chrism2671 opened this issue · 6 comments
When setting `n_jobs=-1` on `MultiOutputRegressor`:

```python
from sklearn.multioutput import MultiOutputRegressor
from celer import LassoCV

m = MultiOutputRegressor(LassoCV(fit_intercept=False, cv=10, tol=0.1), n_jobs=-1)
m.fit(X, y)
```
Results in:

```
/usr/local/lib/python3.10/dist-packages/joblib/parallel.py in _return_or_raise(self)
    761         try:
    762             if self.status == TASK_ERROR:
--> 763                 raise self._result
    764             return self._result
    765         finally:

RuntimeError: Cannot clone object LassoCV(cv=10, fit_intercept=False, tol=0.1), as the constructor either does not set or modifies parameter precompute
```
I have attempted to explicitly set `precompute`, but this doesn't help. It works as expected when `n_jobs=1` on `MultiOutputRegressor`.
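For context on what the error means: `sklearn.base.clone` rebuilds an estimator from `get_params()` and then verifies that each parameter on the new object is identical to the value it passed in, so any constructor that transforms a parameter before storing it triggers exactly this `RuntimeError`. A minimal sketch with two toy estimators (`BadEstimator` and `GoodEstimator` are illustrative names, not part of celer or sklearn):

```python
from sklearn.base import BaseEstimator, clone

class BadEstimator(BaseEstimator):
    def __init__(self, precompute=True):
        # Modifying a parameter in __init__ violates scikit-learn's
        # contract: get_params() no longer returns what the constructor
        # received, so clone() raises the RuntimeError seen above.
        self.precompute = not precompute

class GoodEstimator(BaseEstimator):
    def __init__(self, precompute=True):
        # Store the parameter untouched; resolve defaults inside fit().
        self.precompute = precompute

clone(GoodEstimator())  # works

try:
    clone(BadEstimator())
    raise AssertionError("expected clone() to fail")
except RuntimeError:
    pass  # same "does not set or modifies parameter" error as above
```

This is also why the bug only appears with `n_jobs=-1`: `MultiOutputRegressor` clones the base estimator for each parallel worker, while `n_jobs=1` can skip that path.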
Thanks for raising this @chrism2671, it should be fixed by #295, which I'll merge as soon as the CI is green.
If you use celer in research work, we kindly ask that you cite our papers (see the README); if it's in an industrial context, we'd love to hear more about your use case. You can also have a look at our skglm package, which implements many more models and penalties, in particular non-convex ones that have better sparsifying properties.
Wow, that was fast! And here I was struggling to install the deps to do it myself! Thank you so much! :D
@chrism2671 this would really help us:

> If you use celer in research work, we kindly ask that you cite our papers (see the README); if it's in an industrial context, we'd love to hear more about your use case. You can also have a look at our skglm package, which implements many more models and penalties, in particular non-convex ones that have better sparsifying properties.
I'm attempting to reimplement this paper:
https://www.researchgate.net/publication/314533287_Sparse_Signals_in_the_Cross-Section_of_Returns

On my significantly reduced dataset (1 year of data, 100 columns), this takes my laptop approximately 2-3 days using sklearn, but only about 10 hours using celer. The paper used R/glmnet and thanks a supercomputer center in its notes.
I've worked hard to try to accelerate this (even writing my own lasso in KDB/q) and experimented with CUDA RAPIDS, but celer is the fastest by far. In my case, because of the `MultiOutputRegressor`, I'm doing many millions of small regressions. I do wonder if Python's inefficient multiprocessing is part of the bottleneck.
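Process-based parallelism can indeed dominate when each regression is tiny, since joblib serializes every task to a worker process. A rough sketch of one way to probe this, using scikit-learn's `Lasso` as a stand-in for celer's (the `alpha`, `batch_size`, and data shapes here are illustrative, not from the thread):

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.linear_model import Lasso

def fit_one(X, y_col):
    # One small Lasso fit per target column.
    return Lasso(alpha=0.1, fit_intercept=False).fit(X, y_col).coef_

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 50))
Y = rng.standard_normal((30, 200))  # many small targets

# With millions of tiny fits, the per-task dispatch/serialization cost of
# the default process backend adds up; grouping tasks with batch_size
# (or trying backend="threading" when the solver releases the GIL in
# compiled code) can reduce that overhead.
coefs = Parallel(n_jobs=-1, batch_size=64)(
    delayed(fit_one)(X, Y[:, k]) for k in range(Y.shape[1])
)
```

Timing this with different `batch_size` values and backends would show how much of the runtime is multiprocessing overhead versus solver time.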
How many rows do you have in your dataset? For such "large n_samples, small n_features" datasets, you should have a look at our GramCD solver in skglm, which is super fast: scikit-learn-contrib/skglm#229
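To illustrate why a Gram-based solver wins in that regime, here is a rough NumPy sketch of Lasso coordinate descent on the precomputed Gram matrix (this is the general idea only, not skglm's actual implementation):

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * |.|: shrink x toward zero by t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_gram_cd(X, y, alpha, n_iter=200):
    """Coordinate descent for min_w 1/(2n)||y - Xw||^2 + alpha*||w||_1.

    X^T X and X^T y are computed once up front, so each epoch costs
    O(n_features^2) instead of O(n_samples * n_features): a big win
    when n_samples >> n_features, since the Gram matrix is only
    n_features x n_features.
    """
    n, p = X.shape
    G = X.T @ X / n  # Gram matrix, precomputed once
    c = X.T @ y / n  # feature/target correlations
    w = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual, excluding
            # feature j's own current contribution.
            rho = c[j] - G[j] @ w + G[j, j] * w[j]
            w[j] = soft_threshold(rho, alpha) / G[j, j]
    return w
```

With `n_samples=30, n_features=300` (as below) the Gram matrix is larger than the data itself, so this trick doesn't pay off there.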
It's the other way round, unfortunately: (n_samples=30, n_features=300).