Is it possible to parallelize sklearn's DBSCAN algorithm?
from sklearn.cluster import DBSCAN
from sklearn.externals.joblib import parallel_backend

def meth2(X):
    # Fit DBSCAN while joblib dispatches its work to the dask backend
    with parallel_backend('dask'):
        model = DBSCAN(eps=0.5, min_samples=30)
        model.fit(X)
    return model
I thought this would work as per the documentation, but I am getting:
BufferError: Existing exports of data: object cannot be re-sized
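For comparison, note that DBSCAN's own n_jobs constructor argument already parallelizes the neighbor search across local cores, with no Dask backend involved; a minimal sketch (the make_blobs data here is illustrative, not from the report above):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Illustrative data; any 2-D feature array works here.
X, _ = make_blobs(n_samples=10000, n_features=2)

# n_jobs=-1 runs the neighbor queries on all local cores.
model = DBSCAN(eps=0.5, min_samples=30, n_jobs=-1)
model.fit(X)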
Can you provide a reproducible example of the failure? This succeeds:
import dask.array as da
import sklearn.datasets
import sklearn.cluster
from sklearn.externals import joblib
from distributed import Client

client = Client()

X, y = sklearn.datasets.make_blobs()
model = sklearn.cluster.DBSCAN(eps=0.5, min_samples=3)
with joblib.parallel_backend("dask"):
    model.fit(X)
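Assuming the fit completes, one quick sanity check is to count the clusters found (DBSCAN labels noise points as -1):

import numpy as np

labels = model.labels_  # cluster label per sample, -1 for noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(n_clusters, "clusters,", n_noise, "noise points")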
Here is a reproducible example:

from dask.distributed import Client
from sklearn.externals.joblib import parallel_backend
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
import datetime

if __name__ == '__main__':
    X, y = make_blobs(n_samples=150000, n_features=2, centers=3, cluster_std=2.1)
    client = Client()
    now = datetime.datetime.now()
    model = DBSCAN(eps=0.5, min_samples=30)
    with parallel_backend('dask'):
        model.fit(X)
    print(datetime.datetime.now() - now)
Below is my output:
distributed.worker - WARNING - Compute Failed
Function: <sklearn.externals.joblib._dask.Batch object at 0x7f884869b1d0>
args: (array([[ 3.12448708, -4.43752312],
[ 4.89858449, -3.96334534],
[-9.70246128, 7.82301076],
...,
[ 6.25643046, -3.93627323],
[10.77439621, -5.29284763],
[-7.0445401 , 11.64406627]]))
kwargs: {}
Exception: TimeoutError('Timeout',)
and I had to stop the program manually (with Ctrl+C). Am I doing it wrong?
Also, the code you posted above fails on Windows; when I tried the same code on Linux it was fine. Does the OS have anything to do with it too?
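One possible explanation, offered as an assumption rather than a confirmed diagnosis: on Windows, multiprocessing spawns fresh interpreter processes that re-import the main module, so any script that creates a Client must put that code under an if __name__ == '__main__': guard, and the short snippet earlier in this thread has no guard. A guarded sketch:

from distributed import Client
import sklearn.datasets
import sklearn.cluster
from sklearn.externals import joblib

if __name__ == '__main__':  # required on Windows, where workers are spawned
    client = Client()
    X, y = sklearn.datasets.make_blobs()
    model = sklearn.cluster.DBSCAN(eps=0.5, min_samples=3)
    with joblib.parallel_backend("dask"):
        model.fit(X)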
Worked for me:
In [1]: from dask.distributed import Client
   ...: from sklearn.externals.joblib import parallel_backend
   ...: from sklearn.datasets import make_blobs
   ...: from sklearn.cluster import DBSCAN
   ...:

In [2]: import datetime
   ...:

In [3]: X, y = make_blobs(n_samples=150000, n_features=2, centers=3, cluster_std=2.1)
   ...:
   ...: client = Client()
   ...: now = datetime.datetime.now()
   ...: model = DBSCAN(eps=0.5, min_samples=30)
   ...: with parallel_backend('dask'):
   ...:     model.fit(X)
   ...: print(datetime.datetime.now() - now)
   ...:
0:00:12.678909
You might try updating your versions of scikit-learn, dask, and distributed to see if that helps.
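A quick way to check which versions are installed (a minimal sketch; the imports match the packages used in this thread):

import sklearn
import dask
import distributed

# Print installed versions so they can be compared across machines.
print("scikit-learn:", sklearn.__version__)
print("dask:", dask.__version__)
print("distributed:", distributed.__version__)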
Hi @mrocklin
I tried to run exactly the same code as you, but it just kept running for a very long time and I had to shut it down. What do you think the problem is?
Hard to say in the abstract. You might try increasing the verbosity on the sklearn side and looking at the logs.
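DBSCAN itself takes no verbose argument, so one hedged alternative is to raise the log level on distributed and watch the Dask dashboard while the fit runs; a sketch, assuming a reasonably recent distributed with a local Client:

import logging
from dask.distributed import Client

# Surface scheduler/worker activity in the console.
logging.getLogger("distributed").setLevel(logging.DEBUG)

client = Client()
print(client.dashboard_link)  # open this URL for a live task view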