dask/dask-tutorial

Is it possible to parallelize sklearn's DBSCAN algorithm?


from sklearn.cluster import DBSCAN
from sklearn.externals.joblib import parallel_backend

def meth2(X):
    # Fit DBSCAN while joblib dispatches its parallel work to the dask backend
    with parallel_backend('dask'):
        model = DBSCAN(eps=0.5, min_samples=30)
        model.fit(X)
    return model

I thought this would work as per the documentation, but I am getting:

BufferError: Existing exports of data: object cannot be re-sized

dask/dask-ml#158

Can you provide a reproducible example of a failure? This succeeds:

import dask.array as da
import sklearn.datasets
import sklearn.cluster
from sklearn.externals import joblib
from distributed import Client

client = Client()

X, y = sklearn.datasets.make_blobs()

model = sklearn.cluster.DBSCAN(eps=0.5, min_samples=3)

with joblib.parallel_backend("dask"):
    model.fit(X)

Here is my full script:

from dask.distributed import Client
from sklearn.externals.joblib import parallel_backend
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
import datetime

if __name__ == '__main__':
    X, y = make_blobs(n_samples=150000, n_features=2, centers=3, cluster_std=2.1)

    client = Client()
    now = datetime.datetime.now()
    model = DBSCAN(eps=0.5, min_samples=30)
    with parallel_backend('dask'):
        model.fit(X)
    print(datetime.datetime.now() - now)

Below is my output:

distributed.worker - WARNING - Compute Failed
Function: <sklearn.externals.joblib._dask.Batch object at 0x7f884869b1d0>
args: (array([[ 3.12448708, -4.43752312],
[ 4.89858449, -3.96334534],
[-9.70246128, 7.82301076],
...,
[ 6.25643046, -3.93627323],
[10.77439621, -5.29284763],
[-7.0445401 , 11.64406627]]))
kwargs: {}
Exception: TimeoutError('Timeout',)

and I had to stop the program manually (with Ctrl+C). Am I doing it wrong?
Also, the code you mentioned above fails to work on Windows. I tried the same code on Linux and it was fine. Does it have anything to do with the OS too?
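
(For what it's worth, one likely cause of Windows-only failures, an assumption not confirmed in this thread: on Windows new processes are started with "spawn", which re-imports the script, so any code that creates a Client() must sit behind an if __name__ == '__main__': guard. A minimal sketch:

from dask.distributed import Client

if __name__ == '__main__':
    # On Windows, "spawn" re-imports this script in every worker process;
    # without this guard each worker would try to start its own cluster,
    # and the program hangs or errors.
    client = Client()
    # ... create the estimator and call fit() here ...

This is why running the snippet above unguarded in a plain script can fail on Windows while working on Linux, where processes are forked instead.)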

Worked for me

In [1]: from dask.distributed import Client
   ...: from sklearn.externals.joblib import parallel_backend
   ...: from sklearn.datasets import make_blobs
   ...: from sklearn.cluster import DBSCAN

In [2]: import datetime

In [3]: X, y = make_blobs(n_samples=150000, n_features=2, centers=3, cluster_std=2.1)
   ...: client = Client()
   ...: now = datetime.datetime.now()
   ...: model = DBSCAN(eps=0.5, min_samples=30)
   ...: with parallel_backend('dask'):
   ...:     model.fit(X)
   ...: print(datetime.datetime.now() - now)
0:00:12.678909

You might try updating your versions of scikit-learn, dask, and distributed to see if that helps.
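
For reference, a quick way to print the installed versions (a minimal sketch):

import sklearn
import dask
import distributed

# Compare these against the latest releases on PyPI
print(sklearn.__version__, dask.__version__, distributed.__version__)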

Hi @mrocklin
I tried to run exactly the same code as you, but it just kept running for a very long time, so I shut it down. What do you think the problem is?

Hard to say in the abstract. You might try increasing the verbosity on the sklearn estimator and looking at the logs.
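
(One way to get more diagnostics, as a sketch: DBSCAN itself does not take a verbose parameter, so an alternative is to raise the log level on the client-side dask.distributed loggers and watch what the scheduler and workers report while fit() runs:

import logging

# Show scheduler/worker events from dask.distributed in the console;
# DEBUG is noisy, INFO is often enough to see where tasks stall.
logging.getLogger("distributed").setLevel(logging.DEBUG)

Run this before creating the Client so the extra log output is captured from startup onward.)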