dask/dask-tutorial

Is it possible to parallelize sklearn's DBSCAN algorithm?


from sklearn.cluster import DBSCAN
from sklearn.externals.joblib import parallel_backend

def meth2(X):
    # Fit DBSCAN while joblib dispatches its parallel work to the dask backend
    with parallel_backend('dask'):
        model = DBSCAN(eps=0.5, min_samples=30)
        model.fit(X)
    return model

I thought this would work as per the documentation, but I am getting:

BufferError: Existing exports of data: object cannot be re-sized

dask/dask-ml#158

Can you provide a reproducible example of a failure? This succeeds:

import dask.array as da
import sklearn.datasets
import sklearn.cluster
from sklearn.externals import joblib
from distributed import Client

client = Client()

X, y = sklearn.datasets.make_blobs()

model = sklearn.cluster.DBSCAN(eps=0.5, min_samples=3)

with joblib.parallel_backend("dask"):
    model.fit(X)

Here is my full script:

from dask.distributed import Client
from sklearn.externals.joblib import parallel_backend
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
import datetime

if __name__ == '__main__':
    X, y = make_blobs(n_samples=150000, n_features=2, centers=3, cluster_std=2.1)

    client = Client()
    now = datetime.datetime.now()
    model = DBSCAN(eps=0.5, min_samples=30)
    with parallel_backend('dask'):
        model.fit(X)
    print(datetime.datetime.now() - now)

Below is my output:

distributed.worker - WARNING - Compute Failed
Function: <sklearn.externals.joblib._dask.Batch object at 0x7f884869b1d0>
args: (array([[ 3.12448708, -4.43752312],
[ 4.89858449, -3.96334534],
[-9.70246128, 7.82301076],
...,
[ 6.25643046, -3.93627323],
[10.77439621, -5.29284763],
[-7.0445401 , 11.64406627]]))
kwargs: {}
Exception: TimeoutError('Timeout',)

and I had to stop the program manually (with Ctrl+C). Am I doing it wrong?
Also, the code you mentioned above fails to work on Windows. I tried the same code on Linux and it was fine. Does it have anything to do with the OS too?
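
(For what it's worth, one likely cause of Windows-only failures, an assumption not confirmed in this thread: on Windows new processes are started with "spawn", which re-imports the script, so any code that creates a Client() must sit behind an if __name__ == '__main__': guard. A minimal sketch:

from dask.distributed import Client

if __name__ == '__main__':
    # On Windows, "spawn" re-imports this script in every worker process;
    # without this guard each worker would try to start its own cluster,
    # and the program hangs or errors.
    client = Client()
    # ... create the estimator and call fit() here ...

This is why running the snippet above unguarded in a plain script can fail on Windows while working on Linux, where processes are forked instead.)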

Worked for me

In [1]: from dask.distributed import Client
   ...: from sklearn.externals.joblib import parallel_backend
   ...: from sklearn.datasets import make_blobs
   ...: from sklearn.cluster import DBSCAN

In [2]: import datetime

In [3]: X, y = make_blobs(n_samples=150000, n_features=2, centers=3, cluster_std=2.1)
   ...: client = Client()
   ...: now = datetime.datetime.now()
   ...: model = DBSCAN(eps=0.5, min_samples=30)
   ...: with parallel_backend('dask'):
   ...:     model.fit(X)
   ...: print(datetime.datetime.now() - now)
0:00:12.678909

You might try updating your versions of scikit-learn, dask, and distributed to see if that helps.
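
For reference, a quick way to print the installed versions (a minimal sketch):

import sklearn
import dask
import distributed

# Compare these against the latest releases on PyPI
print(sklearn.__version__, dask.__version__, distributed.__version__)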

Hi @mrocklin
I tried to run exactly the same code as you, but it just kept running for a very long time, so I shut it down. What do you think the problem is?

Hard to say in the abstract. You might try increasing the verbosity on the sklearn estimator and looking at the logs.
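
(One way to get more diagnostics, as a sketch: DBSCAN itself does not take a verbose parameter, so an alternative is to raise the log level on the client-side dask.distributed loggers and watch what the scheduler and workers report while fit() runs:

import logging

# Show scheduler/worker events from dask.distributed in the console;
# DEBUG is noisy, INFO is often enough to see where tasks stall.
logging.getLogger("distributed").setLevel(logging.DEBUG)

Run this before creating the Client so the extra log output is captured from startup onward.)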