Using dask with clustering algorithms not included in dask_ml
I'm trying to use Dask to run scikit-learn's DBSCAN across multiple cores, but I'm running into issues. This algorithm isn't included in dask_ml, so I have been trying to call it as follows (similar to how it is used in dask/dask-tutorial#80):
```python
from dask.distributed import Client
from joblib import parallel_backend
from sklearn.cluster import DBSCAN

client = Client(processes=False, n_workers=5)
model = DBSCAN(eps=eps, min_samples=min_samps, metric=distance_sphere_and_time)
with parallel_backend('dask'):
    model.fit(X)
```
where `distance_sphere_and_time` is a custom metric.
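For context, scikit-learn expects a callable metric to take two 1-D sample arrays and return a single float. The body of the real `distance_sphere_and_time` isn't shown in the thread; a purely hypothetical stand-in of that shape might look like:

```python
import numpy as np

def distance_sphere_and_time(u, v):
    # Hypothetical placeholder only -- the real metric's body isn't in
    # the thread. scikit-learn calls this with two 1-D sample arrays
    # and expects a single float back.
    spatial = np.linalg.norm(u[:2] - v[:2])  # assumed spatial columns
    temporal = abs(u[2] - v[2])              # assumed timestamp column
    return spatial + temporal
```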
Despite requesting multiple workers, when I look at CPU utilization it still appears that only one core is being used. Am I doing something wrong, or is there a reason Dask cannot increase core usage with DBSCAN?
Also, I wasn't sure whether there is a specific repo better suited to this question, so please feel free to redirect me if there is one.
Thanks for the suggestion, Tom - I just gave that a try, and my CPU usage is still hovering right around 2%.
I guess I might have asked this prematurely. I just tried including `n_jobs` in the DBSCAN constructor and removing the arguments to the Client constructor, and it appears to be working (shown below for completeness):
```python
client = Client()
model = DBSCAN(eps=eps, min_samples=min_samps, metric=distance_sphere_and_time, n_jobs=-1)
with parallel_backend('dask'):
    model.fit(Distance)
```
It's at least using more of the cores (though not all of them yet), and CPU usage has risen to ~14%.
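One alternative worth noting, as a minimal sketch rather than something tried in this thread (it reuses the assumed `X`, `eps`, `min_samps`, and `distance_sphere_and_time` names from above): precompute the full pairwise distance matrix in parallel, then run DBSCAN with `metric='precomputed'` so the clustering step never calls the Python metric at all.

```python
from dask.distributed import Client
from joblib import parallel_backend
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

client = Client()

with parallel_backend('dask'):
    # pairwise_distances splits the O(n^2) distance computation into
    # chunks via joblib when n_jobs is set, so the dask workers can
    # share the expensive custom-metric calls.
    Distance = pairwise_distances(X, metric=distance_sphere_and_time, n_jobs=-1)

# With the matrix precomputed, DBSCAN only reads distances from it.
model = DBSCAN(eps=eps, min_samples=min_samps, metric='precomputed')
model.fit(Distance)
```

The trade-off is the O(n²) memory for the dense matrix, so this only fits datasets whose pairwise matrix fits in RAM.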
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.
Did this end up working, @lauren-gaiascope? Thanks.