Using dask with clustering algorithms not included in dask_ml
I'm trying to use Dask to run scikit-learn's DBSCAN across multiple cores, but I'm running into issues. This algorithm isn't included in dask_ml, so I have been trying to call it as follows (similar to how it is used in dask/dask-tutorial#80):
```python
from dask.distributed import Client
from joblib import parallel_backend
from sklearn.cluster import DBSCAN

client = Client(processes=False, n_workers=5)
model = DBSCAN(eps=eps, min_samples=min_samps, metric=distance_sphere_and_time)
with parallel_backend('dask'):
    model.fit(X)
```
where `distance_sphere_and_time` is a custom metric.
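For context, scikit-learn expects a callable metric to take two 1-D sample arrays and return a single float. The body of the real `distance_sphere_and_time` isn't shown in the thread; a purely hypothetical stand-in of that shape might look like:

```python
import numpy as np

def distance_sphere_and_time(u, v):
    # Hypothetical placeholder only -- the real metric's body isn't in
    # the thread. scikit-learn calls this with two 1-D sample arrays
    # and expects a single float back.
    spatial = np.linalg.norm(u[:2] - v[:2])  # assumed spatial columns
    temporal = abs(u[2] - v[2])              # assumed timestamp column
    return spatial + temporal
```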
Despite requesting multiple workers, when I look at CPU utilization it still appears that only one core is being used. Am I doing something wrong, or is there a reason Dask cannot increase core usage with DBSCAN?
Also, I wasn't sure whether there is a specific repo better suited to this question, so please feel free to redirect me if there is one.
Thanks for the suggestion, Tom - I just gave that a try, and my CPU usage is still hovering right around 2%.
I guess I might have asked this prematurely. I just tried including `n_jobs` in the DBSCAN constructor and removing the arguments to the Client constructor, and it appears to be working (shown below for completeness):
```python
client = Client()
model = DBSCAN(eps=eps, min_samples=min_samps, metric=distance_sphere_and_time, n_jobs=-1)
with parallel_backend('dask'):
    model.fit(Distance)
```
It's at least using more of the cores (though not all of them yet), and CPU usage has risen to ~14%.
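One alternative worth noting, as a minimal sketch rather than something tried in this thread (it reuses the assumed `X`, `eps`, `min_samps`, and `distance_sphere_and_time` names from above): precompute the full pairwise distance matrix in parallel, then run DBSCAN with `metric='precomputed'` so the clustering step never calls the Python metric at all.

```python
from dask.distributed import Client
from joblib import parallel_backend
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

client = Client()

with parallel_backend('dask'):
    # pairwise_distances splits the O(n^2) distance computation into
    # chunks via joblib when n_jobs is set, so the dask workers can
    # share the expensive custom-metric calls.
    Distance = pairwise_distances(X, metric=distance_sphere_and_time, n_jobs=-1)

# With the matrix precomputed, DBSCAN only reads distances from it.
model = DBSCAN(eps=eps, min_samples=min_samps, metric='precomputed')
model.fit(Distance)
```

The trade-off is the O(n²) memory for the dense matrix, so this only fits datasets whose pairwise matrix fits in RAM.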
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.
Did this end up working, @lauren-gaiascope? Thanks.