gboeing/data-visualization

Clustering just over 500,000 geo points takes a very long time and runs out of memory.

v3ss0n opened this issue · 4 comments

I am using the same code as
https://github.com/gboeing/data-visualization/blob/master/location-history/google-location-history-cluster.ipynb
with my own dataset, which has over 500K points of lat and long.
It runs out of memory (48 GB of RAM) and the OS kills the process.
I tried reducing the dataset to 150K points, but the problem is the same.
I am using scikit-learn 0.18.1 on Linux with Anaconda + Python 2.7.
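The clustering step boils down to a single DBSCAN fit over all the points, roughly like the sketch below (not the notebook's exact code; the eps and min_samples values are placeholders and the coordinates are synthetic stand-ins):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# synthetic stand-in for my ~500K [lat, lon] points, converted to radians for haversine
coords = np.radians(np.random.uniform(low=[-90, -180], high=[90, 180], size=(500000, 2)))

# placeholder eps: ~1.5 km expressed in radians (mean earth radius ~6371 km)
eps_rad = 1.5 / 6371.0088

# this single fit is where memory usage blows up on the full dataset
db = DBSCAN(eps=eps_rad, min_samples=1, algorithm='ball_tree', metric='haversine').fit(coords)
labels = db.labels_
```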

Use scikit-learn v0.15, as some other versions seem to require computing a full pairwise distance matrix, which is unnecessary and uses an enormous amount of memory.

See also: https://stackoverflow.com/a/38731876

and: scikit-learn/scikit-learn#5275

I am now trying https://github.com/scikit-learn-contrib/hdbscan.
Is there any workaround for the default DBSCAN in the latest versions?
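For reference, this is roughly how I'm trying hdbscan (a sketch; min_cluster_size is a placeholder value, the coordinates are synthetic stand-ins converted to radians for the haversine metric):

```python
import numpy as np
import hdbscan

# synthetic stand-in for real [lat, lon] data, converted to radians for haversine
coords = np.radians(np.random.uniform(low=[-90, -180], high=[90, 180], size=(10000, 2)))

# min_cluster_size is a placeholder value
clusterer = hdbscan.HDBSCAN(min_cluster_size=10, metric='haversine')
labels = clusterer.fit_predict(coords)
```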

http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html says that:
This implementation bulk-computes all neighborhood queries, which increases the memory complexity to O(n.d) where d is the average number of neighbors, while original DBSCAN had memory complexity O(n).
Sparse neighborhoods can be precomputed using NearestNeighbors.radius_neighbors_graph with mode='distance'.

So will precomputing the neighborhoods with radius_neighbors_graph work?
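Something along these lines, maybe (just a sketch following the docs' suggestion; eps and min_samples are placeholder values and the coordinates are synthetic stand-ins):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

# synthetic stand-in for real [lat, lon] data, converted to radians for haversine
coords = np.radians(np.random.uniform(low=[-90, -180], high=[90, 180], size=(10000, 2)))

# placeholder eps: ~1.5 km expressed in radians (mean earth radius ~6371 km)
eps_rad = 1.5 / 6371.0088

# precompute only the sparse neighborhood graph instead of a dense distance matrix
nn = NearestNeighbors(radius=eps_rad, metric='haversine', algorithm='ball_tree').fit(coords)
graph = nn.radius_neighbors_graph(coords, mode='distance')

# pass the sparse distances to DBSCAN via metric='precomputed'
db = DBSCAN(eps=eps_rad, min_samples=5, metric='precomputed').fit(graph)
labels = db.labels_
```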

Sure, you could check to see whether something like that works. Let me know if it does. Otherwise, I'd just suggest using sklearn v0.15 if you need to do an ad hoc clustering job, as it seems to use far less memory. A virtual environment for this would be sensible.