gboeing/data-visualization

Clustering just over 500,000 geo points takes a very long time and runs out of memory.

v3ss0n opened this issue · 4 comments

I am using the same code as
https://github.com/gboeing/data-visualization/blob/master/location-history/google-location-history-cluster.ipynb
with my own dataset, which has over 500K points of lat and long.
It runs out of memory (48 GB of RAM) and the OS kills the process.
I tried reducing the dataset to 150K points, but the problem is the same.
I am using scikit-learn 0.18.1 on Linux with Anaconda + Python 2.7.
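The clustering step boils down to a single DBSCAN fit over all the points, roughly like the sketch below (not the notebook's exact code; the eps and min_samples values are placeholders and the coordinates are synthetic stand-ins):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# synthetic stand-in for my ~500K [lat, lon] points, converted to radians for haversine
coords = np.radians(np.random.uniform(low=[-90, -180], high=[90, 180], size=(500000, 2)))

# placeholder eps: ~1.5 km expressed in radians (mean earth radius ~6371 km)
eps_rad = 1.5 / 6371.0088

# this single fit is where memory usage blows up on the full dataset
db = DBSCAN(eps=eps_rad, min_samples=1, algorithm='ball_tree', metric='haversine').fit(coords)
labels = db.labels_
```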

Use scikit-learn v0.15, as some other versions seem to require computing a full pairwise distance matrix, which is unnecessary and uses an enormous amount of memory.

See also: https://stackoverflow.com/a/38731876

and: scikit-learn/scikit-learn#5275

I am now trying https://github.com/scikit-learn-contrib/hdbscan.
Is there any workaround for the default DBSCAN in the latest versions?
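For reference, this is roughly how I'm trying hdbscan (a sketch; min_cluster_size is a placeholder value, the coordinates are synthetic stand-ins converted to radians for the haversine metric):

```python
import numpy as np
import hdbscan

# synthetic stand-in for real [lat, lon] data, converted to radians for haversine
coords = np.radians(np.random.uniform(low=[-90, -180], high=[90, 180], size=(10000, 2)))

# min_cluster_size is a placeholder value
clusterer = hdbscan.HDBSCAN(min_cluster_size=10, metric='haversine')
labels = clusterer.fit_predict(coords)
```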

http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html says that:
This implementation bulk-computes all neighborhood queries, which increases the memory complexity to O(n.d) where d is the average number of neighbors, while original DBSCAN had memory complexity O(n).
Sparse neighborhoods can be precomputed using NearestNeighbors.radius_neighbors_graph with mode='distance'.

So will precomputing the neighborhoods with radius_neighbors_graph work?
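Something along these lines, maybe (just a sketch following the docs' suggestion; eps and min_samples are placeholder values and the coordinates are synthetic stand-ins):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

# synthetic stand-in for real [lat, lon] data, converted to radians for haversine
coords = np.radians(np.random.uniform(low=[-90, -180], high=[90, 180], size=(10000, 2)))

# placeholder eps: ~1.5 km expressed in radians (mean earth radius ~6371 km)
eps_rad = 1.5 / 6371.0088

# precompute only the sparse neighborhood graph instead of a dense distance matrix
nn = NearestNeighbors(radius=eps_rad, metric='haversine', algorithm='ball_tree').fit(coords)
graph = nn.radius_neighbors_graph(coords, mode='distance')

# pass the sparse distances to DBSCAN via metric='precomputed'
db = DBSCAN(eps=eps_rad, min_samples=5, metric='precomputed').fit(graph)
labels = db.labels_
```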

Sure, you could check to see whether something like that works. Let me know if it does. Otherwise, I'd just suggest using sklearn v0.15 if you need to do an ad hoc clustering job, as it seems to use far less memory. A virtual environment for this would be sensible.