multithreading center init
ghamerly opened this issue · 2 comments
Migrated from BaylorCS/baylorml#2 (@BenjaminHorn)
I have a fairly big dataset (100M × 10), and as I calculated it would take around 8 hours to initialise the centers with init_centers_kmeanspp_v2. After some testing I realised
- that only one core does the work
- most of the time is spent in this loop: https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L187
I have to admit I don't know much about multithreaded programming, but I think the loop could be split across several threads so it runs in parallel, with a worker roughly like:
double sumDistribution(int from, int to, Dataset const &x, pair<double, int> *dist2)
{
    double sum_distribution = 0.0;
    // here comes the loop, restricted to points in [from, to)
    return sum_distribution;
}
But those functions running in parallel would all have to read from the same dist2 array and the same x. Maybe this is why one iteration of the cluster loop takes 5–6 s and it can't be run in parallel to speed things up.
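Something like this sketch is what I have in mind (the chunk bounds and the worker dispatch are just my guess at how it could look; sumDistribution is the hypothetical helper from above):

#include <algorithm>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical driver: split [0, n) into one chunk per hardware thread and let
// each worker sum its own chunk into its own slot, so there are no shared writes.
// Dataset and sumDistribution are the type/helper from the sketch above.
double parallelSumDistribution(int n, Dataset const &x, std::pair<double, int> *dist2)
{
    unsigned num_threads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        int from = static_cast<int>(static_cast<long long>(n) * t / num_threads);
        int to   = static_cast<int>(static_cast<long long>(n) * (t + 1) / num_threads);
        workers.emplace_back([&, from, to, t]() {
            partial[t] = sumDistribution(from, to, x, dist2);
        });
    }
    for (auto &w : workers) { w.join(); }
    double sum_distribution = 0.0;
    for (double p : partial) { sum_distribution += p; }
    return sum_distribution;
}

As far as I understand, concurrent reads of x and dist2 should be fine as long as nothing writes to them while the workers run; any writes the real loop does to dist2 would be the tricky part.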
Before I start to dig into the topic, I just wanted to ask your opinion.
One other thing: why is https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L198 necessary?
if (dist2[i].first > max_dist) {
max_dist = dist2[i].first;
}
As far as I can see, max_dist isn't used anywhere.
The k-means++ initialization is inherently serial, which is why you see only one core doing the work. At the very least, each major stage of the algorithm requires the previous stages to complete (each iteration of https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L183).
However, the code inside that loop can be parallelized, especially the loop you mention at https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L187.
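For the sum itself, an OpenMP reduction might already go a long way; a minimal sketch, assuming each iteration of the L187 loop only reads x, only writes its own dist2[i], and accumulates into sum_distribution:

// Sketch only: the existing per-point work stays inside the loop unchanged.
double sum_distribution = 0.0;
#pragma omp parallel for reduction(+ : sum_distribution)
for (int i = 0; i < n; ++i) {
    // ... existing per-point work from general_functions.cpp#L187,
    //     e.g. updating dist2[i] for the newly chosen center ...
    sum_distribution += dist2[i].first;
}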
It would not take too much work to do this. I probably won't have time in the next few weeks to do it, but if you want to send a pull request I'd be happy to look at it and incorporate it.
One thing I have tried before that works well, but is not in this code, is to apply the triangle inequality to the initialization as well. It avoids distance computations on subsequent iterations and makes it much faster for large k. I may have this code lying around somewhere.
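Roughly, the pruning looks something like the following sketch (the names closest, center_dist2, pointCenterDist2 and new_center_index are made up, and it assumes dist2[i].first holds the squared distance to the nearest center chosen so far):

// Sketch of pruning distance computations when a new center c_new is added.
// Assumes dist2[i].first is the squared distance from point i to its closest
// chosen center, closest[i] (hypothetical) is that center's index, and
// center_dist2[j] is the squared distance from c_new to previous center j.
//
// Triangle inequality: d(x, c_new) >= d(c_new, c_j) - d(x, c_j), so if
// d(c_new, c_j) >= 2 * d(x, c_j), i.e. center_dist2[j] >= 4 * dist2[i].first,
// then d(x, c_new) >= d(x, c_j) and point i's entry cannot improve.
for (int i = 0; i < n; ++i) {
    int j = closest[i];
    if (center_dist2[j] >= 4.0 * dist2[i].first) {
        continue; // pruned: no distance computation needed for this point
    }
    double d2 = pointCenterDist2(x, i, c_new); // hypothetical squared-distance helper
    if (d2 < dist2[i].first) {
        dist2[i].first = d2;
        closest[i] = new_center_index;
    }
}

The distances between c_new and the previously chosen centers are only k values per iteration, so computing center_dist2 up front is cheap compared to the n distance computations it can save.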
Alternatively, there are other methods based on K-means++ that supposedly work faster in parallel, see here: http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
As for the max_dist variable, I think you're right that it's unused. It's probably left over from some older profiling code.
From BaylorCS/baylorml#2 (comment) (@kno10)
It may well be feasible to use random initialization.
In my experiments, k-means++ only marginally improved runtime. It may however yield better clusters.
This depends a lot on the data, but an 8-hour initialization may not be worth the improvement.
A simpler—yet effective—strategy for huge data is to initialize using k-means on a sample (if I'm not mistaken k-means|| is similar to running k-means++ on a distributed sample?).
I.e. randomly draw 1 million points (or even just 100k), run k-means++ on the sample, and do just 3–5 k-means iterations. Take the resulting centers as the initial centers for your complete data set. Finding these centers should take maybe 10 minutes, and they may well be better initial centers than what k-means++ on the full data gives; they may already be very good estimates.
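In terms of this C++ code, the idea would be roughly the following (every helper name here is made up, and I am guessing at the Dataset interface and at what init_centers_kmeanspp_v2 returns):

#include <random>

// Hypothetical sketch: k-means++ plus a few Lloyd iterations on a small sample,
// then reuse the resulting centers as the initial centers for the full data.
Dataset sampleBasedInit(Dataset const &x, int k, int sample_size, int sample_iters, unsigned seed)
{
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> pick(0, x.n - 1); // assumes x.n = number of points
    Dataset sample(sample_size, x.d);                    // assumed constructor (points, dimensions)
    for (int s = 0; s < sample_size; ++s) {
        copyPoint(x, pick(gen), sample, s);              // hypothetical helper; samples with replacement
    }
    Dataset centers = init_centers_kmeanspp_v2(sample, k); // assumed to return the k chosen centers
    runLloydIterations(sample, centers, sample_iters);      // hypothetical: 3-5 iterations on the sample
    return centers;                                         // use these as initial centers for the full x
}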
Here are some numbers from ELKI (not baylorml), on a data set of 38 million 2d coordinates, k=100, *on a single core and not on a dedicated benchmarking host*. For more reliable conclusions, this would need to be run on many data sets, with different k, and many more seeds, on dedicated systems.
SortMeans with random initialization, seed 0: 420 s, 41 iterations, varsum 4.471356921340852E8
SortMeans with random initialization, seed 1: 1000 s, 102 iterations, varsum 3.494018158732119E8
SortMeans with random initialization, seed 2: 570 s, 47 iterations, varsum 3.690071545123777E8
SortMeans with k-means++ initialization, seed 0: 810 s, 77 iterations, varsum 3.7151307775374913E8
SortMeans with k-means++ initialization, seed 1: 774 s, 69 iterations, varsum 3.8855969067854965E8
SortMeans with k-means++ initialization, seed 2: 492 s, 39 iterations, varsum 3.374139162253591E8
So the main benefit of k-means++ is that it is less likely to get stuck in a bad local minimum.
If we now initialize via k-means++ on a sample, we should get good results, with hopefully fewer iterations.
SortMeans initialized with 5it,km++,1%, seed 0: 624 s, 55 iter, varsum 3.538176164640555E8
SortMeans initialized with 5it,km++,1%, seed 1: 850 s, 85 iter, varsum 4.0328181944224614E8
SortMeans initialized with 5it,km++,1%, seed 2: 593 s, 56 iter, varsum 3.590163352246089E8
And, unfortunately, reality is much more complex... The performance is not bad, but also not substantially better. We would need much larger experiments to even be able to tell what is "usually" better. Overall, a lucky or unlucky random seed seems to be the largest factor.
The parameters used for the "sample k-means" combination in ELKI:
-time
-algorithm clustering.kmeans.KMeansSort -kmeans.k 100
-kmeans.initialization SampleKMeansInitialization -kmeans.seed 0
-kmeans.algorithm KMeansSort
-kmeans.initialization KMeansPlusPlusInitialMeans -kmeans.seed 0
-kmeans.maxiter 5 -kmeans.samplesize 0.01
-evaluator clustering.LogClusterSizes -resulthandler DiscardResultHandler
If you want to experiment with different seeds, I have contributed a seed command in 27d970511f46321012674498d99538c37f6db651.
To reproduce this sampling-based initialization in this C++ implementation, you would need to further extend the driver, e.g. with a function that uses precomputed centers for initialization.