k-measn++ uniqueness check
ghamerly opened this issue · 1 comments
Migrated from BaylorCS/baylorml#3 (@BenjaminHorn)
I guess this loop's https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L217 function is to prevent a point to be chosen more then once as center.
Let's say we have 10 points in our dataset. We init the centers with with init_centers_kmeanspp_v2.
- The first center is picked uniformly randomly: chosen_pts = {2}.
- The 2nd point is picked in the while loop, let's say: 5. chosen_pts = {2, 5}.
- The 3rd draw results: chosen_pts[2] = 2;
https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L218 results a unique = FALSE, we have to do the 3rd draw again.
// unique is instantiated as TRUE, outside(!) the loop
unique = unique && chosen_pts[2] != chosen_pts[0] // TRUE && FALSE (2 != 2) = FALSE
unique = unique && chosen_pts[2] != chosen_pts[1] // FALSE && TRUE (2 !=5) = FALSE
// !FALSE -> new loop
- 3rd draw (2nd do while loop): chosen_pts[2] = 8;
// unique is still FALSE from the previous loop
unique = unique && chosen_pts[2] != chosen_pts[0] // FALSE && TRUE(8 != 2) = FALSE
unique = unique && chosen_pts[2] != chosen_pts[1] // FALSE && TRUE(8 !=5) = FALSE
// !FALSE -> new loop
Maybe I'm getting it wrong, but with some mini datasets (where is the chance to pick the same center again is bigger) I often run into an endless loop.
I think you're right, we should set unique = true
inside the do / while loop as the first thing. Do you want to make this change and send a pull request?