ghamerly/fast-kmeans

k-measn++ uniqueness check

ghamerly opened this issue · 1 comments

Migrated from BaylorCS/baylorml#3 (@BenjaminHorn)

I guess this loop's https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L217 function is to prevent a point to be chosen more then once as center.

Let's say we have 10 points in our dataset. We init the centers with with init_centers_kmeanspp_v2.

  • The first center is picked uniformly randomly: chosen_pts = {2}.
  • The 2nd point is picked in the while loop, let's say: 5. chosen_pts = {2, 5}.
  • The 3rd draw results: chosen_pts[2] = 2;

https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L218 results a unique = FALSE, we have to do the 3rd draw again.

// unique is instantiated as TRUE, outside(!) the loop
unique = unique && chosen_pts[2] != chosen_pts[0]  // TRUE && FALSE (2 != 2) = FALSE
unique = unique && chosen_pts[2] != chosen_pts[1] //  FALSE && TRUE (2 !=5) = FALSE
// !FALSE -> new loop
  • 3rd draw (2nd do while loop): chosen_pts[2] = 8;
// unique is still FALSE from the previous loop
unique = unique && chosen_pts[2] != chosen_pts[0] // FALSE && TRUE(8 != 2) = FALSE
unique = unique && chosen_pts[2] != chosen_pts[1] // FALSE && TRUE(8 !=5) = FALSE
// !FALSE -> new loop

Maybe I'm getting it wrong, but with some mini datasets (where is the chance to pick the same center again is bigger) I often run into an endless loop.

I think you're right, we should set unique = true inside the do / while loop as the first thing. Do you want to make this change and send a pull request?