A parallel implementation of K-Means algorithm in C++ and OpenMP.
The task of k-means is to divide points within a space in K groups based on their characteristics. It is based on few steps:
- generate N points;
- generate K centroids that represent K clusters;
- for each point, compute the distance between the point and all of the clusters and assign the point to the nearest cluster;
- update the centroid’s characteristics, specially its coordinates, based on the new points inside the cluster;
- repeat from 3 until reaching a maximum number of iterations or until the clusters won’t move.
The implementation covres both sequential and parallel version, in C++ and OpenMP.
Two classes are defined:
- Point.h
- 2D points chosen randomly;
x_coord
,y_coord
,cluster_id
;setter()
andgetter()
methods.
- Cluster.h
- 2D points chosen randomly at first;
x_coord
,y_coord
,size
,new_x_coord
,new_y_coord
;setter()
andgetter()
methods and others to update the coordinates of the centroids.
The file main_sequential
starts the sequential version:
- in this program, points and clusters are 2D points chosen randomly;
- the first functions initialize clusters and points;
- after that, there is a while loop where are the 2 most important functions:
compute_distance()
;update_clusters()
;
The program is mostly parallelizable in the compute_distance()
function and we can parallelize the outer for.
OpenMP was choosen in order to parallelize the computing:
- Private variables for each thread:
min_distance
andmin_index
; - Firstprivate variables for each thread:
points_size
andclusters_size
; - Shared variable: vector of points and vector of clusters;
- since the amount for computation is equal for each thread, it was chosen a static scheduling.
Here is a table that shows the speedup analysis:
Here an example of the execution of the program plotted with Gnuplot:
A copy of the report (italian) can be found here.
A copy of the presentation can be found here.
Licensed under the term of MIT License.