This project was developed in the scope of the undergraduate course algorithmic problem solving of the NKUA(2018-2019). It includes 12 different implementations of clustering techniques from simple K-means to LSH/Hypercube clustering.
Each of the implemented algorithms includes 3 different steps. Initialization, Assignment and Update.
We provide hereby all the different combinations for the algorithmic implementation of each step:
- Random selection of k points (simplest)
- K-means++
- Lloyd’s assignment
- Assignment by Range search with LSH
- Assignment by Range search with Hypercube
- K-means
- Partitioning Around Medoids (PAM) improved like Lloyd’s
In total we get 12 combinations.
A makefile is included for compilation purposes and the program can be run by executing
./cluster -i input.csv -d euclidean -c cluster.conf -o out ./cluster -i input.csv -d cosine -c cluster.conf -o out
Where input.csv are the input data files in csv format and out is the filepath to the output that the programm will produce (you just need to state a name and the rest will be taken care of by the program)
cluster.conf is the file path to the configuration file for the program (a sample is provided)
The output file will contain for each cluster the indexes of the data belonging to it
Also a silhouette implementation is included for evaluation purposes.
Implementation was concluded in early 2019
- For every point, compute distance to every centroid.
- Return (exact) nearest centroid.
- Index k centroids into data-structure, e.g. LSH hashtables or Hypercube.
- For every non-centroid point, run ANN to find nearest centroid.
- Return (approximate) nearest centroid.