This project clusters files by name using the K-means and Jaro-Winkler algorithms. The program has been tested on Ubuntu 16.04.
- CMake (see the references at the end for installing it on Ubuntu 16.04)
- C++11 dev tools
$ sudo apt install build-essential
- Boost
$ sudo apt-get install libboost-all-dev
$ mkdir build
$ cd build
$ cmake ..
$ make
// Run K-means clustering with the Jaro-Winkler distance algorithm
$ ./filecluster -f ../data/filelist.txt -o ../result/K3_iter200_result.txt -k 3 -i 200
It is hard to define a mean value among characters. Therefore, the most frequently occurring character in each sequence of file names is used as the mean value instead.
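One way to read this rule is sketched below: for each character position, take the most frequent character among the file names in a cluster. The helper name `modeCentroid`, the per-position interpretation, and the tie-breaking rule are assumptions made for illustration. The code in kmeans.hpp may do this differently.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not taken from the project): builds a "mean" string
// for one cluster by picking the most frequent character at each position.
std::string modeCentroid(const std::vector<std::string>& names) {
    // The centroid is as long as the longest name in the cluster.
    std::size_t maxLen = 0;
    for (const auto& n : names) maxLen = std::max(maxLen, n.size());

    std::string centroid;
    centroid.reserve(maxLen);
    for (std::size_t pos = 0; pos < maxLen; ++pos) {
        std::array<std::size_t, 256> counts{};  // one counter per byte value
        for (const auto& n : names) {
            if (pos < n.size()) ++counts[static_cast<unsigned char>(n[pos])];
        }
        // max_element returns the first maximum, so ties go to the smaller byte value.
        const auto best = std::max_element(counts.begin(), counts.end());
        centroid.push_back(static_cast<char>(best - counts.begin()));
    }
    return centroid;
}
```

For example, the names `img_001.png`, `img_002.png`, and `doc_001.txt` give the centroid `img_001.png`, because each position is decided by a two-to-one majority.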
// Number of centroid resampling iterations, line 5 of kmeans.hpp
const size_t RESAMPLING_ITER = 25;
// Difference threshold for stopping the k-means process, line 8 of kmeans.hpp
const double DIFF_THRESHOLD = 0.01;
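The sketch below shows how a difference threshold of this kind is typically used as a stopping rule. It assumes the "difference" is the change in total within-cluster distance between two consecutive iterations and that `-i` sets the maximum number of iterations. Both are guesses, and `runUntilConverged` and `step` are hypothetical names, not part of kmeans.hpp.

```cpp
#include <cmath>
#include <cstddef>
#include <functional>

const double DIFF_THRESHOLD = 0.01;

// Hypothetical driver: `step` performs one assign-and-update pass and returns
// the total within-cluster distance after that pass (an assumed cost measure).
// Returns the number of passes actually executed.
std::size_t runUntilConverged(const std::function<double()>& step,
                              std::size_t maxIter,
                              double threshold = DIFF_THRESHOLD) {
    if (maxIter == 0) return 0;

    double prev = step();
    std::size_t iters = 1;
    while (iters < maxIter) {
        const double curr = step();
        ++iters;
        // Stop once the cost changes by less than the threshold.
        if (std::fabs(prev - curr) < threshold) break;
        prev = curr;
    }
    return iters;
}
```

A smaller threshold makes the loop run longer before it is considered converged, while the iteration cap bounds the run time when the cost keeps oscillating.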
How to install CMake on Ubuntu 16.04
K-means cluster in Python, C++, etc
Random selection from a STL container
Jaro-Winkler algorithm