Team members: Chek Hin Choi (cc2373), Weilong Guo (wg97)
The aim of this project is to explore the famous wine dataset [2] using k-means clustering and Naive Bayes classification in mlpack. In particular, we compare the 4 algorithms described in the Results section below. The data set contains the results of a chemical analysis of three types of wine grown in the same region of Italy. The goal is to identify the three types of wine via clustering, and to evaluate each model by its prediction performance. Along the way we learned to use the C++ libraries mlpack [1] and Armadillo [3].
Our main reference for k-means clustering and Naive Bayes classification can be found in the documentation of [1].
- main.cpp -- main file
- wine.data -- Wine data from [2]
- wine.names -- Wine data description from [2]
- assignments_default -- Clustering result from k-means clustering with cold start
- assignments_withguess -- Clustering result from k-means clustering with warm start
- assignments_nbc -- Clustering result from Naive Bayes classification
To compile our code, execute

```shell
g++ -O2 main.cpp -o main -lmlpack -larmadillo
```
We now describe the results that we obtained for the following 4 algorithms:
In this algorithm, we simply call the off-the-shelf default k-means clustering from mlpack without specifying the initial cluster centroids. The correct classification rates are the following:

| Class 1 | Class 2 | Class 3 | Overall |
|---|---|---|---|
| 77.97% | 70.42% | 60.42% | 70.22% |
In this algorithm, we call the default k-means clustering from mlpack while specifying the correct initial cluster centroids. The correct classification rates are the following:

| Class 1 | Class 2 | Class 3 | Overall |
|---|---|---|---|
| 77.97% | 70.42% | 60.42% | 70.22% |
Compared with the cold start, we see that the k-means clustering has successfully converged: both algorithms give exactly the same classification rates, so on this data set the result is independent of the initial cluster centroids.
In this algorithm, we call the default k-means clustering from mlpack while specifying random initial cluster centroids. We then repeat this procedure 10,000 times and calculate the average correct classification rates, which are summarized in the table below:

| Class 1 | Class 2 | Class 3 | Overall |
|---|---|---|---|
| 72.61% | 74.46% | 63.87% | 70.99% |
Compared with the first 2 algorithms, the random-start average (70.99% overall) slightly exceeds the cold-start/warm-start result (70.22%), suggesting that the performance of cold start (or warm start, since they yield the same result) is below that of random start.
In this algorithm, we simply call the off-the-shelf default Naive Bayes classifier. The correct classification rates are the following:

| Class 1 | Class 2 | Class 3 | Overall |
|---|---|---|---|
| 84.75% | 100.00% | 95.83% | 93.82% |
Comparing the above results with k-means, we see that Naive Bayes outperforms all three k-means variants by a wide margin.
[1]: "Mlpack"
[2]: "Wine dataset"
[3]: "Armadillo"