Team members: Chek Hin Choi (cc2373), Weilong Guo (wg97)
The aim of this project is to explore the famous wine dataset [2] using k-means clustering and Naive Bayes classification in mlpack. In particular, we compare the 4 algorithms described in the Results section below. The data set contains the results of a chemical analysis of three types of wine grown in the same region of Italy. The goal is to identify the three types of wine via clustering, and to evaluate each model by its prediction performance. Along the way we learned to use the C++ libraries mlpack [1] and Armadillo [3].
Our main reference for k-means clustering and Naive Bayes classification can be found in the documentation of [1].
- main.cpp -- main file
- wine.data -- Wine data from [2]
- wine.names -- Wine data description from [2]
- assignments_default -- Clustering result from k-means clustering with cold start
- assignments_withguess -- Clustering result from k-means clustering with warm start
- assignments_nbc -- Clustering result from Naive Bayes classification
To compile our code, execute

```shell
g++ -O2 main.cpp -o main -lmlpack -larmadillo
```
We now describe the results that we obtained for the following 4 algorithms:
In this algorithm, we simply call the off-the-shelf default k-means clustering from mlpack without specifying the initial cluster centroids. The correct classification rates are the following:

| Class 1 | Class 2 | Class 3 | Overall |
|---|---|---|---|
| 77.97% | 70.42% | 60.42% | 70.22% |
In this algorithm, we call the default k-means clustering from mlpack while specifying the correct initial cluster centroids. The correct classification rates are the following:

| Class 1 | Class 2 | Class 3 | Overall |
|---|---|---|---|
| 77.97% | 70.42% | 60.42% | 70.22% |
Compared with the cold start, we see that the k-means clustering has successfully converged: both algorithms give exactly the same classification rates, so on this data set the result is independent of the initial cluster centroids.
In this algorithm, we call the default k-means clustering from mlpack while specifying random initial cluster centroids. We then repeat this procedure 10,000 times and calculate the average correct classification rates, which are summarized in the table below:

| Class 1 | Class 2 | Class 3 | Overall |
|---|---|---|---|
| 72.61% | 74.46% | 63.87% | 70.99% |
Compared with the first 2 algorithms, the random-start average (70.99% overall) slightly exceeds the cold-start/warm-start result (70.22%), suggesting that the performance of cold start (or warm start, since they yield the same result) is below that of random start.
In this algorithm, we simply call the off-the-shelf default Naive Bayes classifier. The correct classification rates are the following:

| Class 1 | Class 2 | Class 3 | Overall |
|---|---|---|---|
| 84.75% | 100.00% | 95.83% | 93.82% |
Comparing the above results with k-means, we see that Naive Bayes outperforms all three k-means variants by a wide margin.
[1]: "Mlpack"
[2]: "Wine dataset"
[3]: "Armadillo"