K-means Clustering in R

Example k-means clustering analysis of red wine in R

Sample dataset on red wine samples used from UCI Machine Learning Repository.

df <- read.csv("winequality-red.csv", sep=";", header=TRUE)

head(df, n=2)

fixed.acidity	volatile.acidity	citric.acid	residual.sugar	chlorides	free.sulfur.dioxide	total.sulfur.dioxide	density	pH	sulphates	alcohol	quality
1	7.4	0.70	0.00	1.9	0.076	11	34	0.9978	3.51	0.56	9.4
2	7.8	0.88	0.00	2.6	0.098	25	67	0.9968	3.20	0.68	9.8

Data Preparation

To prepare the dataset for clustering, we center and scale the columns using scale(x, center = TRUE, scale = TRUE), where x is a matrix or dataframe.

df_scaled <- scale(df[-1])

head(df_scaled, n=2)

volatile.acidity	citric.acid	residual.sugar	chlorides	free.sulfur.dioxide	total.sulfur.dioxide	density	pH	sulphates	alcohol	quality
0.9615758	-1.391037	-0.45307667	-0.24363047	-0.46604672	-0.3790141	0.55809987	1.2882399	-0.57902538	-0.9599458	-0.7875763
1.9668271	-1.391037	0.04340257	0.22380518	0.87236532	0.6241680	0.02825193	-0.7197081	0.12891007	-0.5845942	-0.7875763

Determine Optimal Number of Columns

wssplot <- function(data, nc=15, seed=1234){
               wss <- (nrow(data)-1)*sum(apply(data,2,var))
               for (i in 2:nc){
                    set.seed(seed)
                    wss[i] <- sum(kmeans(data, centers=i)$withinss)}
                plot(1:nc, wss, type="b", xlab="Number of Clusters",
                     ylab="Within groups sum of squares")}

wssplot(df_scaled)

Deterimine optimal number of clusters

NbClust() is used to determine the best clustering scheme from the different results obtained by varying all combinations of number of clusters and distance methods.

library("NbClust")
set.seed(1234)
nc <- NbClust(df_scaled, min.nc=2, max.nc=15, method="kmeans")

In these two figures, Dindex refers to the Dunn index. Higher Dunn index values indicate better clustering.

Optimal number of clusters is determined as the number of clusters selected by the highest number of criteria. This can be visualized as a barplot:

barplot(table(nc$Best.n[1,]),
          xlab="Numer of Clusters", ylab="Number of Criteria",
          main="Number of Clusters Chosen by 26 Criteria")

K-means cluster analysis

kmeans() is used to obtain the final clustering solution.

set.seed(1234)
fit.km <- kmeans(df, 2, nstart=25)

fit.km$size returns the number of items in each cluster

fit.km$size

[1] 1179 420

fit.km$centers returns the central value for each cluster

fit.km$centers

fixed.acidity	volatile.acidity	citric.acid	residual.sugar	chlorides	free.sulfur.dioxide	total.sulfur.dioxide	density	pH	sulphates	alcohol	quality
8.424258	0.5193342	0.2665394	2.394275	0.08544614	12.37193	30.34436	0.9966768	3.315522	0.6565310	10.54022	5.724343
8.025952	0.5516429	0.2834286	2.944524	0.09313810	25.70833	91.72857	0.9969427	3.298738	0.6626905	10.09389	5.388095

As the centroids are quantified using the scaled data, the aggregate() function is used with the determined cluster memberships to quantify variable means for each cluster:

aggregate(df[-1], by=list(cluster=fit.km$cluster),mean)

cluster	fixed.acidity	volatile.acidity	citric.acid	residual.sugar	chlorides	free.sulfur.dioxide	total.sulfur.dioxide	density	pH	sulphates	alcohol	quality
1	0.5193342	0.2665394	2.394275	0.08544614	12.37193	30.34436	0.9966768	3.315522	0.6565310	10.54022	5.724343
2	0.5516429	0.2834286	2.944524	0.09313810	25.70833	91.72857	0.9969427	3.298738	0.6626905	10.09389	5.388095

Inspired by Chapter 16 in R in Action by Robert I. Kabacoff.

trevorwitter/Clustering

K-means Clustering in R

Data Preparation

Determine Optimal Number of Columns

Deterimine optimal number of clusters

K-means cluster analysis