
Introduction to Clustering: $k$-means

By the end of this lecture, students will have:

  • Assessed what scenarios could use $k$-means

  • Articulated the methodology used by $k$-means

  • Applied KMeans from sklearn.cluster to a relevant dataset

  • Selected the appropriate number of clusters using $k$-means and the elbow method

  • Practiced applying $k$-means to an image color-reduction problem

Scenario

You work for the marketing department within a large company that manages a customer base. For each customer you have a record of average purchase cost and time since last purchase.
You know that if you want to retain your customers, you cannot treat them all the same. You can target marketing ads toward groups that demonstrate different behavior, but how will you divide the customers into groups?

Part 1: Concept introduction

Import libraries and download dataset

We are continuing to use scikit-learn as our main library. The specific documentation for $k$-means can be found here.

import os
import sys
module_path = os.path.abspath(os.pardir)
if module_path not in sys.path:
    sys.path.append(module_path)
# Required packages for today
from sklearn.cluster import KMeans
from sklearn import metrics
from sklearn import datasets

# Familiar packages for plotting, data manipulation, and numeric functions
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# import Alison's code for the demo clusters
from src.demo_images import *

# Have plots appear in notebook
%matplotlib inline

# Default plot params
plt.style.use('seaborn')
cmap = 'tab10'

Clustering! Finding GROUPS

How many groups do you see?

[Image: scatter plot of unlabeled points]

Wait - How is clustering different from classification?

In classification you know what groups are in the dataset and the goal is to predict class membership accurately.

In clustering you do not know which groups are in the dataset and you are trying to identify the groups.

So what do you do with clustering results?

Clustering is often an informing step in your analysis. Once clusters are identified, one can:

  • Create strategies on how to approach each group differently
  • Use cluster membership as an independent variable in a predictive model
  • Use the clusters as the target label in future classification models. How would you assign new data to the existing clusters? (One answer is sketched below.)
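For that last question, a fitted sklearn KMeans model can assign new observations to the nearest existing centroid via its predict method. A minimal sketch with made-up numbers for our marketing scenario (the customer values here are hypothetical, not from any real dataset):

from sklearn.cluster import KMeans
import numpy as np

# Hypothetical customers: (average purchase cost, days since last purchase)
customers = np.array([[20, 5], [22, 7], [90, 40], [95, 45]])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)

# New customers are assigned to the closest existing centroid
print(model.predict([[21, 6], [92, 42]]))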

Explore the algorithm with an intuitive $k$-means approach

Observe the following four methods with a sample dataset:

Method Questions:

  • What do they have in common?
  • What are the differences between them?
  • How many groups are there in the end?
  • Do you see any problems with this method?

Method 1

[Animation: centroids start at the left of the point cloud]

Method 2

[Animation: centroids start at the right]

Method 3

[Animation: centroids start at the top]

Method 4

[Animation: centroids start at the bottom]

Review Method Questions:

  • What do they have in common?
  • What are the differences between them?
  • How many groups are there in the end?
  • Do you see any problems with this method?

In common:

  • Green dots (the centroids) start at certain points
  • Distances are calculated
  • The dots are moved
  • Distances are re-measured
  • The dots are moved again as needed

Differences:

  • Dots start in different places and groups settle in different places

Groups:

  • There are four groups

Problem with this method?

  • Too variable!

The $k$-means algorithm, at its core, is an optimization procedure

It reassigns groups and adjusts centroids to minimize the distances within each cluster (each point to its own centroid), and to maximize the separation between the clusters.

The steps of the KMeans algorithm are pretty straightforward:

  1. Initialize cluster centers.
  2. Calculate the distance from every point in the dataset to each cluster center, and assign each point to the closest center.
  3. Move each cluster center to the average of all points assigned to that cluster.
  4. Repeat until some stopping criterion has been met (e.g., the clusters no longer move).

The scikit-learn documentation describes the algorithm in more detail if you wish to dig deeper.
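To make those four steps concrete, here is a minimal, illustrative NumPy sketch of the loop. This is not scikit-learn's actual implementation (for example, it assumes no cluster ever ends up empty), just the bare algorithm:

import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialize centers at k randomly chosen points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign each point to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Move each center to the mean of its assigned points
        # (assumes every cluster keeps at least one point)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Stop when the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels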

Data for the exercise

  • This is a sample dataset.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
dummy_dat = pd.read_csv("data/xclara.txt",
                        header=0,
                        index_col=0)
dummy_dat.reset_index(inplace=True)
dummy_dat.drop('index', axis=1, inplace=True)
dummy_dat.head()
dummy_dat.tail()

EDA of variables

dummy_dat.describe()
fig, ax = plt.subplots()
ax.scatter(dummy_dat['V1'], dummy_dat['V2']);

Introduction of KMeans class

# fit a KMeans Model on the dummy data. Initialize with n_clusters = 3
model = KMeans(n_clusters=3)
model.fit(dummy_dat)

Notice the init and n_init parameters!
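To see why those parameters matter, here is an illustrative comparison (the specific values are arbitrary choices): a single run from random starting centers can settle in a poor local minimum, while multiple restarts keep the run with the lowest inertia. On a well-separated dataset like this one the two will often agree; the restarts matter more when clusters overlap.

# One run from purely random starting centers...
single = KMeans(n_clusters=3, init='random', n_init=1, random_state=0).fit(dummy_dat)
# ...versus k-means++ seeding with ten restarts, keeping the best run
multi = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0).fit(dummy_dat)
print(single.inertia_, multi.inertia_)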

# Inspect the cluster centers attribute
fig, ax = plt.subplots()
ax.scatter(dummy_dat['V1'], dummy_dat['V2'])
for i in range(len(model.cluster_centers_)):
    ax.scatter(model.cluster_centers_[i][0],
               model.cluster_centers_[i][1]);
# Use the predict method on a list of 2 x and y values
# (the coordinates below are arbitrary examples)
model.predict([[7.0, 1.0], [60.0, -40.0]])
fig, ax = plt.subplots()
ax.scatter(dummy_dat['V1'], dummy_dat['V2'],
           c='#f30303');
fig, ax = plt.subplots()
ax.scatter(dummy_dat['V1'], dummy_dat['V2'],
           c= model.labels_);
labeled_df = pd.concat([dummy_dat, pd.DataFrame(model.labels_,
                        columns=['cluster'])], axis=1)
labeled_df.head()
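With cluster labels attached to the original data, we can start profiling the groups, which is the whole point in the marketing scenario above. A quick sketch:

# Average V1 and V2 per cluster
labeled_df.groupby('cluster').mean()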

Part 2: Cluster Validation: Choosing the appropriate value of $k$

Two metrics we can use: elbow method and the silhouette coefficient

Part 2A: Elbow Method

The elbow method uses the sum of squared errors (SSE) calculated for each value of $k$ to find the best value of $k$.

This is sometimes called the "inertia" of the model, and fitted sklearn $k$-means models have an inertia_ attribute.

Lower inertia seems better, but inertia will always decrease as the number of clusters grows. Hence the idea of looking for an elbow in the plot of inertia vs. $k$.

model.inertia_

Inertia is the sum of squared distances between points and their cluster center.
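As a sanity check on that definition, we can recompute inertia by hand and compare it to the attribute (this sketch assumes model was fit on dummy_dat above):

# Each point's own cluster center, then the summed squared distances
own_centers = model.cluster_centers_[model.labels_]
manual_inertia = ((dummy_dat.values - own_centers) ** 2).sum()
print(manual_inertia, model.inertia_)  # these should match closely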

# Specifying the dataset and initializing variables
X = dummy_dat
distortions = []

# Calculate SSE for different K
for k in range(2, 10):
    kmeans = KMeans(n_clusters=k, random_state=301)
    kmeans.fit(X)
    distortions.append(kmeans.inertia_)

# Plot values of SSE
fig, ax = plt.subplots(figsize=(10, 8))
ax.set_title('Elbow curve')
ax.set_xlabel('k')
ax.set_ylabel('inertia (SSE)')
ax.plot(range(2, 10), distortions)
ax.grid(True)
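Reading the elbow off the plot is usually done by eye, but one rough heuristic is to pick the $k$ where the decrease in inertia slows the most, i.e. the largest second difference of the SSE curve. A sketch under that assumption:

# Heuristic only: k at the largest second difference of the SSE values
elbow_k = range(2, 10)[int(np.argmax(np.diff(distortions, 2))) + 1]
print(elbow_k)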

Part 2B: Silhouette Coefficient

For a single point, the silhouette coefficient is:

$\large s = \frac{b - a}{\max(a, b)}$

$a$ refers to the average distance between a point and all other points in its own cluster.

$b$ refers to the average distance between that same point and all points in the nearest cluster to which it does not belong.

It is calculated for each point in the dataset, then averaged across all points for one cumulative score.

The Silhouette Coefficient ranges between -1 and 1. The closer to 1, the more clearly defined the clusters are. The closer to -1, the more incorrect the cluster assignments.

Suppose:

  • I have four points in a one-dimensional space: 0, 1, 9, and 10; and
  • I put them into two clusters: {0, 1} and {9, 10}.

Then we would calculate the Silhouette Score as follows:

For Point 0:

  • $a=1$
  • $b=9.5$
  • $s(0) = \frac{9.5 - 1}{9.5} = \frac{17}{19}$

For Point 1:

  • $a=1$
  • $b=8.5$
  • $s(1) = \frac{8.5 - 1}{8.5} = \frac{15}{17}$

For Point 9:

  • $a=1$
  • $b=8.5$
  • $s(9) = \frac{8.5 - 1}{8.5} = \frac{15}{17}$

For Point 10:

  • $a=1$
  • $b=9.5$
  • $s(10) = \frac{9.5 - 1}{9.5} = \frac{17}{19}$

The full Silhouette Score would be the average of all of these individual scores:

$\large s = \frac{2\left(\frac{17}{19}\right) + 2\left(\frac{15}{17}\right)}{4} \approx 0.89$
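We can check this arithmetic with sklearn's silhouette_score on the same four one-dimensional points (using the metrics and numpy imports from earlier):

X_check = np.array([[0], [1], [9], [10]])
labels_check = [0, 0, 1, 1]
print(metrics.silhouette_score(X_check, labels_check))  # ≈ 0.8885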

# Generate silhouette coefficient for each k
X = dummy_dat
silhouette_plot = []
for k in range(2, 10):
    clusters = KMeans(n_clusters=k, random_state=10)
    cluster_labels = clusters.fit_predict(X)
    silhouette_avg = metrics.silhouette_score(X, cluster_labels)
    silhouette_plot.append(silhouette_avg)
# Plot Silhouette coefficient
fig, ax = plt.subplots(figsize=(10, 8))
ax.set_title('Silhouette coefficients over k')
ax.set_xlabel('k')
ax.set_ylabel('silhouette coefficient')
ax.plot(range(2, 10), silhouette_plot)
ax.axhline(y=np.mean(silhouette_plot), color="red", linestyle="--")
ax.grid(True)
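If we want to pick $k$ programmatically rather than by eye, the highest average silhouette score over the range we searched is a natural choice:

best_k = range(2, 10)[int(np.argmax(silhouette_plot))]
print(best_k)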

Activity

Let's practice k-means clustering with an image of a piece of art.

# Our new clustering class
from sklearn.cluster import KMeans

import matplotlib.pyplot as plt
# Allows us to visualize images through matplotlib plot methods
import matplotlib.image as mpimg

# Old favorites
import pandas as pd
import numpy as np

Let's look at a colorful Miró painting with matplotlib.

fig, ax = plt.subplots(figsize=(10,10))
img = mpimg.imread('data/miro.jpg')
imgplot = ax.imshow(img)
# What is the shape of the image, and what does each component represent?
img.shape  # (height, width, RGB channels)
# Let's look at one pixel
img[0, 0]
# Flatten the image so that each row represents one RGB triad
img_reshape = img.reshape(-1, 3)
# Check the shape
img_reshape.shape
# after clustering, we will restore the original shape
# the code below demonstrates that the original image is restored by reshaping
# to the original dimensions 

fig, ax = plt.subplots(figsize=(10,10))
img = mpimg.imread('./data/miro.jpg')
restored_image = img_reshape.reshape(img.shape[0],img.shape[1], 3)
imgplot = ax.imshow(restored_image)

In pairs: 10 minute exercise

Now, in pairs, we will use the KMeans algorithm to reduce the number of colors in the photo.

Start by reducing the number of colors to 2. To do so, we will have to pass an appropriate argument when instantiating a KMeans object. The number of clusters we initialize will determine the number of colors that the image is reduced to.

In order to visualize the groupings, we will replace the original pixel values with the cluster centers associated with the assigned label.

# Reminder of our flattened image
img_reshape.shape
# Instantiate a KMeans object with the argument n_clusters equal to 2
km = KMeans(n_clusters=2)
# Fit the km object to img_reshape
km.fit(img_reshape)
# view the assigned labels via the labels_ attribute
km.labels_
# view the cluster centers via the cluster_centers_ attribute
km.cluster_centers_
# create a list which stores the cluster center associated with each label.
# The list should be 1734000 elements long

label_centers = []
for label in km.labels_:
    label_centers.append(km.cluster_centers_[label])
# Convert list to array
centers_2 = np.array(label_centers)
# check shape is (1734000, 3)
centers_2.shape
# reshape to (1200, 1445, 3)
new_image_2 = centers_2.reshape(img.shape[0], img.shape[1], 3)
new_image_2.shape
# Run the cell below to plot the new image.  It should have only 2 colors
fig, ax = plt.subplots(figsize=(10,10))
imgplot = ax.imshow(new_image_2.astype(int))
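As a side note, the label loop above can be replaced by a single vectorized indexing step: indexing the cluster-center array with the label array looks up each pixel's center at once. A sketch, assuming km was fit on img_reshape as in the exercise:

new_image_fast = km.cluster_centers_[km.labels_].reshape(img.shape).astype(int)
# new_image_fast should match new_image_2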

Explain in your own words why the image looks like it does.

Write answer here

Now, try out different numbers of clusters and see their effect on the painting.
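A sketch for that experiment (the k values below are arbitrary choices, and fitting on the full image can take a little while):

for k in [2, 4, 8, 16]:
    km_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit(img_reshape)
    reduced = km_k.cluster_centers_[km_k.labels_].reshape(img.shape)
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.set_title(f'k = {k} colors')
    ax.imshow(reduced.astype(int))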