-
Assessed what scenarios could use
$k$ -means -
Articulated the methodology used by
$k$ -means -
Applied KMeans from sklearn.cluster to a relevant dataset
-
Selected the appropriate number of clusters using
$k$ -means and the elbow method -
Practiced applying kmeans to an image color reduction problem
You work for the marketing department within a large company that manages a customer base. For each customer you have a record of average purchase cost and time since last purchase.
You know that if you want to retain your customers you cannot treat them the same. You can use targeted marketing ads towards groups that demonstrate different behavior, but how will you divide the customers into groups?
We are continuing to use Scikit Learn as our main library. The specific documentation for k-means can be found here.
import os
import sys
module_path = os.path.abspath(os.pardir)
if module_path not in sys.path:
sys.path.append(module_path)
# Required packages for today
from sklearn.cluster import KMeans
from sklearn import metrics
from sklearn import datasets
# Familiar packages for plotting, data manipulation, and numeric functions
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# import Alison's code for the demo clusters
from src.demo_images import *
# Have plots appear in notebook
%matplotlib inline
# Default plot params
plt.style.use('seaborn')
cmap = 'tab10'
How many groups do you see?
In classification you know what groups are in the dataset and the goal is to predict class membership accurately.
In clustering you do not know which groups are in the dataset and you are trying to identify the groups.
Clustering is often an informing step in your analysis. Once clusters are identified, one can:
- Create strategies on how to approach each group differently
- Use cluster membership as an independent variable in a predictive model
- Use the clusters as the target label in future classification models. How would you assign new data to the existing clusters?
- What do they have in common?
- What are the differences between them?
- How many groups are there in the end?
- Do you see any problems with this method?
- What do they have in common?
- What are the differences between them?
- How many groups are there in the end?
- Do you see any problems with this method?
In common:
- Green dots starts at points
- Calculates distance
- Moves dots
- Re-measures distance
- Moves dots as needed
Differences:
- Dots start in different places and groups settle in different places
Groups:
- There are four groups
Problem with this method?
- Too variable!
The steps of the KMeans algorithm are pretty straightforward:
- Initialize cluster centers.
- Calculate the distance of every point to in the data set to each cluster center, and assign each point to the closest center.
- Make new cluster centers assigned to the averge of a all points labeled to the cluster.
- Repeat until some criteria has been met (the clusters no longer move)
Sci-kit Learn documentation actually has some pretty good documentation describing the algorithm if you wish for more detail.
- This is a sample dataset.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
dummy_dat = pd.read_csv("data/xclara.txt",
header=0,
index_col=0)
dummy_dat.reset_index(inplace=True)
dummy_dat.drop('index', axis=1, inplace=True)
dummy_dat.head()
dummy_dat.tail()
dummy_dat.describe()
fig, ax = plt.subplots()
ax.scatter(dummy_dat['V1'], dummy_dat['V2']);
# fit a KMeans Model on the dummy data. Initialize with n_clusters = 3
model = None
Notice the init
and n_init
parameters!
# Inspect the cluster centers attribute
fig, ax = plt.subplots()
ax.scatter(dummy_dat['V1'], dummy_dat['V2'])
for i in range(len(model.cluster_centers_)):
ax.scatter(model.cluster_centers_[i][0],
model.cluster_centers_[i][1]);
# Use the predict method on a list of 2 x and y values
fig, ax = plt.subplots()
ax.scatter(dummy_dat['V1'], dummy_dat['V2'],
c='#f30303');
fig, ax = plt.subplots()
ax.scatter(dummy_dat['V1'], dummy_dat['V2'],
c= model.labels_);
labeled_df = pd.concat([dummy_dat, pd.DataFrame(model.labels_,
columns=['cluster'])], axis=1)
labeled_df.head()
Elbow method uses the sum of squared error calculated from each instance of
This is sometimes called the "inertia" of the model, and fitted sklearn inertia_
attribute.
Fewer clusters seems better, but inertia will always decrease with more clusters. Hence the idea of looking for an elbow in the plot of inertia vs.
model.inertia_
Inertia is the sum of squared distances between points and their cluster center.
# Specifying the dataset and initializing variables
X = dummy_dat
distortions = []
# Calculate SSE for different K
for k in range(2, 10):
kmeans = KMeans(n_clusters=k, random_state=301)
kmeans.fit(X)
distortions.append(kmeans.inertia_)
# Plot values of SSE
fig, ax = plt.subplots(figsize=(10, 8))
ax.set_title('Elbow curve')
ax.set_xlabel('k')
ax.plot(range(2, 10), distortions)
ax.grid(True)
a refers to the average distance between a point and all other points in that cluster.
b refers to the average distance between that same point and all other points in clusters to which it does not belong
It is calculated for each point in the dataset, then averaged across all points for one cumulative score.
The Silhouette Coefficient ranges between -1 and 1. The closer to 1, the more clearly defined are the clusters. The closer to -1, the more incorrect assignment.
Suppose:
- I have four points in a one-dimensional space: 0, 1, 9, and 10; and
- I put them into two clusters: {0, 1} and {9, 10}.
Then we would calculate the Silhouette Score as follows:
For Point 0:
$a=1$ $b=9.5$ $s(0) = \frac{9.5 - 1}{9.5} = \frac{17}{19}$
For Point 1:
$a=1$ $b=8.5$ $s(1) = \frac{8.5 - 1}{8.5} = \frac{15}{17}$
For Point 9:
$a=1$ $b=8.5$ $s(9) = \frac{8.5 - 1}{8.5} = \frac{15}{17}$
For Point 10:
$a=1$ $b=9.5$ $s(10) = \frac{9.5 - 1}{9.5} = \frac{17}{19}$
The full Silhouette Score would be the average of all of these individual scores:
# Generate silhouette coefficient for each k
X = dummy_dat
silhouette_plot = []
for k in range(2, 10):
clusters = KMeans(n_clusters=k, random_state=10)
cluster_labels = clusters.fit_predict(X)
silhouette_avg = metrics.silhouette_score(X, cluster_labels)
silhouette_plot.append(silhouette_avg)
# Plot Silhouette coefficient
fig, ax = plt.subplots(figsize=(10, 8))
ax.set_title('Silhouette coefficients over k')
ax.set_xlabel('k')
ax.set_ylabel('silhouette coefficient')
ax.plot(range(2, 10), silhouette_plot)
ax.axhline(y=np.mean(silhouette_plot), color="red", linestyle="--")
ax.grid(True)
Let's practice k-means clustering with an image of a piece of art.
# Our new clustering class
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Allows us to visualize images through matplotlib plot methods
import matplotlib.image as mpimg
# Old favorites
import pandas as pd
import numpy as np
Let's look at a colorful Miro painting with matplotlib.
fig, ax = plt.subplots(figsize=(10,10))
img = mpimg.imread('data/miro.jpg')
imgplot = ax.imshow(img)
# What is the shape of the image, and what does each component represent?
# Code here
# Let's look at one pixel
# Flatten the image so that each row represents one RGB triad
img_reshape = img.reshape()
# Check the shape
img_reshape.shape
# after clustering, we will restore the original shape
# the code below demonstrates that the original image is restored by reshaping
# to the original dimensions
fig, ax = plt.subplots(figsize=(10,10))
img = mpimg.imread('./data/miro.jpg')
restored_image = img_reshape.reshape(img.shape[0],img.shape[1], 3)
imgplot = ax.imshow(restored_image)
Now, in pairs, we will use the KMeans algorithm to reduce the number of colors in the photo.
Start by reducing the number of colors to 2. To do so we will have to pass an appropriate argument when instantianting a KMeans object. The number of clusters we initiate will determine the number of colors that the image is reduced to.
In order to visualize the groupings, we will replace the original pixel values with the cluster centers associated with the assigned label.
# Reminder of our flattened image
img_reshape.shape
# Instantiate a KMeans object with the argument n_clusters equal to 2
# code here
km = None
# Fit the km object to img_reshape
# code here
# view the assigned labels via the labels_ attribute
# code here
# view the cluster centers via the cluster_centers_ attribute
# code here
# create a list which stores the cluster center associated with each label in a list.
# The list should be 1734000 elements long
label_centers = []
for label in km.labels_:
None
# Convert list to array
centers_2 = np.array(label_centers)
# check shape is (1734000, 3)
centers_2.shape
# reshape to (1200, 1445, 3)
new_image_2 = None
new_image_2.shape
# Run the cell below to plot the new image. It should have only 2 colors
fig, ax = plt.subplots(figsize=(10,10))
imgplot = ax.imshow(new_image_2.astype(int))
Write answer here
Now, try out different numbers of clusters and see their affect on the painting.