Keep in mind that humans constantly and automatically cluster the sensory data we ingest. Every day you are categorizing like objects and phenomena that you see, hear, smell and taste.
The task gets extremely difficult when you are trying to group many rows (observations) across many features (columns); the mind simply can't do the same automatically with a data set of any meaningful size.
"Efficient partitioning of large data sets into homogeneous clusters is a fundamental problem in data mining" -Huang, 1997
Similar to many tasks of machine learning, clustering aims to mimic the mind's process, but formally and on a larger scale.
1. Initial Exploratory Data Analysis/Reporting: How many naturally occurring groups exist within the data?
a. Helpful for a first-pass analysis where the idea is to find the 'like' groups that exist in the data
b. Research question may be something vague like "We want to get a feel for the data."
c. This should be in the EDA toolkit of every analyst because it is a LESS ARBITRARY method of categorizing things than, say:
i. Cut all observations up into thirds based on one variable and call it "Low/Medium/High"
ii. Cutting things into deciles (I hate this)
iii. Look at (low spend + high frequency) vs (high spend + high frequency) vs (low spend + low frequency) vs (high spend + low frequency) customer quadrants
iv. Pure Domain knowledge alone - "I noticed that people generally above this threshold tend to behave differently from other groups"
Note that some of the traditional, arbitrary ways of partitioning a dataset are based on one or two metrics, tops. It's not easy to rely on domain knowledge and intuition alone when you have dozens or hundreds of features.
2. Feature engineering for a model that will eventually include the cluster labels as an independent variable
a. Interestingly, when you assign cluster membership to row-wise observations, you have essentially engineered a new, condensed and discrete representation of your feature set.
b. In that way, row-wise clustering actually has some of the effective characteristics of column-wise clustering. By partitioning the rows into discrete groups, you have mapped multiple columns into a single feature (see the short sketch below).
Once natural categories have been detected in the data, can I then use them as my Y and predict future membership in that cluster?
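To make that concrete, here is a hypothetical sketch of engineering a cluster-label feature with scikit-learn; the random feature matrix and the choice of k=3 are assumptions for illustration only.

## Hypothetical sketch: engineer a cluster-label feature with scikit-learn
## (the random feature matrix and k=3 are made up for illustration)
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 5))           # stand-in for your numeric feature set

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(features)  # one discrete label per row

## The five columns are now condensed into one new discrete feature, which could be
## appended to the original data or used as the Y for a future classifier.
engineered = np.column_stack([features, cluster_labels])
print(engineered[:5])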
Goal: I want to partition my data set into 'k' clusters. The clusters should be as similar as possible internally and as different as possible from each other.
How do we measure internal homogeneity and external separation (aka 'distance')? The goal, more formally put, is to find the means of 'k' clusters (centroids) such that we minimize the within-cluster Euclidean distance:
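One standard way to write that objective, the within-cluster sum of squared Euclidean distances, is:

$$\underset{C_1, \ldots, C_k}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$

where $C_i$ is the set of observations assigned to cluster $i$ and $\mu_i$ is that cluster's mean (its centroid).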
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np

## Creating a two-dimensional feature set with obvious clusters
X = np.array([[1, 2],
              [4, 6],
              [2, 5],
              [20, 25],
              [20, 30],
              [30, 40]])

plt.scatter(X[:, 0], X[:, 1], s=200)
plt.show()
## Steps in the K-Means Algorithm
As defined above, the K-Means algorithm is iterative and can be described in the following steps (a short from-scratch sketch follows the list):
1. Initialize randomly selected means for the specified number of $k$ clusters
2. Calculate the Euclidean distance from every observation to every cluster center
3. Assign each observation to the cluster whose centroid it is closest to
4. Take the average of the features by cluster; these are your new cluster means
5. Repeat steps 2 through 4 until cluster membership becomes stable
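Here is a minimal from-scratch sketch of those steps applied to the toy X above; it is a bare-bones illustration (the iteration cap and seed are arbitrary choices), not a production implementation.

## Minimal from-scratch K-Means sketch (illustration only)
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize means by picking k random observations
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iters):
        # Step 2: Euclidean distance from every observation to every centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 3: assign each observation to its closest centroid
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned rows
        # (keep the old centroid if a cluster happens to end up empty)
        new_centroids = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                  else centroids[i] for i in range(k)])
        # Step 5: stop once the means (and therefore the memberships) are stable
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

labels, centroids = k_means(X, k=2)
print(labels)
print(centroids)

With k=2 on the toy data, the three points near the origin and the three larger points should land in separate clusters.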
Check out this awesome animated simulation that brilliantly visualizes the initialization and iteration towards stable cluster membership of the algorithm.
In real-world data sets we often have data that is categorical and extremely useful. Maybe the categories are strings like 'Red', 'Blue', 'Green'.
Even if you have done your ML pre-processing and one-hot encoded all your dummies into binary numeric 1/0 columns, you still have a problem.
This problem is easy to miss because most K-Means implementations will happily take your integer dummy variable and not throw an error.
Does the value of a dummy-encoded '1' actually have any magnitudinal (made-up word, but you feel me?) distance from '0'? Or maybe you have encoded your binaries as '-1' and '1'.
"Is the Euclidean distance between 1 and 0, between -1 and 1, or between "Red" and "Green" relevant or even calculable?" - The Great Nick Fusaro
Let's illustrate this problem by tweaking our two-dimensional array to swap in a dummy variable for one of the continuous variables:
## Creating a two-dimensional feature set with a dummy variable
X = np.array([[1, 1],
              [4, 0],
              [2, 0],
              [3, 1],
              [4, 0],
              [8, 1]])

plt.scatter(X[:, 0], X[:, 1], s=200)
plt.show()
## Creating a two-dimensional feature set with two dummy variables
X = np.array([[0, 1],
              [1, 0],
              [1, 1],
              [0, 1],
              [0, 0],
              [0, 1]])

plt.scatter(X[:, 0], X[:, 1], s=200)
plt.show()
The summed distances between any means in these examples have no relevance.
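To make the point concrete, here is a quick illustrative check of the pairwise Euclidean distances in the two-dummy example (the array is repeated so the cell runs on its own): every nonzero distance is either 1 or the square root of 2, numbers that say nothing meaningful about how 'different' two categorical profiles actually are.

## Illustrative check: pairwise Euclidean distances between dummy-encoded rows
import numpy as np

X = np.array([[0, 1],    # same two-dummy feature set as above
              [1, 0],
              [1, 1],
              [0, 1],
              [0, 0],
              [0, 1]])

diffs = X[:, None, :] - X[None, :, :]
pairwise = np.linalg.norm(diffs, axis=2)
print(np.round(np.unique(pairwise), 3))   # only 0.0, 1.0 and ~1.414 appear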
As Chandola et al. (2007) wrote, "The key characteristic of categorical data is that different values that a categorical attribute takes are not inherently ordered."
There is, however, a rich body of work across several disciplines that deals with the problem of categorical similarity. The (still) well-used chi-squared statistic is a venerable technique derived in the 1800s for probabilistically comparing categorical contingency tables.
Several of the approaches look at the mismatches and overlaps between categories. One of those measures of overlap is the Hamming distance, which itself uses the Kronecker delta:
Simply put, the Kronecker Delta returns a 1 if two symbols match and a 0 if they do not.
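In symbols, comparing two observations $X$ and $Y$ across $m$ categorical attributes and writing $\delta$ for the Kronecker delta:

$$\delta(x, y) = \begin{cases} 1 & \text{if } x = y \\ 0 & \text{if } x \neq y \end{cases} \qquad d_{\text{Hamming}}(X, Y) = \sum_{j=1}^{m} \bigl(1 - \delta(x_j, y_j)\bigr)$$

In other words, the Hamming distance simply counts the attributes on which the two observations disagree.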
Huang (1997) stitched together a solution that accounts for numeric data using K-Means and a reversed Kronecker delta to compute similarity (or its inverse, which is dissimilarity):
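Paraphrasing Huang (1997), the combined dissimilarity between an observation $X$ and a cluster prototype $Q$ with $p$ numeric attributes and $m - p$ categorical attributes looks roughly like:

$$d(X, Q) = \sum_{j=1}^{p} (x_j - q_j)^2 \;+\; \gamma \sum_{j=p+1}^{m} \bar{\delta}(x_j, q_j)$$

where $\gamma$ is a weight that balances the categorical term against the numeric term and $\bar{\delta}$ is the reversed Kronecker delta described next.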
He did it, he really did it, a reverse Kronecker:
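In the notation used above, that reversed delta is simply:

$$\bar{\delta}(x, y) = 1 - \delta(x, y) = \begin{cases} 0 & \text{if } x = y \\ 1 & \text{if } x \neq y \end{cases}$$

so a mismatch, rather than a match, is what adds to the distance.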
It is almost algorithmically identical to traditional K-Means, except that the within-cluster dissimilarity you are now minimizing includes a chunk that measures the 'categorical lean' of each cluster.
The TL;DR of this is that people who should know better abuse K-Means by throwing categorical variables into the mix as one-hot encoded binary fields. People who do know better are often faced with a choice: follow best practices and drop those variables from the K-Means algorithm, losing valuable information. K-Prototypes offers an alternative approach that lets you capture both types of information.
In terms of data science projects, my experience with this followed a familiar pattern:
- Hmm, I want to do that but I can't or shouldn't. Maybe someone else has run into this problem.
- Days of reading possible solutions
- Days of implementing possible solutions
- Repeat the reading and implementing steps until convergence or vacation
- Present the findings in a business context and leave out most of the technical stuff, which is really the most interesting part to me
Niko DeVos created a Python implementation of both K-Modes (categorical clustering only) and K-Prototypes, which will be detailed in Part II, when I go over an applied example of K-Prototypes.
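As a small preview of Part II, a hypothetical call to that implementation (the kmodes package) might look like the sketch below; the mixed data, the cluster count and the index of the categorical column are all made up for illustration.

## Hypothetical preview of K-Prototypes via the kmodes package
import numpy as np
from kmodes.kprototypes import KPrototypes

## Mixed data: one numeric column and one categorical column (made-up values)
X = np.array([[25.0, 'Red'],
              [30.0, 'Blue'],
              [27.0, 'Red'],
              [61.0, 'Green'],
              [58.0, 'Green'],
              [63.0, 'Blue']], dtype=object)

kproto = KPrototypes(n_clusters=2, init='Huang', random_state=0)
clusters = kproto.fit_predict(X, categorical=[1])   # column 1 holds the categorical values
print(clusters)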
https://pythonprogramming.net/k-means-from-scratch-machine-learning-tutorial/
https://pdfs.semanticscholar.org/d42b/b5ad2d03be6d8fefa63d25d02c0711d19728.pdf
https://support.minitab.com/en-us/minitab/18/help-and-how-to/modeling-statistics/multivariate/how-to/cluster-k-means/interpret-the-results/all-statistics-and-graphs/
https://www.researchgate.net/publication/267680144_Similarity_Measures_for_Categorical_Data-A_Comparative_Study_Similarity_Measures_for_Categorical_Data-A_Comparative_Study
https://github.com/nicodv/kmodes#huang97