In this notebook, I performed an exploratory data analysis (EDA) and clustered customers with the K-Means algorithm on Kaggle's Mall Customer Segmentation dataset. Specifically, I
- analyzed the general shape and columns of the dataset,
- ran a univariate analysis of four customer-related columns to find the distributions of age, income, gender, and spending score,
- used bivariate analysis to visualize the relationships among the four columns and picked the two relationships that show clusters, and
- clustered customers with K-Means on two feature pairs: customer ID vs. spending score, and annual income vs. spending score.
Here are some of the properties of the data I looked at for my EDA.
- Gender Distribution of Customers
- Age Distribution of Customers
- Income Distribution of Customers
The boxplot of the income distribution shows that there are outliers, so I also took a closer look at the outlier customers.
Two customers earn 137K per year, and their customer IDs, genders, and ages are also very similar. Interestingly, their spending scores are very different, sitting almost at opposite ends of the score spectrum. This is surprising, because one would normally expect a customer with a high annual income to buy more from the mall and therefore receive a high spending score.
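The outlier check can be sketched with the 1.5 × IQR rule that boxplots conventionally use for their whiskers. This is a minimal sketch, not the notebook's own code: the helper `iqr_outliers`, the column names, and every value in the toy table are assumptions made up for illustration.

```python
import pandas as pd


def iqr_outliers(df, column, k=1.5):
    """Return the rows whose `column` value lies outside the k * IQR fences."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[(df[column] < lower) | (df[column] > upper)]


# Toy data (made up): two customers share a very high income
# but have very different spending scores.
toy = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Annual Income (k$)": [15, 20, 40, 54, 60, 62, 70, 78, 137, 137],
    "Spending Score (1-100)": [39, 81, 6, 50, 55, 42, 47, 73, 18, 83],
})

outliers = iqr_outliers(toy, "Annual Income (k$)")
print(outliers)
```

On the real dataset, the same helper would be called on the loaded DataFrame instead of `toy`, provided the income column carries the name assumed here.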
- Spending Score Distribution of Customers
- Pairplots
In the pairplot, both the Customer ID vs. Spending Score panel and the Annual Income vs. Spending Score panel appear to contain clusters. I was therefore curious whether the algorithm would form the five clusters I predicted for these features.
Before clustering, let's look at the Customer ID vs. Spending Score scatterplot once more.
The graph clearly shows five clusters: customers with 1) low Customer ID and high Spending Score, 2) low Customer ID and low Spending Score, 3) medium Customer ID and medium Spending Score, 4) high Customer ID and high Spending Score, and 5) high Customer ID and low Spending Score.
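A grid of pairwise scatterplots like the one described can be sketched as follows. The notebook's plotting library is not stated here, so this sketch swaps in pandas' `scatter_matrix` as a stand-in for a pairplot. The column names and the randomly generated toy values are assumptions, so the toy panels will not show the clusters seen in the real data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook

import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Toy data (randomly generated, not the real dataset).
rng = np.random.default_rng(0)
n = 50
toy = pd.DataFrame({
    "CustomerID": np.arange(1, n + 1),
    "Age": rng.integers(18, 71, size=n),
    "Annual Income (k$)": rng.integers(15, 138, size=n),
    "Spending Score (1-100)": rng.integers(1, 101, size=n),
})

# One scatterplot per pair of numeric columns, histograms on the diagonal.
axes = scatter_matrix(toy, figsize=(8, 8), diagonal="hist")
```

Each off-diagonal panel plots one column against another, which is where visually separated groups of points would show up.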
Using K-Means with 5 clusters, the output of the algorithm is as follows:
We can see that the algorithm recovers the five clusters, although some customers with medium ID and medium Spending Score seem to be grouped with the customers with low ID and high Spending Score.
Next, let's look at the Annual Income vs. Spending Score scatterplot once more.
The scatterplot also shows the five clusters. Using K-Means with 5 clusters on these two features, the output is:
This again meets our expectations. Moreover, on these features K-Means no longer appears to group part of the middle cluster with the top-left customers.
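The clustering step on two features can be sketched with scikit-learn's `KMeans`. This is a minimal sketch under stated assumptions rather than the notebook's own code: the column names are assumed, and the data are five synthetic blobs placed to mimic the five groups described above, not the real customers.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

features = ["Annual Income (k$)", "Spending Score (1-100)"]

# Synthetic stand-in data: 20 points around each of five made-up centers.
rng = np.random.default_rng(0)
centers = [(25, 20), (25, 80), (55, 50), (88, 17), (88, 82)]
points = np.vstack([rng.normal(loc=c, scale=3.0, size=(20, 2)) for c in centers])
df = pd.DataFrame(points, columns=features)

# Fit K-Means with 5 clusters; a fixed random_state keeps the run repeatable.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(df[features])

# One (income, score) centroid per cluster.
centroids = kmeans.cluster_centers_
print(df["cluster"].value_counts().sort_index())
```

Coloring a scatterplot of the two features by `df["cluster"]` would give a figure comparable to the outputs described above. The same pattern applies to the Customer ID vs. Spending Score pair by changing `features`.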