-- THIS ANALYSIS IS DONE IN R USING JUPYTER NOTEBOOK --
As the name suggests, this project is about dividing the customer base into different segments. The idea is to group customers who share similar characteristics. How these groups are formed depends on the business objectives and the data available. Through this project we can derive insights into Customer Lifetime Value, purchase channel and product proclivities, so a business can use that information to guide future decisions.
Customer segmentation can be achieved using a variety of customer demographics such as age, gender, marital status, etc. However, such information is not easily available. What is easily available is TRANSACTIONAL DATA (customer accounts, invoices, invoice dates and times, etc.). How can the customers be segmented with only that?
Although it depends on the business objectives, let's use RFM (Recency, Frequency and Monetary Value) metrics to identify the high value and low value customers of the business, so that these groups can be targeted for marketing purposes.
The data was obtained from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Online+Retail
As previously mentioned, the data does not include any demographic information about the customers, so we use the RFM metrics to segment them:
- RECENCY -- How recently did the customer make their last purchase?
- FREQUENCY -- How often does the customer buy? How many purchases did they make over the given time frame?
- MONETARY VALUE -- How much revenue does each customer bring in?
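The three metrics above can be derived from invoice-level data roughly as follows. This is a minimal base R sketch, not the notebook's actual code: the column names mirror the Online Retail data set, but the six rows are made up, and measuring recency from the day after the last invoice is an assumption.

```r
# Toy transactions; column names follow the Online Retail data set,
# the values are invented for illustration.
tx <- data.frame(
  CustomerID  = c(1, 1, 2, 3, 3, 3),
  InvoiceNo   = c("A1", "A2", "B1", "C1", "C2", "C2"),
  InvoiceDate = as.Date(c("2011-01-05", "2011-11-20", "2011-03-15",
                          "2011-10-01", "2011-12-01", "2011-12-01")),
  Quantity    = c(2, 1, 5, 3, 1, 4),
  UnitPrice   = c(10, 20, 2, 5, 50, 1)
)
tx$Amount <- tx$Quantity * tx$UnitPrice

# Reference date for recency: the day after the last invoice (an assumption).
ref_date <- max(tx$InvoiceDate) + 1

# Recency: days since the customer's most recent invoice.
recency <- tapply(tx$InvoiceDate, tx$CustomerID,
                  function(d) as.numeric(ref_date - max(d)))
# Frequency: number of distinct invoices per customer.
frequency <- tapply(tx$InvoiceNo, tx$CustomerID,
                    function(x) length(unique(x)))
# Monetary value: total amount spent per customer.
monetary <- tapply(tx$Amount, tx$CustomerID, sum)

rfm <- data.frame(
  CustomerID = as.numeric(names(recency)),
  Recency    = as.numeric(recency),
  Frequency  = as.integer(frequency),
  Monetary   = as.numeric(monetary)
)
print(rfm)
```

Counting distinct invoices rather than rows keeps a multi-item invoice from inflating frequency.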
The Pareto principle (the 80/20 rule) says that roughly 80% of the results come from 20% of the causes. In this context, roughly 80% of sales come from 20% of the customers. Meaning, the top 20% of customers contribute most of the sales -- these are our high value customers!
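Whether the 80/20 split actually holds can be checked directly from the monetary values. The sketch below uses a hypothetical revenue vector, not figures from the data set, and simply measures the share of revenue contributed by the top 20% of customers.

```r
# Hypothetical revenue per customer (not taken from the data set).
monetary <- c(500, 120, 80, 60, 50, 40, 30, 20, 60, 40)

# Rank customers from highest to lowest revenue.
sorted <- sort(monetary, decreasing = TRUE)

# Size of the top 20% group, rounded up.
top_n <- ceiling(0.2 * length(sorted))

# Share of total revenue contributed by that group.
top_share <- sum(sorted[seq_len(top_n)]) / sum(sorted)
print(top_share)
```

On real data the share will rarely be exactly 80%; the point is only to see how concentrated the revenue is.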
This plot is very hard to read because our RFM variables are highly skewed!
In this project, outliers are VERY IMPORTANT! Outliers are customers with either very high or very low value, and both of these groups carry useful information. Therefore, I will include them in the analysis!
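One common way to make skewed variables like these easier to plot and compare is a log transform followed by standardization. The sketch below shows that option on made-up RFM values; it is an assumption about a reasonable preprocessing step, not a record of what the notebook did.

```r
# Made-up RFM values with a long right tail.
rfm <- data.frame(
  Recency   = c(1, 12, 262, 30, 5),
  Frequency = c(20, 2, 1, 3, 50),
  Monetary  = c(6900, 40, 10, 150, 20000)
)

# log1p(x) = log(1 + x) compresses large values and is defined at zero.
rfm_log <- as.data.frame(lapply(rfm, log1p))

# Standardize each column to mean 0 and standard deviation 1 so that
# no single metric dominates a distance-based method.
rfm_scaled <- scale(rfm_log)
print(round(rfm_scaled, 2))
```

Note that the transform keeps every customer in the data, so extreme customers are compressed rather than dropped.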
- K-MEANS gives disjoint sets - I wanted each customer to belong to one and only one segment!
- The data set had around 541,000 rows of transaction records. Therefore, time complexity could be an issue. Each K-means iteration scales linearly with the number of observations, O(n), whereas agglomerative hierarchical clustering is at least quadratic, O(n^2)!
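The disjoint-segments property can be seen in a small sketch with `kmeans` from base R's stats package. The two-column matrix here is simulated and merely stands in for scaled RFM features; the choice of 2 centers and `nstart = 10` is illustrative.

```r
set.seed(42)

# Simulated stand-in for scaled customer features: two well-separated groups.
x <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2)
)

# nstart = 10 reruns the algorithm from 10 random starts and keeps the best fit.
km <- kmeans(x, centers = 2, nstart = 10)

# km$cluster holds exactly one label per row, so the segments are disjoint.
print(table(km$cluster))
print(km$centers)
```

Because every row gets a single label in `km$cluster`, each customer ends up in one and only one segment.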
To choose the number of clusters, we can use several methods:
- Elbow method -- suggested a 2- or 3-cluster solution
- Silhouette method -- suggested a 2-cluster solution
- Gap statistic method -- suggested a 6-cluster solution
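The first two methods can be sketched in base R as below. The data are three simulated blobs, not the customer data, and `avg_silhouette` is a hypothetical helper written for this example; in practice the cluster package's `silhouette` and `clusGap` functions are the usual route for the silhouette and gap statistic.

```r
set.seed(1)

# Simulated data: three well-separated blobs standing in for scaled RFM features.
blob <- function(cx, cy) cbind(rnorm(40, cx, 0.3), rnorm(40, cy, 0.3))
x <- rbind(blob(0, 0), blob(5, 5), blob(10, 0))

# Elbow method: total within-cluster sum of squares for k = 1..6.
wss <- sapply(1:6, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)

# Hypothetical helper: average silhouette width for a given labelling.
avg_silhouette <- function(x, cl) {
  d <- as.matrix(dist(x))
  s <- sapply(seq_len(nrow(x)), function(i) {
    own <- cl == cl[i]
    if (sum(own) == 1) return(0)  # singleton clusters score 0 by convention
    a <- sum(d[i, own]) / (sum(own) - 1)  # mean distance within own cluster
    b <- min(sapply(setdiff(unique(cl), cl[i]),
                    function(k) mean(d[i, cl == k])))  # nearest other cluster
    (b - a) / max(a, b)
  })
  mean(s)
}

# Silhouette method: pick the k with the highest average silhouette width.
ks <- 2:6
sil <- sapply(ks, function(k) {
  avg_silhouette(x, kmeans(x, centers = k, nstart = 20)$cluster)
})
best_k <- ks[which.max(sil)]

print(round(wss, 1))
print(round(sil, 3))
print(best_k)
```

Plotting `wss` against k shows the elbow as the point where the curve flattens; on real, overlapping customer data the methods can disagree, as they did here.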
The decision should be based on how the business plans to use the results and the level of granularity it wants to see in the clusters. In my opinion, a 4-cluster solution works best: 1 group of high value customers, 2 groups of mid value customers, and 1 group of zero value/low value customers with low frequency, low revenue and no recent purchases.