Customer-Segmentation-With-RFM

This project addresses customer segmentation and clustering with RFM analysis on a transnational data set that contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers. The dataset can be found at: https://archive.ics.uci.edu/ml/datasets/online+retail

Before diving into the clustering algorithms, I started with data preprocessing and cleaning, in which cancelled transactions and missing entries were dropped. There were also serious outliers that could distort the visualizations and bias the clustering results, so I removed them using the interquartile range (IQR) rule from applied statistics. Second, I did feature extraction and created new features for EDA and clustering: month, day and year features from the raw date feature; a total spend feature obtained by multiplying unit price by quantity for every transaction; and, most importantly, the recency, frequency, and monetary features for RFM analysis. RFM analysis is a marketing technique used to quantitatively rank and group customers based on the recency, frequency and monetary total of their recent transactions, in order to identify the best customers and perform targeted marketing campaigns. Because unsupervised clustering algorithms are quite sensitive to the scale and variance of their input features, I scaled the RFM metrics with StandardScaler from scikit-learn.
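
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes the column names of the UCI Online Retail data (`InvoiceNo`, `CustomerID`, `Quantity`, `UnitPrice`, `InvoiceDate`), that cancelled invoices are those whose `InvoiceNo` starts with "C", and that recency is measured from a snapshot date one day after the last transaction. The helper names `iqr_mask` and `build_rfm`, the 1.5 multiplier, and the toy data are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def iqr_mask(series, k=1.5):
    """Return True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.between(q1 - k * iqr, q3 + k * iqr)


def build_rfm(df):
    """Return one row per customer with Recency, Frequency and Monetary."""
    df = df.dropna(subset=["CustomerID"]).copy()
    # Assumption: cancelled invoices have an InvoiceNo starting with "C".
    df = df[~df["InvoiceNo"].astype(str).str.startswith("C")]
    df["TotalSpend"] = df["Quantity"] * df["UnitPrice"]
    # Assumption: recency is counted from the day after the last transaction.
    snapshot = df["InvoiceDate"].max() + pd.Timedelta(days=1)
    rfm = df.groupby("CustomerID").agg(
        LastPurchase=("InvoiceDate", "max"),
        Frequency=("InvoiceNo", "nunique"),
        Monetary=("TotalSpend", "sum"),
    )
    rfm["Recency"] = (snapshot - rfm["LastPurchase"]).dt.days
    return rfm[["Recency", "Frequency", "Monetary"]]


# Toy transactions (made up for illustration only).
transactions = pd.DataFrame({
    "InvoiceNo": ["1001", "1001", "1002", "C1003", "1004", "1005"],
    "CustomerID": [1.0, 1.0, 2.0, 2.0, None, 1.0],
    "Quantity": [2, 1, 10, -10, 3, 1],
    "UnitPrice": [3.0, 4.0, 2.0, 2.0, 1.0, 5.0],
    "InvoiceDate": pd.to_datetime([
        "2011-12-01", "2011-12-01", "2011-12-08",
        "2011-12-09", "2011-12-09", "2011-12-05",
    ]),
})

rfm = build_rfm(transactions)
scaled = StandardScaler().fit_transform(rfm)  # zero mean, unit variance per column

# The IQR rule on a small series: the value 100 falls outside the fences.
inliers = iqr_mask(pd.Series([1, 2, 3, 4, 100]))
print(rfm)
print(inliers.tolist())
```

Which columns the IQR filter was applied to is not stated above, so the sketch only shows the rule itself on a standalone series.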

Having completed these steps, I conducted a comprehensive EDA. I visualised the RFM features and quantity, drew QQ plots to check normality, and built word clouds from product descriptions to understand what customers who spend more than average bought. I also used word clouds to compare countries and to explore why sales boomed in November. In addition, I ran monthly and daily (parts of the day) time series analyses to understand in which day periods (e.g. morning) customers made most of their purchases. These steps provided interesting results and enriched the analysis. The most notable findings are a dramatic decrease in transactions on Sundays and a striking increase in sales in the mornings.
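
One possible way to bucket transactions into parts of the day and weekdays is sketched below. The bin edges (night before 06:00, morning until 12:00, afternoon until 18:00, evening afterwards), the helper name `part_of_day`, and the sample timestamps are assumptions for illustration; the project may have used different boundaries.

```python
import pandas as pd


def part_of_day(timestamps):
    """Map timestamps to a coarse day period (hypothetical bin edges)."""
    return pd.cut(
        timestamps.dt.hour,
        bins=[0, 6, 12, 18, 24],
        labels=["night", "morning", "afternoon", "evening"],
        right=False,  # intervals are [0, 6), [6, 12), [12, 18), [18, 24)
    )


# Made-up invoice timestamps, for illustration only.
invoice_dates = pd.Series(pd.to_datetime([
    "2011-11-06 09:15",
    "2011-11-07 10:30",
    "2011-11-07 14:00",
    "2011-11-08 19:45",
]))

periods = part_of_day(invoice_dates)
counts_by_period = periods.value_counts()
counts_by_weekday = invoice_dates.dt.day_name().value_counts()

print(counts_by_period)
print(counts_by_weekday)
```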

In the clustering part, the input features were the standardized RFM scores (3 features) for every unique customer in the dataset. The outputs were discrete cluster labels, such as 1 and 2, for every customer. I used 4 distinct clustering algorithms (Agglomerative, K-Means, Mean Shift, and DBSCAN), following the official scikit-learn website and domain experts on customer segmentation. For evaluation, I calculated several metrics that are commonly used in the domain: the Silhouette Score, the Calinski-Harabasz Index, and the Davies-Bouldin Index. These metrics suggest that Agglomerative clustering with 4 clusters outperforms the other algorithms on the Silhouette Score and the Calinski-Harabasz Index. However, one thing to note is that in Agglomerative clustering the monetary feature strongly outweighed the recency and frequency features, to the extent that the clustering was effectively driven by a single feature.
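
The comparison described above could look roughly like the sketch below. It runs the four scikit-learn estimators on a synthetic stand-in for the scaled RFM matrix (three well-separated blobs generated with NumPy) and reports the three metrics. The data, the hyperparameters (`n_clusters=3`, `eps=0.5`, `min_samples=5`) and the choice to drop DBSCAN noise points before scoring are assumptions for illustration, not the settings or results of this project.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans, MeanShift
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled RFM matrix: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0], [-5.0, 5.0, 0.0]])
X = np.vstack([rng.normal(c, 0.5, size=(50, 3)) for c in centers])
X = StandardScaler().fit_transform(X)

# Hypothetical hyperparameters, chosen only to suit the toy data.
models = {
    "KMeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "Agglomerative": AgglomerativeClustering(n_clusters=3),
    "MeanShift": MeanShift(),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=5),
}

scores = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    keep = labels != -1  # DBSCAN marks noise points with -1; drop them here
    if len(set(labels[keep])) < 2:
        continue  # all three metrics need at least two clusters
    scores[name] = {
        "silhouette": silhouette_score(X[keep], labels[keep]),
        "calinski_harabasz": calinski_harabasz_score(X[keep], labels[keep]),
        "davies_bouldin": davies_bouldin_score(X[keep], labels[keep]),
    }

for name, result in scores.items():
    print(name, {metric: round(value, 3) for metric, value in result.items()})
```

Higher Silhouette and Calinski-Harabasz values and lower Davies-Bouldin values indicate tighter, better-separated clusters.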

In terms of interpretable, meaningful results, K-Means also performed fairly well, with a Silhouette Score of 0.50 with 3 clusters. The results of this clustering suggest three interesting customer segments to which different marketing techniques can be applied:

One segment of customers shops frequently but spends less; they probably buy low-priced products regularly. Another segment has both a low shopping frequency and low spending. The last segment, on the other hand, consists of customers who spend a lot but are not frequent buyers.
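
Segment descriptions like these are typically read off a per-cluster profile of the unscaled RFM values. The sketch below shows one way to build such a profile with pandas; the numbers and the `Segment` column are made up for illustration and are not this project's results.

```python
import pandas as pd

# Made-up RFM values with hypothetical cluster labels, for illustration only.
rfm = pd.DataFrame({
    "Recency": [5, 8, 200, 250, 30, 40],
    "Frequency": [20, 25, 1, 2, 2, 3],
    "Monetary": [300, 350, 50, 80, 5000, 6000],
    "Segment": [0, 0, 1, 1, 2, 2],
})

# Average RFM values and customer count per segment.
profile = rfm.groupby("Segment").agg(
    Customers=("Recency", "size"),
    Recency=("Recency", "mean"),
    Frequency=("Frequency", "mean"),
    Monetary=("Monetary", "mean"),
)
print(profile)
```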