The objective of this project is to analyze customer purchasing behavior to enhance strategic decision-making and operational efficiency for an online retail store. By segmenting customers based on their purchasing patterns, the project aims to provide insights for targeted marketing, personalized customer interactions, and optimized business strategies.
The primary dataset used for this analysis is the OnlineRetail.csv file, containing transactional data from an online retail store.
The dataset OnlineRetail.csv consists of 541,909 entries and includes the following columns:
- InvoiceNo: Unique identifier for each transaction.
- StockCode: Product identifier.
- Description: Name or description of the product.
- Quantity: Number of products purchased per transaction.
- InvoiceDate: Date and time of the transaction.
- UnitPrice: Price per unit of the product.
- CustomerID: Unique identifier for each customer.
- Country: Country or region where the customer resides.
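The columns above can be read straight into a pandas DataFrame. The sketch below is self-contained: it builds a two-row sample in the same layout (the row values are hypothetical) and reads it back; for the real file you would pass the path `"OnlineRetail.csv"` instead, typically with `encoding="ISO-8859-1"` because some product descriptions are not UTF-8.

```python
import pandas as pd
from io import StringIO

# Two hypothetical rows in the column layout described above.
sample = StringIO(
    "InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country\n"
    "536365,85123A,white hanging heart t-light holder,6,2010-12-01 08:26,2.55,17850,United Kingdom\n"
    "536366,71053,white metal lantern,6,2010-12-01 08:28,3.39,17850,United Kingdom\n"
)

# For the real data: pd.read_csv("OnlineRetail.csv", encoding="ISO-8859-1")
df = pd.read_csv(sample)
print(df.columns.tolist())
```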
- Python: data cleaning and analysis
- Jupyter Notebook: interactive data analysis and visualization
Below are the links for details and commands (if required) to install the necessary Python packages:
- pandas: Go to Pandas Installation or use command:
pip install pandas
- numpy: Go to NumPy Installation or use command:
pip install numpy
- matplotlib: Go to Matplotlib Installation or use command:
pip install matplotlib
- seaborn: Go to Seaborn Installation or use command:
pip install seaborn
- scikit-learn: Go to Scikit-Learn Installation or use command:
pip install scikit-learn
- yellowbrick: Go to Yellowbrick Installation or use command:
pip install yellowbrick
Exploratory Data Analysis (EDA) involved exploring the transactional data to answer key questions, such as:
- What are the overall sales trends?
- How do sales vary by country and product?
- What are the peak sales periods?
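The three EDA questions above reduce to a couple of groupby aggregations. A minimal sketch, using a hypothetical mini-frame in place of the cleaned transactions:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the cleaned transactions.
df = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(
        ["2011-01-05", "2011-01-20", "2011-02-03", "2011-02-25"]),
    "Country": ["United Kingdom", "France", "United Kingdom", "Germany"],
    "Quantity": [6, 4, 10, 2],
    "UnitPrice": [2.55, 3.10, 1.25, 7.95],
})
df["TotalPrice"] = df["Quantity"] * df["UnitPrice"]

# Overall trend: revenue per calendar month; peak periods fall out of this directly.
monthly = df.groupby(df["InvoiceDate"].dt.to_period("M"))["TotalPrice"].sum()

# Sales by country, highest first.
by_country = df.groupby("Country")["TotalPrice"].sum().sort_values(ascending=False)

print(monthly)
print(by_country)
```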
a. Handling Missing Values: Removed records with missing CustomerID or Description.
b. Removing Duplicates: Eliminated duplicate entries to ensure unique transactions.
c. Standardizing Text Data: Converted product descriptions to lowercase and trimmed whitespace.
d. Removing Outliers: Used the Interquartile Range (IQR) method to identify and remove outliers in fields like Quantity, UnitPrice, and TotalPrice.
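Steps a-d can be sketched as a single cleaning function. This is an illustrative version, not the exact project pipeline; the 1.5 x IQR fence is the conventional outlier cutoff.

```python
import pandas as pd

def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the cleaning steps: missing values, duplicates, text, outliers."""
    out = df.dropna(subset=["CustomerID", "Description"])            # a. missing values
    out = out.drop_duplicates().copy()                               # b. duplicates
    out["Description"] = out["Description"].str.lower().str.strip()  # c. text standardization
    for col in ["Quantity", "UnitPrice"]:                            # d. IQR outlier removal
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out = out[out[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
    return out
```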
a. Standardization of Product Descriptions and Stock Codes: Mapped unique stock codes and descriptions to ensure consistency.
b. Feature Engineering: Created a TotalPrice feature by multiplying Quantity by UnitPrice.
a. Invoice Date Conversion: Converted InvoiceDate to datetime format.
b. Filtering Data by Date: Excluded transactions from incomplete periods for accurate analysis.
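The TotalPrice feature, the date conversion, and the date filter are each one line of pandas. A sketch on a hypothetical two-row frame (the cutoff date is illustrative):

```python
import pandas as pd

# Hypothetical two-row frame; values are illustrative.
df = pd.DataFrame({
    "Quantity": [6, 4],
    "UnitPrice": [2.55, 3.10],
    "InvoiceDate": ["12/01/2010 08:26", "12/09/2011 09:15"],
})

# Feature engineering: revenue per line item.
df["TotalPrice"] = df["Quantity"] * df["UnitPrice"]

# Convert InvoiceDate strings to proper datetimes.
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"], format="%m/%d/%Y %H:%M")

# Drop an incomplete final period, e.g. everything from December 2011 onward.
df = df[df["InvoiceDate"] < pd.Timestamp("2011-12-01")]
print(df)
```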
- Suitability for Customer Segmentation:
  a. Simplicity and Efficiency: Effective for large datasets with numerical attributes.
  b. Scalability: Handles extensive transactional data efficiently.
- Data Characteristics:
  a. Numerical Data Handling: Ideal for metrics like TotalPrice, Frequency, and Recency.
  b. Standardization Ready: Performs best on standardized data, which the preprocessing steps provide.
- Analytical Goals:
  a. Customer Insights: Identifies actionable customer segments.
  b. RFM Analysis: Utilizes Recency, Frequency, and Monetary metrics for segmentation.
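Putting the pieces together, the RFM table plus K-Means fits in a few lines of scikit-learn. A hedged sketch on hypothetical transactions for four customers (the snapshot date, cluster count, and seed are illustrative choices, not the project's tuned values):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical transactions for four customers.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 3, 4],
    "InvoiceNo": ["A1", "A2", "B1", "C1", "C2", "C3", "D1"],
    "InvoiceDate": pd.to_datetime(
        ["2011-11-01", "2011-12-01", "2011-06-15",
         "2011-11-20", "2011-11-25", "2011-12-05", "2011-01-10"]),
    "TotalPrice": [50.0, 70.0, 20.0, 200.0, 150.0, 180.0, 5.0],
})

# RFM: days since last purchase, distinct invoices, total spend.
snapshot = tx["InvoiceDate"].max() + pd.Timedelta(days=1)
rfm = tx.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    Frequency=("InvoiceNo", "nunique"),
    Monetary=("TotalPrice", "sum"),
)

# Standardize so no single metric dominates the distance computation.
X = StandardScaler().fit_transform(rfm)

# n_init=10 re-runs K-Means from several random centroid seeds for robustness.
km = KMeans(n_clusters=2, n_init=10, random_state=42)
rfm["Segment"] = km.fit_predict(X)
print(rfm)
```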
- Data Distribution and Scale: Assumes normalized numerical data with equal variance across features.
- Cluster Assumptions: Assumes spherical clusters with similar density.
- Independence of Observations: Treats each transaction or customer record independently.
- Algorithm-Specific Assumptions: Relies on multiple initializations for robust clustering.
- Silhouette Score: Assesses cluster separation and cohesion.
- Davies-Bouldin Index: Measures average similarity between clusters.
- Calinski-Harabasz Index: Evaluates variance ratio between clusters.
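All three metrics are available in `sklearn.metrics`. A minimal sketch on synthetic blobs standing in for the scaled RFM matrix (the data and cluster count here are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

# Synthetic stand-in for the scaled customer-feature matrix.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

print(f"Silhouette:        {silhouette_score(X, labels):.3f}")       # higher is better, max 1
print(f"Davies-Bouldin:    {davies_bouldin_score(X, labels):.3f}")   # lower is better
print(f"Calinski-Harabasz: {calinski_harabasz_score(X, labels):.1f}")  # higher is better
```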
Customer Segment Distribution Analysis based on RFM:
Notable Segments:
- 24.00% in the Dormant segment.
- 14.90% in the Top Customers segment.
- 18.20% in the Faithful Customers segment.
- Silhouette Score: 0.565 (indicating moderate cluster separation).
- Davies-Bouldin Index: 0.639 (indicating good cluster distinction).
- Calinski-Harabasz Index: 3333.416 (indicating well-defined clusters).
- Targeted Marketing: Use segmentation insights to tailor marketing campaigns.
- Inventory Management: Optimize inventory based on purchasing trends.
- Customer Engagement: Enhance engagement strategies for different segments.
- Data Quality: Potential inaccuracies due to missing or incorrect data.
- Cluster Assumptions: Real-world data may not adhere to spherical clusters.
- Model Sensitivity: Initial centroid placement can affect clustering results.
- Integration with Predictive Analytics: Forecast future purchasing behaviors.
- Dynamic Clustering: Implement real-time segmentation.
- Enhanced Personalization: Develop personalized engagement strategies.
- James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning. Springer Texts in Statistics, Springer, 2013.
- Scikit-learn documentation: KMeans, silhouette_score, davies_bouldin_score, calinski_harabasz_score.
- VanderPlas, Jake. Python Data Science Handbook. O'Reilly Media, 2016.