Market Basket Analysis

Project Overview

Market Basket Analysis is a data mining technique used to discover associations between items in transactional data. It helps retailers understand the purchasing behavior of customers, which can be used for product placement, cross-selling, and targeted marketing.

Dataset

The dataset used for this analysis is the Online Retail Data Set from the UCI Machine Learning Repository. It contains transactional data from a UK-based online retail store.

Source: UCI Machine Learning Repository: Online Retail Data Set
Size: 541,909 rows (each row is one invoice line item, so the number of distinct transactions is smaller)
Columns: InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country
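
Because each row is a single invoice line, rows that share an InvoiceNo have to be grouped into one basket before any mining can happen. The following is a minimal sketch of that grouping using only the standard library. The sample rows and the function name build_baskets are illustrative and are not taken from src/preprocessing.py. It assumes that invoice numbers starting with "C" mark cancellations and that rows with an empty description or a non-positive quantity should be skipped:

```python
import csv
import io
from collections import defaultdict

# Illustrative sample rows using a subset of the dataset's columns.
SAMPLE_CSV = """InvoiceNo,StockCode,Description,Quantity
536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6
536365,71053,WHITE METAL LANTERN,6
536366,22633,HAND WARMER UNION JACK,6
C536379,D,Discount,-1
536367,84879,,32
536367,22745,POPPY'S PLAYHOUSE BEDROOM,6
"""


def build_baskets(csv_file):
    """Group item descriptions by invoice, skipping cancellations and blanks."""
    baskets = defaultdict(set)
    for row in csv.DictReader(csv_file):
        invoice = row["InvoiceNo"].strip()
        description = row["Description"].strip()
        if invoice.startswith("C") or not description:
            continue  # assumption: a "C" prefix marks a cancelled invoice
        if int(row["Quantity"]) <= 0:
            continue  # assumption: non-positive quantities are not purchases
        baskets[invoice].add(description)
    return dict(baskets)


baskets = build_baskets(io.StringIO(SAMPLE_CSV))
for invoice, items in sorted(baskets.items()):
    print(invoice, sorted(items))
```

The same function should work on the real file by passing an open file handle instead of the in-memory string, although the actual preprocessing script may apply different rules.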

Goals

Data Collection: Acquire a comprehensive dataset of retail transactions.
Data Preprocessing: Clean and prepare the data for analysis, including handling missing values and normalizing data.
Implementation of Mining Algorithms: Implement the Apriori and FP-Growth algorithms to extract frequent itemsets from the transaction data.
Analysis of Association Rules: Generate and evaluate association rules from frequent itemsets, focusing on metrics like support, confidence, and lift.
Visualization: Develop visualizations to present the findings effectively, using heatmaps, network graphs, and bar charts.
Optimization and Testing: Optimize the parameters of the mining algorithms for better performance and accuracy, and test the system with different subsets of data to evaluate its robustness.
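
The three metrics named above can be computed by hand on a small example, which helps when sanity-checking the rule files. This is a sketch on made-up baskets, with hypothetical helper names (support, rule_metrics). Support is the fraction of transactions containing an itemset, confidence is support(A and C) / support(A), and lift is confidence / support(C):

```python
# Toy baskets; item names are illustrative only.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
    {"bread", "butter", "jam"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    confidence = both / support(antecedent, transactions)
    lift = confidence / support(consequent, transactions)
    return both, confidence, lift


sup, conf, lift_value = rule_metrics({"bread"}, {"butter"}, transactions)
print(f"support={sup:.2f} confidence={conf:.2f} lift={lift_value:.2f}")
# prints: support=0.60 confidence=0.75 lift=1.25
```

A lift above 1 means the two items appear together more often than they would if they were bought independently.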

Project Structure

MarketBasketAnalysis/
├── data/
│   ├── OnlineRetail.csv
│   └── Processed_OnlineRetail.csv
├── src/
│   ├── preprocessing.py
│   ├── apriori_analysis.py
│   ├── fpgrowth_analysis.py
│   ├── analyze_rules.py
│   └── analyze_fpgrowth_rules.py
├── results/
│   ├── apriori_optimization_results.csv
│   ├── fpgrowth_optimization_results.csv
│   ├── apriori_rules.csv
│   ├── fpgrowth_rules.csv
│   ├── apriori_frequent_itemsets.csv
│   └── fpgrowth_frequent_itemsets.csv
└── README.md

Dependencies

To install the required dependencies, run:

pip install pandas mlxtend seaborn networkx matplotlib

How to Run the Project

Preprocess the Data:

python src/preprocessing.py

Run Apriori Analysis:

python src/apriori_analysis.py

Run FP-Growth Analysis:

python src/fpgrowth_analysis.py

Analyze the Rules Generated by Apriori:

python src/analyze_rules.py

Analyze the Rules Generated by FP-Growth:

python src/analyze_fpgrowth_rules.py

Visualizations

Histograms for Support, Confidence, Lift:

Distribution of key metrics across all generated rules.

Bar Charts for Top Association Rules:

Top rules based on metrics such as lift, confidence, and support.

Heatmaps for Co-occurrence:

Illustrates how often items appear together.

Network Graphs for Association Rules:

Items are nodes, and association rules are edges, visualizing relationships between items.
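
The data behind a co-occurrence heatmap or a network graph is a table of pair counts. As a rough sketch on made-up baskets (the variable names are illustrative, and this is not the plotting code from the analysis scripts), the counts can be built with the standard library and then handed to a plotting library:

```python
from collections import Counter
from itertools import combinations

# Toy baskets; item names are illustrative only.
baskets = [
    {"tea", "sugar", "milk"},
    {"tea", "milk"},
    {"coffee", "sugar"},
    {"tea", "sugar"},
]

pair_counts = Counter()
for basket in baskets:
    # sorted() gives each pair one canonical order, so (milk, tea)
    # and (tea, milk) are counted as the same pair.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Symmetric matrix as nested lists:
# matrix[i][j] = number of baskets containing both items[i] and items[j].
items = sorted(set().union(*baskets))
index = {item: i for i, item in enumerate(items)}
matrix = [[0] * len(items) for _ in items]
for (a, b), count in pair_counts.items():
    matrix[index[a]][index[b]] = count
    matrix[index[b]][index[a]] = count

print(items)
for row in matrix:
    print(row)
```

A matrix like this could be passed to a heatmap function such as seaborn.heatmap, and the pair counts could serve as edge weights in a networkx graph.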

Example Output

python src/analyze_rules.py
First few rows of the rules:
                                        antecedents  ...  zhangs_metric
0             frozenset({'6 RIBBONS RUSTIC CHARM'})  ...       0.683153
1    frozenset({"PAPER CHAIN KIT 50'S CHRISTMAS "})  ...       0.936214
2    frozenset({'60 CAKE CASES VINTAGE CHRISTMAS'})  ...       0.914919
3  frozenset({'PAPER CHAIN KIT VINTAGE CHRISTMAS'})  ...       0.948911
4    frozenset({'60 CAKE CASES VINTAGE CHRISTMAS'})  ...       0.940866

[5 rows x 10 columns]

Basic statistics of the rules:
           support   confidence         lift
count  1283.000000  1283.000000  1283.000000
mean      0.013034     0.426733    12.838750
std       0.003414     0.168999     8.045014
min       0.010022     0.200758     2.212530
25%       0.010840     0.279964     8.145205
50%       0.011822     0.391172    11.150338
75%       0.013663     0.544133    15.220856
max       0.033748     0.947955    64.070404

Results and Insights

Optimized Parameters for Apriori:
- Based on runtime and the number of rules, min_support of 0.02 and min_confidence of 0.3 provide a good balance.
Optimized Parameters for FP-Growth:
- For FP-Growth, min_support of 0.03 and min_confidence of 0.4 are optimal.
Comparative Performance:
- FP-Growth generally runs faster than Apriori for lower min_support values.
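- In these runs, Apriori reported more frequent itemsets than FP-Growth for the same parameter values. Both algorithms are exact and should return identical frequent itemsets for a given min_support, so this gap likely reflects a difference in how the two scripts were configured rather than a property of the algorithms.
- For both algorithms, raising min_confidence reduces the number of rules; Apriori produced slightly more rules in these runs.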
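
To see how min_support drives the number of frequent itemsets, a small level-wise (Apriori-style) search is enough. The sketch below is a toy implementation on made-up baskets and is not the code in src/apriori_analysis.py, which presumably relies on mlxtend. The function name frequent_itemsets is hypothetical:

```python
from itertools import combinations

# Toy baskets; item names are illustrative only.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
    {"bread", "butter", "jam"},
]


def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search; returns {frozenset: support}."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    current = [frozenset([item]) for item in items]
    result = {}
    while current:
        # Count each candidate and keep those that meet the threshold.
        level = {}
        for candidate in current:
            sup = sum(1 for t in transactions if candidate <= t) / n
            if sup >= min_support:
                level[candidate] = sup
        result.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate with an infrequent k-subset (the Apriori property).
        joined = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        current = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, len(c) - 1))
        ]
    return result


for min_support in (0.2, 0.4, 0.6):
    found = frequent_itemsets(transactions, min_support)
    print(f"min_support={min_support}: {len(found)} frequent itemsets")
```

On this toy data the count falls from 12 to 6 to 4 as min_support rises from 0.2 to 0.6. A reference implementation like this can also be used to cross-check the output of a library on a small subset of the data.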

Conclusion

This project demonstrates the application of Market Basket Analysis using the Apriori and FP-Growth algorithms. Tuning the algorithm parameters and analyzing the resulting rules yields insights into customer purchasing behavior that can inform product placement, cross-selling, and targeted marketing.

If you have any issues or need any help running the code, please reach out!

sbangapadang22/MarketBasketAnalysis