Market Basket Analysis is a data mining technique used to discover associations between items in transactional data. It helps retailers understand the purchasing behavior of customers, which can be used for product placement, cross-selling, and targeted marketing.
The dataset used for this analysis is the Online Retail Data Set from the UCI Machine Learning Repository. It contains transactional data from a UK-based online retail store.
- Source: UCI Machine Learning Repository: Online Retail Data Set
- Size: 541,909 rows (one per invoice line item; a single transaction, identified by InvoiceNo, usually spans several rows)
- Columns: InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country
- Data Collection: Acquire a comprehensive dataset of retail transactions.
- Data Preprocessing: Clean and prepare the data for analysis, including handling missing values and normalizing data.
- Implementation of Mining Algorithms: Implement the Apriori and FP-Growth algorithms to extract frequent itemsets from the transaction data.
- Analysis of Association Rules: Generate and evaluate association rules from frequent itemsets, focusing on metrics like support, confidence, and lift.
- Visualization: Develop visualizations to present the findings effectively, using heatmaps, network graphs, and bar charts.
- Optimization and Testing: Optimize the parameters of the mining algorithms for better performance and accuracy, and test the system with different subsets of data to evaluate its robustness.
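The Data Preprocessing step above can be pictured with a small, library-free sketch that groups line items into one basket per invoice. The sample rows are made up, and the cleaning rules (skip blank descriptions, cancelled invoices with a "C" prefix, and non-positive quantities) are assumptions for illustration; they are not necessarily what src/preprocessing.py does.

```python
import csv
import io
from collections import defaultdict

# Tiny in-memory stand-in for data/OnlineRetail.csv (made-up rows, same columns).
SAMPLE_CSV = """InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
1001,A1,TEA CUP ,2,2011-01-04 10:00,1.25,17850,United Kingdom
1001,A2,SAUCER,2,2011-01-04 10:00,0.85,17850,United Kingdom
1002,A1,TEA CUP,1,2011-01-04 11:30,1.25,,United Kingdom
1002,A3,,1,2011-01-04 11:30,2.10,,United Kingdom
C1003,A2,SAUCER,-2,2011-01-05 09:15,0.85,17850,United Kingdom
"""


def build_baskets(csv_file):
    """Group cleaned line items into one set of item descriptions per invoice."""
    baskets = defaultdict(set)
    for row in csv.DictReader(csv_file):
        description = row["Description"].strip()
        # Assumed cleaning rules: skip rows without a description,
        # cancelled invoices (prefix "C"), and non-positive quantities.
        if not description:
            continue
        if row["InvoiceNo"].startswith("C"):
            continue
        if int(row["Quantity"]) <= 0:
            continue
        baskets[row["InvoiceNo"]].add(description)
    return dict(baskets)


baskets = build_baskets(io.StringIO(SAMPLE_CSV))
for invoice, items in sorted(baskets.items()):
    print(invoice, sorted(items))
```

Each resulting basket is a set of item descriptions, which is the shape that frequent-itemset mining expects as input.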
MarketBasketAnalysis/
├── data/
│ ├── OnlineRetail.csv
│ └── Processed_OnlineRetail.csv
├── src/
│ ├── preprocessing.py
│ ├── apriori_analysis.py
│ ├── fpgrowth_analysis.py
│ ├── analyze_rules.py
│ └── analyze_fpgrowth_rules.py
├── results/
│ ├── apriori_optimization_results.csv
│ ├── fpgrowth_optimization_results.csv
│ ├── apriori_rules.csv
│ ├── fpgrowth_rules.csv
│ ├── apriori_frequent_itemsets.csv
│ └── fpgrowth_frequent_itemsets.csv
└── README.md
To install the required dependencies, run:
pip install pandas mlxtend seaborn networkx matplotlib
- Preprocess the Data:
python src/preprocessing.py
- Run Apriori Analysis:
python src/apriori_analysis.py
- Run FP-Growth Analysis:
python src/fpgrowth_analysis.py
- Analyze the Rules Generated by Apriori:
python src/analyze_rules.py
- Analyze the Rules Generated by FP-Growth:
python src/analyze_fpgrowth_rules.py
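The analysis scripts themselves are not reproduced in this README. As a rough illustration of the level-wise search that Apriori performs, here is a minimal pure-Python sketch on toy transactions. The function name `apriori_frequent_itemsets` and the sample data are hypothetical, and the project scripts may well rely on a library implementation instead.

```python
from itertools import combinations


def apriori_frequent_itemsets(transactions, min_support):
    """Return {frozenset(items): support} for all itemsets meeting min_support."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        # Join step: union pairs of frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]
frequent = apriori_frequent_itemsets(transactions, min_support=0.5)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is what distinguishes Apriori from brute-force counting: a candidate is only counted against the data if all of its smaller subsets were already found to be frequent.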
- Histograms for Support, Confidence, Lift: distribution of key metrics across all generated rules.
- Bar Charts for Top Association Rules: top rules ranked by metrics such as lift, confidence, and support.
- Heatmaps for Co-occurrence: how often items appear together.
- Network Graphs for Association Rules: items are nodes and association rules are edges, visualizing relationships between items.
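To make the data behind the co-occurrence heatmap concrete, the sketch below counts item pairs in a few made-up baskets and arranges the counts as a symmetric matrix. It also builds a weighted edge list; note that this uses plain co-occurrence as a simpler stand-in, whereas the network graphs described above draw association rules as edges. Plotting is left out, and all names here are illustrative.

```python
from collections import Counter
from itertools import combinations

baskets = [
    {"tea cup", "saucer"},
    {"tea cup", "saucer", "teapot"},
    {"tea cup", "teapot"},
]

# Count how many baskets contain each unordered pair of items.
pair_counts = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        pair_counts[(a, b)] += 1

# Symmetric co-occurrence matrix: the values a heatmap would colour.
items = sorted({item for basket in baskets for item in basket})
matrix = [
    [pair_counts[tuple(sorted((row, col)))] if row != col else 0 for col in items]
    for row in items
]

# Weighted edge list: one (item, item, count) triple per co-occurring pair.
edges = [(a, b, count) for (a, b), count in sorted(pair_counts.items())]

print(items)
for row in matrix:
    print(row)
print(edges)
```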
python src/analyze_rules.py
First few rows of the rules:
antecedents ... zhangs_metric
0 frozenset({'6 RIBBONS RUSTIC CHARM'}) ... 0.683153
1 frozenset({"PAPER CHAIN KIT 50'S CHRISTMAS "}) ... 0.936214
2 frozenset({'60 CAKE CASES VINTAGE CHRISTMAS'}) ... 0.914919
3 frozenset({'PAPER CHAIN KIT VINTAGE CHRISTMAS'}) ... 0.948911
4 frozenset({'60 CAKE CASES VINTAGE CHRISTMAS'}) ... 0.940866
[5 rows x 10 columns]
Basic statistics of the rules:
support confidence lift
count 1283.000000 1283.000000 1283.000000
mean 0.013034 0.426733 12.838750
std 0.003414 0.168999 8.045014
min 0.010022 0.200758 2.212530
25% 0.010840 0.279964 8.145205
50% 0.011822 0.391172 11.150338
75% 0.013663 0.544133 15.220856
max 0.033748 0.947955 64.070404
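For readers unfamiliar with the three metrics summarized above, this small sketch computes them for a single rule on toy data. Support is the share of transactions containing both sides of the rule, confidence is that count divided by the count of the antecedent, and lift is confidence divided by the consequent's overall share. The helper name `rule_metrics` and the transactions are made up for illustration.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Compute support, confidence, and lift for the rule antecedent -> consequent.

    Assumes the antecedent and the consequent each occur in at least one transaction.
    """
    n = len(transactions)
    a, c = set(antecedent), set(consequent)
    count_a = sum(a <= t for t in transactions)
    count_c = sum(c <= t for t in transactions)
    count_both = sum((a | c) <= t for t in transactions)
    support = count_both / n
    confidence = count_both / count_a
    lift = confidence / (count_c / n)
    return support, confidence, lift


transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"butter"},
]
support, confidence, lift = rule_metrics(transactions, {"butter"}, {"bread"})
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 means the two sides appear together more often than they would if they were independent, which is why the rules in the sample output, with a minimum lift above 2, all describe positive associations.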
- Optimized Parameters for Apriori: based on runtime and the number of rules, min_support of 0.02 and min_confidence of 0.3 provide a good balance.
- Optimized Parameters for FP-Growth: min_support of 0.03 and min_confidence of 0.4 are optimal.
- Comparative Performance:
  - FP-Growth generally runs faster than Apriori for lower min_support values.
  - Apriori tends to generate more frequent itemsets for the same parameter values compared to FP-Growth.
  - Both algorithms show that higher min_confidence values reduce the number of rules, but Apriori generates slightly more rules.
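A parameter sweep of the kind summarized above can be sketched as a small grid search that records the rule count and runtime for each threshold pair. This is only an illustration on toy data: `count_pair_rules` is a hypothetical helper that considers rules between single items, and it is not the procedure that produced the optimization results files.

```python
import time
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
    {"butter", "milk"},
]


def count_pair_rules(transactions, min_support, min_confidence):
    """Count rules A -> B between single items that pass both thresholds."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    item_count = {i: sum(i in t for t in transactions) for i in items}
    rules = 0
    for a, b in combinations(items, 2):
        both = sum(a in t and b in t for t in transactions)
        if both / n < min_support:
            continue
        # Each frequent pair yields two candidate rules, one per direction.
        rules += int(both / item_count[a] >= min_confidence)
        rules += int(both / item_count[b] >= min_confidence)
    return rules


results = []
for min_support in (0.1, 0.5):
    for min_confidence in (0.3, 0.7):
        start = time.perf_counter()
        n_rules = count_pair_rules(transactions, min_support, min_confidence)
        runtime = time.perf_counter() - start
        results.append((min_support, min_confidence, n_rules, runtime))

for min_support, min_confidence, n_rules, runtime in results:
    print(f"min_support={min_support} min_confidence={min_confidence} rules={n_rules}")
```

Raising either threshold can only shrink the rule set, so the sweep makes the trade-off between coverage and rule quality visible at a glance.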
This project demonstrates the application of Market Basket Analysis using Apriori and FP-Growth algorithms. By optimizing and analyzing the rules, valuable insights into customer purchasing behavior can be derived, aiding in better decision-making for product placement, cross-selling, and targeted marketing.
If you have any issues or need any help running the code, please reach out!