Data mining to discover associations throughout large datasets.

Market Basket Analysis

Project Overview

Market Basket Analysis is a data mining technique used to discover associations between items in transactional data. It helps retailers understand the purchasing behavior of customers, which can be used for product placement, cross-selling, and targeted marketing.

Dataset

The dataset used for this analysis is the Online Retail Data Set from the UCI Machine Learning Repository. It contains transactional data from a UK-based online retail store.
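
As a rough illustration of how the raw rows might be turned into transactions, the sketch below groups line items into one basket per invoice. It assumes the UCI column names InvoiceNo and Description; the cleaning rules (dropping rows without a description and cancelled invoices whose number starts with "C") are assumptions and not necessarily what src/preprocessing.py does. The sample rows are made up.

```python
import csv
import io
from collections import defaultdict


def load_baskets(csv_file):
    """Group line items into one set of item descriptions per invoice."""
    baskets = defaultdict(set)
    for row in csv.DictReader(csv_file):
        invoice = row["InvoiceNo"].strip()
        item = row["Description"].strip()
        # Assumed cleaning rules: skip rows with no description and
        # cancellations (invoice numbers starting with "C").
        if not item or invoice.startswith("C"):
            continue
        baskets[invoice].add(item)
    return dict(baskets)


# Tiny in-memory stand-in for data/OnlineRetail.csv (values are made up).
sample = io.StringIO(
    "InvoiceNo,Description,Quantity\n"
    "536365,WHITE METAL LANTERN,6\n"
    "536365,HAND WARMER UNION JACK,2\n"
    "536366,,1\n"
    "C536379,DISCOUNT,-1\n"
    "536367,WHITE METAL LANTERN,3\n"
)
baskets = load_baskets(sample)
print(baskets)
```

Each basket is a set, so an item that appears on several lines of the same invoice is counted once per transaction.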

Goals

  1. Data Collection: Acquire a comprehensive dataset of retail transactions.
  2. Data Preprocessing: Clean and prepare the data for analysis, including handling missing values and normalizing data.
  3. Implementation of Mining Algorithms: Implement the Apriori and FP-Growth algorithms to extract frequent itemsets from the transaction data.
  4. Analysis of Association Rules: Generate and evaluate association rules from frequent itemsets, focusing on metrics like support, confidence, and lift.
  5. Visualization: Develop visualizations to present the findings effectively, using heatmaps, network graphs, and bar charts.
  6. Optimization and Testing: Optimize the parameters of the mining algorithms for better performance and accuracy, and test the system with different subsets of data to evaluate its robustness.
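
To make the metrics named in goal 4 concrete, here is a minimal sketch that computes support, confidence, and lift for a single rule over a toy list of baskets. The function name rule_metrics and the toy transactions are illustrative only and do not come from the project code.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    a, c = set(antecedent), set(consequent)
    count_a = sum(1 for t in transactions if a <= t)
    count_c = sum(1 for t in transactions if c <= t)
    count_ac = sum(1 for t in transactions if (a | c) <= t)
    if count_a == 0 or count_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    support = count_ac / n              # P(A and C)
    confidence = count_ac / count_a     # P(C | A)
    lift = confidence / (count_c / n)   # P(C | A) / P(C)
    return support, confidence, lift


# Toy baskets (hypothetical data).
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"milk"},
]
support, confidence, lift = rule_metrics(transactions, {"bread"}, {"butter"})
print(support, confidence, lift)
```

A lift above 1 means the consequent shows up more often in baskets containing the antecedent than it does overall; a lift near 1 suggests the two are roughly independent.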

Project Structure

MarketBasketAnalysis/
├── data/
│ ├── OnlineRetail.csv
│ └── Processed_OnlineRetail.csv
├── src/
│ ├── preprocessing.py
│ ├── apriori_analysis.py
│ ├── fpgrowth_analysis.py
│ ├── analyze_rules.py
│ └── analyze_fpgrowth_rules.py
├── results/
│ ├── apriori_optimization_results.csv
│ ├── fpgrowth_optimization_results.csv
│ ├── apriori_rules.csv
│ ├── fpgrowth_rules.csv
│ ├── apriori_frequent_itemsets.csv
│ └── fpgrowth_frequent_itemsets.csv
└── README.md

Dependencies

To install the required dependencies, run:

pip install pandas mlxtend seaborn networkx matplotlib

How to Run the Project

  1. Preprocess the Data:
     python src/preprocessing.py
  2. Run Apriori Analysis:
     python src/apriori_analysis.py
  3. Run FP-Growth Analysis:
     python src/fpgrowth_analysis.py
  4. Analyze the Rules Generated by Apriori:
     python src/analyze_rules.py
  5. Analyze the Rules Generated by FP-Growth:
     python src/analyze_fpgrowth_rules.py

Visualizations

  1. Histograms for Support, Confidence, Lift:
     • Distribution of key metrics across all generated rules.
  2. Bar Charts for Top Association Rules:
     • Top rules based on metrics such as lift, confidence, and support.
  3. Heatmaps for Co-occurrence:
     • Illustrates how often items appear together.
  4. Network Graphs for Association Rules:
     • Items are nodes, and association rules are edges, visualizing relationships between items.
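
The co-occurrence heatmap needs a count of how often each pair of items shares a basket. One possible way to compute those counts with the standard library is sketched below; the function name and toy baskets are hypothetical, and the project scripts may build the matrix differently (for example directly in pandas).

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(transactions):
    """Count, for every unordered item pair, the baskets containing both items."""
    counts = Counter()
    for basket in transactions:
        # Sorting gives each pair a canonical (a, b) order with a < b.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


# Toy baskets (hypothetical data).
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "jam"},
]
pair_counts = cooccurrence_counts(transactions)
for (item_a, item_b), count in sorted(pair_counts.items()):
    print(item_a, item_b, count)
```

The same pair counts could be pivoted into a square matrix for a heatmap, or used as weighted edges between item nodes in a network graph.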

Example Output

python src/analyze_rules.py
First few rows of the rules:
   antecedents                                       ...  zhangs_metric
0  frozenset({'6 RIBBONS RUSTIC CHARM'})             ...  0.683153
1  frozenset({"PAPER CHAIN KIT 50'S CHRISTMAS "})    ...  0.936214
2  frozenset({'60 CAKE CASES VINTAGE CHRISTMAS'})    ...  0.914919
3  frozenset({'PAPER CHAIN KIT VINTAGE CHRISTMAS'})  ...  0.948911
4  frozenset({'60 CAKE CASES VINTAGE CHRISTMAS'})    ...  0.940866

[5 rows x 10 columns]

Basic statistics of the rules:
           support   confidence         lift
count  1283.000000  1283.000000  1283.000000
mean      0.013034     0.426733    12.838750
std       0.003414     0.168999     8.045014
min       0.010022     0.200758     2.212530
25%       0.010840     0.279964     8.145205
50%       0.011822     0.391172    11.150338
75%       0.013663     0.544133    15.220856
max       0.033748     0.947955    64.070404

Results and Insights

  • Optimized Parameters for Apriori:
    • Based on runtime and the number of rules, min_support of 0.02 and min_confidence of 0.3 provide a good balance.
  • Optimized Parameters for FP-Growth:
    • For FP-Growth, min_support of 0.03 and min_confidence of 0.4 gave the best results in the optimization runs (see results/fpgrowth_optimization_results.csv).
  • Comparative Performance:
    • FP-Growth generally runs faster than Apriori for lower min_support values.
    • Apriori tends to generate more frequent itemsets for the same parameter values compared to FP-Growth.
    • Both algorithms show that higher min_confidence values reduce the number of rules, but Apriori generates slightly more rules.
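
To illustrate how min_support drives the number of frequent itemsets, the following is a minimal pure-Python sketch of level-wise Apriori with a small parameter sweep. It is not the project's implementation (the listed dependencies suggest the scripts use mlxtend); the function name, toy baskets, and threshold values are all assumptions chosen for the example.

```python
from itertools import combinations


def apriori_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    n = len(transactions)
    # Level 1 candidates: every single item seen in the data.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Count each candidate and keep those that meet the threshold.
        level = {}
        for candidate in current:
            support = sum(1 for t in transactions if candidate <= t) / n
            if support >= min_support:
                level[candidate] = support
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Toy baskets (hypothetical data).
transactions = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "jam"},
    {"bread", "butter", "milk"},
]

# Sweep min_support and record how many frequent itemsets survive.
sweep = {s: len(apriori_frequent_itemsets(transactions, s)) for s in (0.2, 0.4, 0.6, 0.8)}
for threshold, n_itemsets in sweep.items():
    print(f"min_support={threshold}: {n_itemsets} frequent itemsets")
```

Raising min_support shrinks the result, and because of the prune step it also cuts the number of candidates that have to be counted at each level, which is where most of the runtime goes.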

Conclusion

This project demonstrates the application of Market Basket Analysis using the Apriori and FP-Growth algorithms. Optimizing the mining parameters and analyzing the resulting rules yields insights into customer purchasing behavior that can inform product placement, cross-selling, and targeted marketing.

If you have any issues or need any help running the code, please reach out!