Made withJupyter PyPI license

Market Basket Analysis using Association rule learning

What is Association rule learning?

Association rule learning is a machine learning technique used to identify patterns, associations, and relationships between variables in large datasets. It is often applied in market basket analysis, where the goal is to discover which products are frequently bought together by customers. In market basket analysis, confidence, support, and lift are used to measure the strength of association between items purchased by customers.

  1. Confidence: Confidence in market basket analysis measures the conditional probability of a product Y being purchased given that product X has already been purchased. A high confidence value indicates that customers who bought item X are likely to also buy item Y.

  2. Support: Support measures the frequency of occurrence of an item or a set of items in all transactions. A high support value indicates that the item or set of items is frequently bought together by customers.

  3. Lift: Lift measures the strength of association between two items by comparing the observed frequency of co-occurrence of both items with the expected frequency of co-occurrence under the assumption that the items are independent. A lift value greater than 1 indicates that the items are positively correlated, a lift value of 1 indicates no correlation, and a lift value less than 1 indicates that they are negatively correlated. A high lift value indicates that the two items are likely to be bought together, and the lift value can be used to identify strong association rules between items.

Two approaches of Asscoaite rule learning, known as appriori and FP growth, have been used to do a market basket analysis. My code is very simple to understand and can be extended by more analysis.

## data of this code can be downloaded from the following link:
#https://www.kaggle.com/datasets/heeraldedhia/groceries-dataset/code?datasetId=877335&sortBy=voteCount    
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.frequent_patterns import fpgrowth
#reading data
dataset = pd.read_csv("Groceries_dataset.csv")
#Member_number is unique for each customer. Date is date of the transaction,
#itemDescription is the product bought for this date.
print(dataset.head())
print(dataset.shape)
#################################
#checking missing values
nan_values = dataset.isna().sum()
print(nan_values)
   Member_number        Date   itemDescription
0           1808  21-07-2015    tropical fruit
1           2552  05-01-2015        whole milk
2           2300  19-09-2015         pip fruit
3           1187  12-12-2015  other vegetables
4           3037  01-02-2015        whole milk
(38765, 3)
Member_number      0
Date               0
itemDescription    0
dtype: int64
##Basket analysis: we can see the bought products of clients for everyday
client_basket=dataset.groupby(['Member_number','Date'])['itemDescription'].apply(sum)
print(client_basket)
## we can see the bought products of clients for every day withot client number
clinet_basket2 = [a[1]['itemDescription'].tolist() for a in list(dataset.groupby(['Member_number','Date']))]
print(clinet_basket2[0:10])
Member_number  Date      
1000           15-03-2015           sausagewhole milksemi-finished breadyogurt
               24-06-2014                          whole milkpastrysalty snack
               24-07-2015                           canned beermisc. beverages
               25-11-2015                              sausagehygiene articles
               27-05-2015                               sodapickled vegetables
                                                   ...                        
4999           24-01-2015    tropical fruitberriesother vegetablesyogurtkit...
               26-12-2015                                   bottled waterherbs
5000           09-03-2014                          fruit/vegetable juiceonions
               10-02-2015               sodaroot vegetablessemi-finished bread
               16-11-2014                         bottled beerother vegetables
Name: itemDescription, Length: 14963, dtype: object
[['sausage', 'whole milk', 'semi-finished bread', 'yogurt'], ['whole milk', 'pastry', 'salty snack'], ['canned beer', 'misc. beverages'], ['sausage', 'hygiene articles'], ['soda', 'pickled vegetables'], ['frankfurter', 'curd'], ['sausage', 'whole milk', 'rolls/buns'], ['whole milk', 'soda'], ['beef', 'white bread'], ['frankfurter', 'soda', 'whipped/sour cream']]
#Converting Date into DateTime type
Date=dataset.set_index(['Date'])
Date.index=pd.to_datetime(Date.index, infer_datetime_format= True)
## Data analysis
#The items sold per month, week, and day have been plotted to see the variations and the average level of sale.

fig1 = plt.figure("Figure 1")
ax = plt.axes()
ax.set_facecolor('silver')
Date.resample("D")['itemDescription'].count().plot(figsize=(12,5), grid=True,
color='black').set(xlabel="Date", ylabel="Total Number of Items Sold")
Date.resample("W")['itemDescription'].count().plot(figsize=(12,5), grid=True,
color='blue').set(xlabel="Date", ylabel="Total Number of Items Sold")
Date.resample("M")['itemDescription'].count().plot(figsize=(12,5), grid=True,
color='red',title="Items Sold per month in red, per week in blue and per day in black").set(xlabel="Date", ylabel="Total Number of Items Sold")
[Text(0.5, 0, 'Date'), Text(0, 0.5, 'Total Number of Items Sold')]

png

#The number of customers per month, week, and day has been plotted to see the variations and the average number of customers.

fig2 = plt.figure("Figure 2")
ax = plt.axes()
ax.set_facecolor('silver')
Date.resample('D')['Member_number'].nunique().plot(figsize=(12,5), grid=True,
color='black').set(xlabel="Date", ylabel="Number of customers")
Date.resample('W')['Member_number'].nunique().plot(figsize=(12,5), grid=True,
color='blue').set(xlabel="Date", ylabel="Number of customers")
Date.resample('M')['Member_number'].nunique().plot(figsize=(12,5), grid=True,
color='red',title="Number of customers per month in red, per week in blue and per day in black").set(xlabel="Date", ylabel="Total Number of Items Sold")
[Text(0.5, 0, 'Date'), Text(0, 0.5, 'Total Number of Items Sold')]

png

#The number of sales per customer for every day, every week, and each month:

fig3 = plt.figure("Figure 3")
ax = plt.axes()
day_ratio = Date.resample("D")['itemDescription'].count()/Date.resample('D')['Member_number'].nunique()
day_ratio.plot(figsize=(12,5), grid=True,
color='black').set(xlabel="Date", ylabel="Sale per customer")
week_ratio = Date.resample("W")['itemDescription'].count()/Date.resample('W')['Member_number'].nunique()
week_ratio.plot(figsize=(12,5), grid=True,
color='blue').set(xlabel="Date", ylabel="Sale per customer")
month_ratio = Date.resample("M")['itemDescription'].count()/Date.resample('M')['Member_number'].nunique()
month_ratio.plot(figsize=(12,5), grid=True,
color='red', title = "Sale per customers per month in red, per week in blue and per day in black").set(xlabel="Date", ylabel="Sale per customer")
[Text(0.5, 0, 'Date'), Text(0, 0.5, 'Sale per customer')]

png

#5 best seller items 
Item_distr = dataset.groupby(by = 'itemDescription').size().reset_index(name='Frequency').sort_values(by = 'Frequency',ascending=False).head(5)
bars = Item_distr["itemDescription"]
height = Item_distr["Frequency"]
x_pos = np.arange(len(bars))
plt.figure(figsize=(16,9))
plt.bar(x_pos, height, color = 'blue')
plt.title("Top 5 Sold Items")
plt.ylabel("Number of sold items")
plt.xticks(x_pos, bars)
plt.show()

png

What is Apriori Algorithm in market basket analysis?

The most well-known algorithm for association rule learning is the Apriori algorithm, which works by identifying frequent itemsets (i.e., combinations of items that occur together frequently) and then generating association rules from those itemsets. An association rule is a statement that indicates the likelihood of one item being purchased given that another item has been purchased.

For example, if customers who buy milk and bread also tend to buy eggs, then an association rule could be "milk and bread imply the purchase of eggs". These rules can be used to provide insights into consumer behavior, to inform marketing strategies, and to make recommendations for complementary products.

##Data preparation and modeling
## Before modeling, the transaction must be one-hot
Transactions = dataset.groupby(['Member_number', 'itemDescription'])['itemDescription'].count().unstack().fillna(0).reset_index()
Transactions.head()
def one_hot_encoder(k):
    if k <= 0:
        return 0
    if k >= 1:
        return 1
Transactions = Transactions.iloc[:, 1:Transactions.shape[1]].applymap(one_hot_encoder)
# Transactions.head()

##associate learning 1: apriori
frequent_items1 = apriori(Transactions, min_support=0.027, use_colnames=True, max_len=3).sort_values(by='support')
frequent_items1.head(10)
results1 = association_rules(frequent_items1, metric="lift", min_threshold=1).sort_values('lift', ascending=False)
results1 = results1[['antecedents', 'consequents', 'support', 'confidence', 'lift']]
results1.head(100)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
antecedents consequents support confidence lift
629 (sausage) (yogurt, rolls/buns) 0.035659 0.173101 1.554717
624 (yogurt, rolls/buns) (sausage) 0.035659 0.320276 1.554717
216 (root vegetables, whole milk) (shopping bags) 0.029246 0.258503 1.536046
221 (shopping bags) (root vegetables, whole milk) 0.029246 0.173780 1.536046
627 (yogurt) (rolls/buns, sausage) 0.035659 0.126020 1.530298
... ... ... ... ... ...
1008 (rolls/buns, whole milk) (sausage) 0.048743 0.272989 1.325167
805 (whole milk) (yogurt, bottled water) 0.040277 0.087906 1.323001
800 (yogurt, bottled water) (whole milk) 0.040277 0.606178 1.323001
725 (rolls/buns, bottled beer) (whole milk) 0.038225 0.605691 1.321939
728 (whole milk) (rolls/buns, bottled beer) 0.038225 0.083427 1.321939

100 rows × 5 columns

What is Frequent Pattern (FP) growth?

Frequent Pattern (FP) growth is a popular algorithm used for mining frequent itemsets in a transactional dataset. The FP-growth algorithm builds a compact data structure, called a frequent pattern tree (FP-tree), to represent the transactional dataset. This data structure enables efficient counting of frequent itemsets by compressing the original dataset into a set of conditional databases. The algorithm recursively constructs the FP-tree by repeatedly finding frequent items and creating conditional databases, which are then recursively processed until no more frequent itemsets can be found.

##associate learning 2: fpgrowth
frequent_items2=fpgrowth(Transactions, min_support=0.027, use_colnames=True, max_len=3).sort_values(by='support')
frequent_items2.head(10)
results2 = association_rules(frequent_items2, metric="lift", min_threshold=1).sort_values('lift', ascending=False)
results2 = results2[['antecedents', 'consequents', 'support', 'confidence', 'lift']]
results2.head(10)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
antecedents consequents support confidence lift
627 (sausage) (yogurt, rolls/buns) 0.035659 0.173101 1.554717
622 (yogurt, rolls/buns) (sausage) 0.035659 0.320276 1.554717
208 (root vegetables, whole milk) (shopping bags) 0.029246 0.258503 1.536046
213 (shopping bags) (root vegetables, whole milk) 0.029246 0.173780 1.536046
624 (rolls/buns, sausage) (yogurt) 0.035659 0.433022 1.530298
625 (yogurt) (rolls/buns, sausage) 0.035659 0.126020 1.530298
695 (sausage) (yogurt, other vegetables) 0.037199 0.180573 1.500795
690 (yogurt, other vegetables) (sausage) 0.037199 0.309168 1.500795
369 (shopping bags) (other vegetables, soda) 0.031042 0.184451 1.485518
364 (other vegetables, soda) (shopping bags) 0.031042 0.250000 1.485518