Online retail data set is a data set taken from Machine learning Repository website (http://archive.ics.uci.edu/ml/datasets/online+retail).The context of the data set is to predict customer segmentation. With the help of this data we will learn about some of the basic marketing analytical skills. We will create our own RFM model (Recency, frequency, monetary value), perform K- Mean clustering and make prediction about customer loyalty.
Details about Data Set:
The data source comprises of data set the details of which are as follows;
InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation.
StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product
Description: Product (item) name. Nominal.
Quantity: The quantities of each product (item) per transaction. Numeric.
InvoiceDate: Invice Date and time. Numeric, the day and time when each transaction was generated.
UnitPrice: Unit price. Numeric, Product price per unit in sterling.
CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
Country: Country name. Nominal, the name of the country where each customer resides.
Project Description
In this project we will be performing tasks that are normally performed in while conducting marketing analytical research. Marketing concepts such as customer segmentation, calculating the Recency, frequency, monetary value in order to identify customer loyalty. We will use RFM model in order to calculate the recency, frequency, monetary value and Customer Segmentation.
#Import all the required librariesimportnumpyasnpimportpandasaspdimportseabornassnsimportmatplotlib.pyplotasplt%matplotlibinlinesns.set()
Data Exploration
# Read 'Data.csv' into a DataFrame named datadata=pd.read_csv('C:/Users/sabih/OneDrive/Desktop/OnlineRetail.csv', encoding='unicode_escape' )
#(A parameter unicode_escape has been used to avoid any error that might have occured due to certaininvalid character)# Examine the head of the DataFramedata.head()
# Check the top ten countries in the dataset with highest transactionsdata.Country.value_counts(normalize=True).head(10).mul(100).round(1).astype(str) +'%'
United Kingdom 91.4%
Germany 1.8%
France 1.6%
EIRE 1.5%
Spain 0.5%
Netherlands 0.4%
Belgium 0.4%
Switzerland 0.4%
Portugal 0.3%
Australia 0.2%
Name: Country, dtype: object
90% of records belong to the sales are from United Kingdom
# Examine the shape of the DataFrameprint(data.shape)
(541909, 8)
# Print the data types of datasetprint(data.dtypes)
Since we see repitation in the country column as each customer ID and description has been linked to their respective country hence we will call out customer distribution in terms of country
# Dropping the duplicate valuescountry_data=data[['Country','CustomerID']].drop_duplicates()
#Customer count in respect to their specific countrycountry_data.groupby(['Country'])['CustomerID'].aggregate('count').reset_index().sort_values('CustomerID', ascending=False)
Since we have a large majority of data from UK hence we will focus on the UK data and drop the remaining countries data
#Customers only from United Kingdomdata=data.query("Country=='United Kingdom'").reset_index(drop=True)
#Check for missing values in the datasetdata.isnull().sum(axis=0)
print(data)
InvoiceNo StockCode Description Quantity \
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
1 536365 71053 WHITE METAL LANTERN 6
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
... ... ... ... ...
495473 581585 22466 FAIRY TALE COTTAGE NIGHT LIGHT 12
495474 581586 22061 LARGE CAKE STAND HANGING STRAWBERY 8
495475 581586 23275 SET OF 3 HANGING OWLS OLLIE BEAK 24
495476 581586 21217 RED RETROSPOT ROUND CAKE TINS 24
495477 581586 20685 DOORMAT RED RETROSPOT 10
InvoiceDate UnitPrice CustomerID Country
0 12/1/2010 8:26 2.55 17850.0 United Kingdom
1 12/1/2010 8:26 3.39 17850.0 United Kingdom
2 12/1/2010 8:26 2.75 17850.0 United Kingdom
3 12/1/2010 8:26 3.39 17850.0 United Kingdom
4 12/1/2010 8:26 3.39 17850.0 United Kingdom
... ... ... ... ...
495473 12/9/2011 12:31 1.95 15804.0 United Kingdom
495474 12/9/2011 12:49 2.95 13113.0 United Kingdom
495475 12/9/2011 12:49 1.25 13113.0 United Kingdom
495476 12/9/2011 12:49 8.95 13113.0 United Kingdom
495477 12/9/2011 12:49 7.08 13113.0 United Kingdom
[495478 rows x 8 columns]
#Removing the missing values from CustomerID columndata=data[pd.notnull(data['CustomerID'])]
#Checking whether the Quantity column has a negative value in itdata.Quantity.min()
-80995
data.UnitPrice.min()
0.0
#Checking whether the Quantity column has a negative value in itdata=data[(data['Quantity']>0) & (data['UnitPrice']>0)]
data.describe()
# Bar plotplt.figure(figsize=(8,8))
plt.title("Monthly Sales", fontsize=20)
sns.barplot(monthly_sales.index, monthly_sales['Quantity'])
<matplotlib.axes._subplots.AxesSubplot at 0x135e4804400>
Majority of Sales occur in betwene the month of September and December Saint Nicholas Day (Christian) ,Fiesta of Our Lady of Guadalupe (Mexican), St. Lucia Day (Swedish), Hanukkah (Jewish), Christmas Day (Christian), Three Kings Day/Epiphany (Christian)
We observe from the above results that average recency of the customers is almost 92 days, an average customer purchases a product 90 times and spending an average of 1863.91 unitprice.
#Recency distribution plotx=RFM_data['Recency']
ax=sns.distplot(x)
#Frequency distribution plot, taking observations which have frequency less than 1000x=RFM_data.query('Frequency < 1000')['Frequency']
ax=sns.distplot(x)
#Monateray distribution plot, taking observations which have monetary value less than 10000x=RFM_data.query('Monetary < 10000')['Monetary']
ax=sns.distplot(x)
Customer Segmentation
Market segmentation is the activity of dividing a broad consumer or business market, normally consisting of existing and potential customers, into sub-groups of consumers based on some type of shared characteristics
#Now we will create : RFMGroup and RFMScoreRFM_data['RFM_Group'] =RFM_data['R_score'].astype(str) +RFM_data['F_score'].astype(str) +RFM_data['M_score'].astype(str)
#ScoreRFM_data['RFM_Score'] =RFM_data[['R_score','F_score','M_score']].sum(axis=1)
RFM_data.head()