Walmart Recruiting - Store Sales Forecasting

Technical assestment | Neaam Hariri

Probelm statment

In this recruiting competition, job-seekers are provided with historical sales data for 45 Walmart stores located in different regions. Each store contains many departments, and participants must project the sales for each department in each store. To add to the challenge, selected holiday markdown events are included in the dataset. These markdowns are known to affect sales, but it is challenging to predict which departments are affected and the extent of the impact.

Source : Link to dataset

Dataset :

stores.csv
features.csv
train.csv
sampleSubmission.csv

table of content

download data
explore data
data cleaning
visualizing
findings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

Loading and exploring the data

#Read CSV file 
df_features = pd.read_csv('features.csv', sep=',')
df_stores = pd.read_csv('stores.csv', sep=',')
df_train = pd.read_csv('train.csv', sep=',')
df_test = pd.read_csv('test.csv', sep=',')

df_features.info()
df_features.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8190 entries, 0 to 8189
Data columns (total 12 columns):
Store           8190 non-null int64
Date            8190 non-null object
Temperature     8190 non-null float64
Fuel_Price      8190 non-null float64
MarkDown1       4032 non-null float64
MarkDown2       2921 non-null float64
MarkDown3       3613 non-null float64
MarkDown4       3464 non-null float64
MarkDown5       4050 non-null float64
CPI             7605 non-null float64
Unemployment    7605 non-null float64
IsHoliday       8190 non-null bool
dtypes: bool(1), float64(9), int64(1), object(1)
memory usage: 711.9+ KB

	Store	Date	Temperature	Fuel_Price	MarkDown1	MarkDown2	MarkDown3	MarkDown4	MarkDown5	CPI	Unemployment	IsHoliday
0	1	2010-02-05	42.31	2.572	NaN	NaN	NaN	NaN	NaN	211.096358	8.106	False
1	1	2010-02-12	38.51	2.548	NaN	NaN	NaN	NaN	NaN	211.242170	8.106	True
2	1	2010-02-19	39.93	2.514	NaN	NaN	NaN	NaN	NaN	211.289143	8.106	False
3	1	2010-02-26	46.63	2.561	NaN	NaN	NaN	NaN	NaN	211.319643	8.106	False
4	1	2010-03-05	46.50	2.625	NaN	NaN	NaN	NaN	NaN	211.350143	8.106	False

df_stores.info()
df_stores.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45 entries, 0 to 44
Data columns (total 3 columns):
Store    45 non-null int64
Type     45 non-null object
Size     45 non-null int64
dtypes: int64(2), object(1)
memory usage: 1.1+ KB

	Store	Type	Size
0	1	A	151315
1	2	A	202307
2	3	B	37392
3	4	A	205863
4	5	B	34875

df_train.info()
df_train.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 421570 entries, 0 to 421569
Data columns (total 5 columns):
Store           421570 non-null int64
Dept            421570 non-null int64
Date            421570 non-null object
Weekly_Sales    421570 non-null float64
IsHoliday       421570 non-null bool
dtypes: bool(1), float64(1), int64(2), object(1)
memory usage: 13.3+ MB

	Store	Dept	Date	Weekly_Sales	IsHoliday
0	1	1	2010-02-05	24924.50	False
1	1	1	2010-02-12	46039.49	True
2	1	1	2010-02-19	41595.55	False
3	1	1	2010-02-26	19403.54	False
4	1	1	2010-03-05	21827.90	False

df_test.info()
df_test.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 115064 entries, 0 to 115063
Data columns (total 4 columns):
Store        115064 non-null int64
Dept         115064 non-null int64
Date         115064 non-null object
IsHoliday    115064 non-null bool
dtypes: bool(1), int64(2), object(1)
memory usage: 2.7+ MB

	Store	Dept	Date	IsHoliday
0	1	1	2012-11-02	False
1	1	1	2012-11-09	False
2	1	1	2012-11-16	False
3	1	1	2012-11-23	True
4	1	1	2012-11-30	False

Converting to appropriate datatype, checking null values

# converting date to datetime object 
#df_features['Date'] = df_features['Date'].apply(pd.to_datetime)
print(f"Minimum Date: {min(df_features['Date'])}")
print(f"\nMaximum Date: {max(df_features['Date'])}")

Minimum Date: 2010-02-05

Maximum Date: 2013-07-26

#check null values 
df_features.isnull().sum()

Store              0
Date               0
Temperature        0
Fuel_Price         0
MarkDown1       4158
MarkDown2       5269
MarkDown3       4577
MarkDown4       4726
MarkDown5       4140
CPI              585
Unemployment     585
IsHoliday          0
dtype: int64

df_features.fillna(0,inplace=True)
df_features.isnull().sum()

Store           0
Date            0
Temperature     0
Fuel_Price      0
MarkDown1       0
MarkDown2       0
MarkDown3       0
MarkDown4       0
MarkDown5       0
CPI             0
Unemployment    0
IsHoliday       0
dtype: int64

df_features[["Temperature"]] = df_features[["Temperature"]].fillna(df_features[["Temperature"]].mean())
df_features[["Fuel_Price"]] = df_features[["Fuel_Price"]].fillna(df_features[["Fuel_Price"]].mean())
df_features[["CPI"]] = df_features[["CPI"]].fillna(df_features[["CPI"]].mean())

df_features.isnull().sum()

Store           0
Date            0
Temperature     0
Fuel_Price      0
MarkDown1       0
MarkDown2       0
MarkDown3       0
MarkDown4       0
MarkDown5       0
CPI             0
Unemployment    0
IsHoliday       0
dtype: int64

print(f"Minimum Date: {min(df_train['Date'])}")
print(f"\nMaximum Date: {max(df_train['Date'])}")

Minimum Date: 2010-02-05

Maximum Date: 2012-10-26

df_train.isnull().sum()

Store           0
Dept            0
Date            0
Weekly_Sales    0
IsHoliday       0
dtype: int64

labels = 'type A','type B','type C'
sizes = [(df_stores.Type.value_counts()['A']/(45))*100,(df_stores.Type.value_counts()['B']/(45))*100,(df_stores.Type.value_counts()['C']/(45))*100]
colors = ['green', 'yellow', 'blue']
explode = (0.1, 0, 0)  

plt.pie(sizes, explode=explode, labels=labels, colors=colors,
autopct='%.1f%%', shadow=False, startangle=90)
plt.legend(labels, loc="best")
plt.axis('equal')
plt.show()

df_stores.Type.value_counts()

A    22
B    17
C     6
Name: Type, dtype: int64

Merging stores, features, train in one dateframe

df_features_stores = df_features.merge(df_stores, how='inner', on='Store')
df = df_train.merge(df_features_stores, how='inner', on=['Store','Date','IsHoliday'])

df['Date'] = pd.to_datetime(df['Date'])



df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 421570 entries, 0 to 421569
Data columns (total 16 columns):
Store           421570 non-null int64
Dept            421570 non-null int64
Date            421570 non-null datetime64[ns]
Weekly_Sales    421570 non-null float64
IsHoliday       421570 non-null bool
Temperature     421570 non-null float64
Fuel_Price      421570 non-null float64
MarkDown1       421570 non-null float64
MarkDown2       421570 non-null float64
MarkDown3       421570 non-null float64
MarkDown4       421570 non-null float64
MarkDown5       421570 non-null float64
CPI             421570 non-null float64
Unemployment    421570 non-null float64
Type            421570 non-null object
Size            421570 non-null int64
dtypes: bool(1), datetime64[ns](1), float64(10), int64(3), object(1)
memory usage: 51.9+ MB

df["year"] = df["Date"].dt.year
df["month"] = df["Date"].dt.month
df["week"] = df["Date"].dt.month

sales visualizations

sales_date = df['Weekly_Sales'].groupby(df['Date']).mean()

plt.figure(figsize=(16,5))
sns.lineplot(weekly_sales_mean.index, weekly_sales_mean.values)

plt.grid()
plt.title('sales over time ', fontsize=18)
plt.ylabel('Sales', fontsize=16)
plt.xlabel('Date', fontsize=16)
plt.show()

sales_store = df['Weekly_Sales'].groupby(df['Store']).mean()

plt.figure(figsize=(16,5))
sns.barplot(sales_store.index, sales_store.values)

plt.grid()
plt.title('sales per store ', fontsize=18)
plt.ylabel('Sales', fontsize=16)
plt.xlabel('Store', fontsize=16)
plt.show()

sales_store.min()

5053.415812868089

sales_store.max()

29508.30159193265

sales_store.max() - sales_store.min()

24454.88577906456

df.groupby("Type")["Weekly_Sales"].mean()

Type
A    20099.568043
B    12237.075977
C     9519.532538
Name: Weekly_Sales, dtype: float64

# # Figure Size
# fig = plt.figure(figsize =(10, 7))
# # Horizontal Bar Plot
# plt.bar(df.Store, df.Weekly_Sales)
# # Show Plot
# plt.show()

hol = df['Weekly_Sales'].groupby(df['Size']).mean()

plt.figure(figsize=(16,5))
sns.lineplot(hol.index, hol.values)

plt.grid()
plt.title('sales - size ', fontsize=18)
plt.ylabel('Sales', fontsize=16)
plt.xlabel('size', fontsize=16)
plt.show()

hol = df.groupby("Fuel_Price")["Weekly_Sales"].mean().reset_index()
print(hol["Fuel_Price"].corr(hol["Weekly_Sales"]))
hol.plot.scatter("Fuel_Price", "Weekly_Sales", title="Weekly_Sales mean vs fuel price")

0.015810449498679695





<matplotlib.axes._subplots.AxesSubplot at 0x1ea083c1da0>

data = pd.concat([df['Store'], df['Weekly_Sales'], df['IsHoliday']], axis=1)
f, ax = plt.subplots(figsize=(23, 8))
fig = sns.boxplot(x='Store', y='Weekly_Sales', data=data, showfliers=False, hue="IsHoliday")

We can see that holidays have higher weekly sales than non holidays

CPI_corr = df.groupby("CPI")["Weekly_Sales"].mean().reset_index()
print(CPI_corr["CPI"].corr(hol["Weekly_Sales"]))
CPI_corr.plot.scatter("CPI", "Weekly_Sales", title="Weekly_Sales mean vs CPI")

-0.018861475583658933





<matplotlib.axes._subplots.AxesSubplot at 0x1ea1563b400>

holidays = df[['Weekly_Sales','Date','week','year','IsHoliday']]
holidays = holidays.loc[holidays['IsHoliday']==True].drop_duplicates()
holidays.head()

	Weekly_Sales	Date	week	year	IsHoliday
73	46039.49	2010-02-12	2	2010	True
74	44682.74	2010-02-12	2	2010	True
75	10887.84	2010-02-12	2	2010	True
76	35351.21	2010-02-12	2	2010	True
77	29620.81	2010-02-12	2	2010	True

NeaamHariri/Walmart-Recruiting---Store-Sales-Forecasting

Walmart Recruiting - Store Sales Forecasting

Technical assestment | Neaam Hariri

Probelm statment

Source : Link to dataset

Dataset :

table of content

Loading and exploring the data

Converting to appropriate datatype, checking null values

Merging stores, features, train in one dateframe

sales visualizations

We can see that holidays have higher weekly sales than non holidays

1- no correlation bteween sales and CPI or Fuel price

2- holidays have higher sales

3- bigger stores have higher sales over time

4- type A of stoers have the highest sales avarege while C have the lowst (logiclly because A is 48% of stores type)

5- 24454.88577906456 is the different between the minimum (store 5) store's sales and the maximum (store 20)

6- end of years have high weekly sales while first months have low weekly sales