Big Mart Sales

References

[1] https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/

Description and problem statement

The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. The aim is to build a predictive model and find out the sales of each product at a particular store. Using this model, BigMart can try to understand the properties of products and stores which play a key role in increasing sales.

Data

Here are the features and their description.

Imputing of missing values

Several of the features had missing values or values that needed to be corrected.

Item_Weight

We impute the missing values of the Item_Weight by the average Item_Weight of each Item_Identifier. We can see that the values we have chosen to replace the missing weights are reasonable as the boxplot of the affected outlets now follows the same pattern as the other outlets.

Outlet_Size

Grocery Stores and Supermarkets of Type 1 have missing values, as shown in the image below.

Grocery Stores All the non missing values in Grocery Stores are 'small'. So all the missing values in Outlet_Size of Grocery Stores are replaced with 'small'.

All the other missing values in the rest of the data set are replaced with the mode values for each Store Type, from the pivot table below.

Item_Visibility

The min value of Item_Visibility is 0, but this can not be as every item must have some visibility.

879 out of 14204 is a lot so we replace the 0 values for NAN values so the mean value is not affected.

We impute missing values for each Item_Type in each Outlet_Type, from the pivot table below.

Item_Fat_Content

There are categories that can be conbined: Low Fat, low fat and LF are all Low Fat; reg and Regular are both Regular.

Feature engineering

We did the following feature engineering:

Converted the Outlet_Establishment_Years into how old the establishments are, feature Outlet_Age
Created broader categories for type of item: Food, Drink and Non-Consumable.
Changed value of the 'Item_Fat_Content' of the items that are non-consumables, to Non-Edible
Made a new category for items that reflect their sales: The Item_MRP illustrated in the image below clearly shows there are 4 different price categories. So we define them to be 'Low', 'Medium', 'High', 'Very High'.
The Item_MRP does not change significantly accross the stores:

The Item_Outlet_Sales is the number of items sold times the Item_MRP. So we made a new variable with the number of items sold (by dividing the Item_Outlet_Sales by Item_MRP).

Preliminary Analysis

Item_Number_Sales

Item_outlet_sales and Item_MRP vs Item_Visibility

There is a positive correlation between Item_MRP and Item_Outlet_Sales and a negative correlation between Item_Outlet_Sales and visibility.

There is no correlation Item_MRP and Item_Number_Sales and there is a negative correlation between Item_Number_Sales and visibility.

Correlation between Item_MRP and Item_Outlet_Sales: 0.5675744466569193 Correlation between Item_MRP and Item_Number_Sales: 0.01114352701232483

Correlation between Item_Visibility and Item_Outlet_Sales: -0.14076174687662235 Correlation between Item_Visibility and Item_Number_Sales: -0.17440844918045084

Preparation of data for model building

Numerical and One-Hot Coding of Categorical Variables¶
Standardisation of numerical data - More on this later
Separate train and test datasets

Models

Baseline models

Average sales - Replace missing values by the average sales for all items. This is how the resulting data looks:

Average Sales by Item_Type_Category - Replace missing values by the average sales per Item_Type_Category from this pivot table:

This is how the resulting data looks:

Average Sales by Product_Type_Category in Particular Outlet_Type - Replace missing values by the average sales per Item_Type_Category in each Outlet_Type from this pivot table: