Big Mart Sales

My solution to the Big Mart Sales Competition [1].

References

[1] https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/

Description and problem statement

The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. The aim is to build a predictive model and find out the sales of each product at a particular store. Using this model, BigMart can try to understand the properties of products and stores which play a key role in increasing sales.

Data

Here are the features and their descriptions. [figure: data dictionary of the features]

Imputing of missing values

Several of the features had missing values or values that needed to be corrected.

Item_Weight

We impute the missing Item_Weight values with the average Item_Weight of each Item_Identifier. The imputed values look reasonable: the boxplots of the affected outlets now follow the same pattern as the other outlets.

[figures: boxplots of Item_Weight by outlet, before and after imputation]
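
A minimal sketch of this imputation in pandas, assuming the train and test files have been concatenated into a single DataFrame `data` (the CSV file names are placeholders):

```python
import pandas as pd

# Combine train and test so every Item_Identifier is seen during imputation
# (file names are placeholders for the competition CSVs)
train = pd.read_csv('Train.csv')
test = pd.read_csv('Test.csv')
data = pd.concat([train, test], ignore_index=True, sort=False)

# Average Item_Weight per Item_Identifier, computed from the non-missing rows
item_avg_weight = data.groupby('Item_Identifier')['Item_Weight'].transform('mean')

# Fill the missing weights with the per-item average
data['Item_Weight'] = data['Item_Weight'].fillna(item_avg_weight)
```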

Outlet_Size

Grocery Stores and Supermarkets of Type 1 have missing values, as shown in the image below.

[figure: missing Outlet_Size values by Outlet_Type]

Grocery Stores: all the non-missing Outlet_Size values for Grocery Stores are 'Small', so the missing Outlet_Size values of Grocery Stores are replaced with 'Small'.

The remaining missing values are replaced with the modal Outlet_Size of each Outlet_Type, taken from the pivot table below.

[figure: pivot table of the modal Outlet_Size for each Outlet_Type]
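
A sketch of the Outlet_Size imputation, assuming the combined DataFrame `data` from the previous step (the 'Small' spelling follows the raw data):

```python
# Grocery Stores: every observed Outlet_Size is 'Small', so fill with 'Small'
grocery = data['Outlet_Type'] == 'Grocery Store'
data.loc[grocery & data['Outlet_Size'].isna(), 'Outlet_Size'] = 'Small'

# Remaining missing values: use the modal Outlet_Size of each Outlet_Type
size_mode = data.groupby('Outlet_Type')['Outlet_Size'].transform(
    lambda s: s.mode().iloc[0])
data['Outlet_Size'] = data['Outlet_Size'].fillna(size_mode)
```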

Item_Visibility

The minimum value of Item_Visibility is 0, which cannot be right: every item on sale must have some visibility.

[figure: summary statistics of Item_Visibility]

879 zero values out of 14204 rows is too many to ignore, so we replace the zeros with NaN so that they do not skew the mean.

We then impute the missing values for each Item_Type within each Outlet_Type, using the pivot table below.

[figure: pivot table of Item_Visibility by Item_Type and Outlet_Type]
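
A sketch of this step, again on the combined DataFrame `data`; the group mean stands in for the pivot table above (pandas' default pivot aggregation is the mean):

```python
import numpy as np

# Zero visibility is not physically meaningful, so treat it as missing
data['Item_Visibility'] = data['Item_Visibility'].replace(0, np.nan)

# Impute with the mean Item_Visibility of each Item_Type within each Outlet_Type
vis_mean = data.groupby(['Outlet_Type', 'Item_Type'])['Item_Visibility'].transform('mean')
data['Item_Visibility'] = data['Item_Visibility'].fillna(vis_mean)
```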

Item_Fat_Content

There are categories that can be combined: Low Fat, low fat and LF are all Low Fat; reg and Regular are both Regular.

[figure: Item_Fat_Content value counts]
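
The recoding can be done with a single pandas replace (a sketch, assuming the same combined DataFrame `data`):

```python
# Collapse the inconsistent spellings into the two real categories
data['Item_Fat_Content'] = data['Item_Fat_Content'].replace(
    {'low fat': 'Low Fat', 'LF': 'Low Fat', 'reg': 'Regular'})
```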

Feature engineering

We did the following feature engineering:

  • Converted Outlet_Establishment_Year into how old each outlet is, as a new feature Outlet_Age. [figure: Outlet_Age]
  • Created broader categories for the type of item: Food, Drink and Non-Consumable.
  • Changed the Item_Fat_Content of non-consumable items to Non-Edible.
  • Binned Item_MRP into price categories: the Item_MRP distribution in the image below clearly shows 4 different price bands, which we label 'Low', 'Medium', 'High' and 'Very High'. [figure: Item_MRP distribution showing the four price bands]
  • Item_MRP does not change significantly across the stores:

[figure: Item_MRP across the different outlets]

Item_Outlet_Sales is the number of items sold times the Item_MRP, so we created a new variable, Item_Number_Sales, by dividing Item_Outlet_Sales by Item_MRP.

[figure: Item_Number_Sales]
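
A sketch of these feature-engineering steps on the combined DataFrame `data`. The FD/DR/NC prefix mapping, the Item_MRP_Category column name and its cut points are assumptions (read off the dataset conventions and the plot), not values taken from the original notebook:

```python
import pandas as pd

# Outlet age instead of establishment year (the sales data is from 2013)
data['Outlet_Age'] = 2013 - data['Outlet_Establishment_Year']

# Broad item category from the Item_Identifier prefix
# (FD = Food, DR = Drink, NC = Non-Consumable) -- an assumed convention
data['Item_Type_Category'] = data['Item_Identifier'].str[:2].map(
    {'FD': 'Food', 'DR': 'Drink', 'NC': 'Non-Consumable'})

# Non-consumable items cannot have a fat content
data.loc[data['Item_Type_Category'] == 'Non-Consumable',
         'Item_Fat_Content'] = 'Non-Edible'

# Four price bands visible in the Item_MRP plot; the cut points are illustrative
data['Item_MRP_Category'] = pd.cut(
    data['Item_MRP'], bins=[0, 70, 135, 200, 300],
    labels=['Low', 'Medium', 'High', 'Very High'])

# Number of items sold (only defined for the training rows, which have sales)
data['Item_Number_Sales'] = data['Item_Outlet_Sales'] / data['Item_MRP']
```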

Preliminary Analysis

Item_Number_Sales

[figures: exploratory plots of Item_Number_Sales]

Item_Outlet_Sales and Item_MRP vs Item_Visibility

There is a positive correlation between Item_MRP and Item_Outlet_Sales, and a negative correlation between Item_Outlet_Sales and Item_Visibility.

There is no correlation between Item_MRP and Item_Number_Sales, and there is a negative correlation between Item_Number_Sales and Item_Visibility.

[figures: Item_Outlet_Sales and Item_Number_Sales plotted against Item_MRP and Item_Visibility]

Correlation between Item_MRP and Item_Outlet_Sales: 0.5675744466569193
Correlation between Item_MRP and Item_Number_Sales: 0.01114352701232483
Correlation between Item_Visibility and Item_Outlet_Sales: -0.14076174687662235
Correlation between Item_Visibility and Item_Number_Sales: -0.17440844918045084
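
A sketch of how such Pearson correlations can be computed with pandas' `Series.corr`, under the same `data` assumption as above:

```python
# Pearson correlations on the rows that have a sales value (the training set)
tr = data[data['Item_Outlet_Sales'].notna()]
for x in ['Item_MRP', 'Item_Visibility']:
    for y in ['Item_Outlet_Sales', 'Item_Number_Sales']:
        print(x, 'vs', y, ':', tr[x].corr(tr[y]))
```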

Preparation of data for model building

  • Numerical (label) and one-hot encoding of the categorical variables
  • Standardisation of the numerical data (more on this later)
  • Separation into train and test datasets
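
A minimal sketch of this preparation, assuming the engineered DataFrame `data` from the previous sections; the exact column lists are illustrative:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# One-hot encode the categorical variables (column list is illustrative)
categorical = ['Item_Fat_Content', 'Outlet_Size', 'Outlet_Location_Type',
               'Outlet_Type', 'Item_Type_Category', 'Item_MRP_Category',
               'Outlet_Identifier']
data = pd.get_dummies(data, columns=categorical)

# Standardise the numerical columns
numerical = ['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Outlet_Age']
data[numerical] = StandardScaler().fit_transform(data[numerical])

# Split back into train (has the target) and test (target is missing)
train = data[data['Item_Outlet_Sales'].notna()].copy()
test = data[data['Item_Outlet_Sales'].isna()].drop(
    columns=['Item_Outlet_Sales', 'Item_Number_Sales'])
```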

Models

Baseline models

  • Average Sales - Predict the average sales over all training items for every entry in the test set. This is how the resulting data looks:

[figure: predictions from the overall average sales]

  • Average Sales by Item_Type_Category - Predict the average sales per Item_Type_Category, taken from this pivot table:

[figure: pivot table of average sales by Item_Type_Category]

This is how the resulting data looks:

[figure: predictions from the average sales by Item_Type_Category]

  • Average Sales by Product_Type_Category in Particular Outlet_Type - Predict the average sales per Item_Type_Category within each Outlet_Type, taken from this pivot table:

[figure: pivot table of average sales by Item_Type_Category and Outlet_Type]

This is how the resulting data looks:

[figure: predictions from the average sales by Item_Type_Category and Outlet_Type]
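
A sketch of how such a baseline can be computed, assuming un-encoded `train` and `test` DataFrames that still carry the raw categorical columns:

```python
# Mean training sales per Item_Type_Category within each Outlet_Type
group_cols = ['Outlet_Type', 'Item_Type_Category']
group_means = (train.groupby(group_cols)['Item_Outlet_Sales']
                    .mean()
                    .reset_index()
                    .rename(columns={'Item_Outlet_Sales': 'baseline_pred'}))

# Look each test row up in the table of group means
baseline_pred = test.merge(group_means, on=group_cols, how='left')['baseline_pred']
```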

Feature selection with Recursive Feature Elimination and a RandomForestRegressor

One-hot encoding of the categorical variables leaves a total of 56 features (numerical and categorical). Using Recursive Feature Elimination (RFE) from scikit-learn, with a RandomForestRegressor as the estimator, we select the 16 most predictive features for the rest of the modelling, which also helps to avoid over-fitting. These are the features chosen:

[figure: the 16 features selected by RFE]
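
A sketch of the selection step, assuming `X_train` / `y_train` hold the encoded training features and the Item_Outlet_Sales target; the forest hyperparameters are illustrative:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Recursively drop the weakest features until 16 remain
rf = RandomForestRegressor(n_estimators=100, random_state=42)
selector = RFE(estimator=rf, n_features_to_select=16)
selector.fit(X_train, y_train)

# Names of the retained features, and the reduced design matrix
selected = X_train.columns[selector.support_]
X_train_sel = X_train[selected]
print(list(selected))
```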

Model performance comparison

| Model | Parameter Values | Validation dataset RMSE | CV score |
| --- | --- | --- | --- |
| Average Sales | - | 1652 | - |
| Average Sales by Item_Type_Category | - | 1651 | - |
| Average Sales by Product_Type_Category in Particular Outlet_Type | - | 1417 | - |
| Regression | - | 1143 | Mean: 1222 (+/- 142.71), Std: 71.35, Min: 1028, Max: 1312 |
| Regression Ridge | alpha = 0.001 | 1143 | Mean: 1222 (+/- 143.25), Std: 71.63, Min: 1026, Max: 1313 |
| Decision Tree Regressor | max_depth = 10.6, min_samples_leaf = 0.01 | 1103 | Mean: 1180 (+/- 151.02), Std: 75.51, Min: 975.5, Max: 1282 |
| Neural Network | layers = 3, nodes/layer = 100 | 1101 | - |
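
As an illustration of how the CV column can be produced (a sketch assuming the RFE-selected features from above and 10-fold cross-validation, which is a guess at the original setting):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Cross-validated RMSE for the ridge model in the table
model = Ridge(alpha=0.001)
neg_mse = cross_val_score(model, X_train_sel, y_train,
                          scoring='neg_mean_squared_error', cv=10)
rmse = np.sqrt(-neg_mse)
print('Mean %.0f (+/- %.2f), Std %.2f, Min %.0f, Max %.0f'
      % (rmse.mean(), 2 * rmse.std(), rmse.std(), rmse.min(), rmse.max()))
```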