This project uses Facebook Prophet to forecast sales from a dataset covering 1115 stores. The predictive model forecasts future sales from historical data while taking into account seasonality effects, demand, holidays, promotions, and competition.
The dataset used in this project is available on Google Drive: https://drive.google.com/drive/u/0/folders/1yWxgxkqNPTcVkJBbHNefgyAjzYAv3sTP
To stay competitive and grow, companies can use AI/ML to build predictive models that forecast future sales. These models learn from historical data and account for seasonality effects, demand, holidays, promotions, and competition.
In this project, we predict future daily sales based on the features of 1115 stores, using Facebook Prophet as the predictive model. Prophet is open source software released by Facebook's Core Data Science team. It is a procedure for forecasting time series data based on an additive model in which non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. Prophet works best with time series that have strong seasonal effects and several seasons of historical data.
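For reference, the additive model mentioned above is usually written as a sum of components. This is the general decomposition described in Prophet's own documentation and paper, not something specific to this project:

$$
y(t) = g(t) + s(t) + h(t) + \varepsilon_t
$$

Here $g(t)$ is the trend, $s(t)$ the periodic seasonality (yearly, weekly, daily), $h(t)$ the holiday effects, and $\varepsilon_t$ the error term.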
The dataset consists of two CSV files: the first contains the sales records of the 1115 stores, and the second contains information about the stores themselves.
The first two rows of the sales data are shown below:
Store | DayofWeek | Date | Sales | Customers | Open | Promo | StateHoliday | SchoolHoliday |
---|---|---|---|---|---|---|---|---|
1 | 5 | 2015-07-31 | 5263 | 555 | 1 | 1 | 0 | 1 |
2 | 5 | 2015-07-31 | 6064 | 625 | 1 | 1 | 0 | 1 |
- Id: transaction ID (combination of Store and date)
- Store: unique store Id
- Sales: sales/day, this is the target variable
- Customers: number of customers on a given day
- Open: Boolean to say whether a store is open or closed (0 = closed, 1 = open)
- Promo: describes if store is running a promo on that day or not
- StateHoliday: indicates which state holiday applies (a = public holiday, b = Easter holiday, c = Christmas, 0 = None)
- SchoolHoliday: indicates if the (Store, Date) was affected by the closure of public schools
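As a rough sketch of how the sales file might be loaded with pandas, the snippet below reads the two sample rows from the table above out of an in-memory buffer. The file name `train.csv` mentioned in the comment is hypothetical, and the explicit `dtype` for `StateHoliday` is an assumption based on the column mixing `0` with letter codes.

```python
import io

import pandas as pd

# The two sample rows from the table above. A real run would pass a file
# path such as "train.csv" (hypothetical name) instead of this buffer.
# Column names follow the spelling used in this README.
sales_csv = io.StringIO(
    "Store,DayofWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday\n"
    "1,5,2015-07-31,5263,555,1,1,0,1\n"
    "2,5,2015-07-31,6064,625,1,1,0,1\n"
)

sales_df = pd.read_csv(
    sales_csv,
    parse_dates=["Date"],         # turn the Date strings into timestamps
    dtype={"StateHoliday": str},  # the column mixes "0" with "a"/"b"/"c"
)

print(sales_df.dtypes)
print(sales_df.head(2))
```

Reading `StateHoliday` as a string keeps `0` and the letter codes in one consistent type, which makes later filtering simpler.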
The first two rows of the store data are shown below:
Store | StoreType | Assortment | CompetitionDistance | CompetitionOpenSinceMonth | Promo2 | Promo2SinceWeek | Promo2SinceYear | PromoInterval |
---|---|---|---|---|---|---|---|---|
1114 | a | c | 870.0 | NaN | 0 | NaN | NaN | NaN |
1112 | d | c | 5350.0 | NaN | 1 | 22.0 | 2012.0 | Mar,Jun,Sept,Dec |
- StoreType: categorical variable to indicate type of store (a, b, c, d)
- Assortment: describes an assortment level: a = basic, b = extra, c = extended
- CompetitionDistance (meters): distance to closest competitor store
- CompetitionOpenSince [Month/Year]: provides an estimate of the date when the nearest competitor opened
- Promo2: Promo2 is a continuing and consecutive promotion for some stores (0 = store is not participating, 1 = store is participating)
- Promo2Since [Year/Week]: date when the store started participating in Promo2
- PromoInterval: describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store
Fortunately, the sales data has no missing values, so we proceed with data visualization.
- An average of about 600 customers per day; the visible maximum is about 4500 (the outlier at 7388 cannot be seen in the plot)
- Data is equally distributed across the days of the week (~150,000 observations x 7 days = ~1.1 million observations)
- Stores are open ~80% of the time
- Data is equally distributed among all stores (no bias)
- Promo #1 was running ~40% of the time
- Average sales are around 5000-6000 euros
- School holidays apply ~18% of the time
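The missing-value check and the headline figures in the list above could be computed along these lines. This is a minimal sketch: `summarize_sales` is a hypothetical helper, and the toy frame holds invented values that only demonstrate the calls.

```python
import pandas as pd


def summarize_sales(df: pd.DataFrame) -> dict:
    """Return the headline numbers used in the exploration step."""
    return {
        "missing_values": int(df.isnull().sum().sum()),
        "mean_customers": float(df["Customers"].mean()),
        "max_customers": int(df["Customers"].max()),
        "mean_sales": float(df["Sales"].mean()),
        # For 0/1 flag columns, the mean is the share of rows equal to 1.
        "share_open": float(df["Open"].mean()),
        "share_promo": float(df["Promo"].mean()),
        "share_school_holiday": float(df["SchoolHoliday"].mean()),
    }


# Toy frame with invented values, only to show the calls run end to end.
toy_df = pd.DataFrame(
    {
        "Customers": [555, 625, 0, 821],
        "Sales": [5263, 6064, 0, 8314],
        "Open": [1, 1, 0, 1],
        "Promo": [1, 1, 0, 0],
        "SchoolHoliday": [1, 1, 0, 0],
    }
)

summary = summarize_sales(toy_df)
print(summary)
```

Distribution plots for the same columns are commonly drawn with `DataFrame.hist`, which requires matplotlib.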
Next, we look at how many records belong to open stores and how many to closed stores.
We keep the records for open stores and remove those for closed stores. The Open column carries no information after this step, so we drop it.
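A minimal sketch of this filtering step, assuming the sales data sits in a pandas DataFrame. The toy rows are invented for illustration.

```python
import pandas as pd

# Toy rows with invented values; the real frame has one row per store and day.
sales_df = pd.DataFrame(
    {
        "Store": [1, 2, 3],
        "Sales": [5263, 6064, 0],
        "Open": [1, 1, 0],
    }
)

# Count open (1) and closed (0) rows before filtering.
open_counts = sales_df["Open"].value_counts()

# Keep only rows where the store was open, then drop the now-constant column.
open_sales_df = sales_df[sales_df["Open"] == 1].drop(columns=["Open"])

print(open_counts)
print(open_sales_df)
```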
The store data has the following columns with missing values, handled as noted:
- CompetitionDistance: 3 missing values (filled with the mean of the CompetitionDistance column)
- CompetitionOpenSinceMonth: 354 missing values (filled with zero)
- CompetitionOpenSinceYear: 354 missing values (filled with zero)
- Promo2SinceWeek: 544 missing values (filled with zero)
- Promo2SinceYear: 544 missing values (filled with zero)
- PromoInterval: 544 missing values (filled with zero)
We fill these columns with zero because of the Promo2 column: when Promo2 is zero, the Promo2SinceWeek, Promo2SinceYear, and PromoInterval values are missing, so zero is a natural placeholder. We apply the same reasoning to the competition columns, assuming that where there is no promo there is no competition information either.
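The fill strategy described above can be sketched as follows. The store rows are invented for illustration, and `zero_fill_columns` is a name introduced here, not one taken from the project code.

```python
import numpy as np
import pandas as pd

# Toy store table with invented values and a few deliberate gaps.
store_df = pd.DataFrame(
    {
        "Store": [1, 2, 3],
        "CompetitionDistance": [1270.0, np.nan, 570.0],
        "CompetitionOpenSinceMonth": [9.0, np.nan, 11.0],
        "CompetitionOpenSinceYear": [2008.0, np.nan, 2007.0],
        "Promo2": [0, 0, 1],
        "Promo2SinceWeek": [np.nan, np.nan, 13.0],
        "Promo2SinceYear": [np.nan, np.nan, 2010.0],
        "PromoInterval": pd.Series([None, None, "Jan,Apr,Jul,Oct"], dtype=object),
    }
)

zero_fill_columns = [
    "CompetitionOpenSinceMonth",
    "CompetitionOpenSinceYear",
    "Promo2SinceWeek",
    "Promo2SinceYear",
    "PromoInterval",
]

# Missing promo and competition-date fields become zero.
for column in zero_fill_columns:
    store_df[column] = store_df[column].fillna(0)

# Missing distances take the mean of the observed distances.
mean_distance = store_df["CompetitionDistance"].mean()
store_df["CompetitionDistance"] = store_df["CompetitionDistance"].fillna(mean_distance)

print(store_df)
```

Filling `PromoInterval` with the integer 0 leaves a column that mixes numbers and strings, so any later encoding step has to handle both.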
- About half of the stores participate in Promo2
- About half of the stores have their nearest competitor within 0-3000 m (up to 3 km away)
With both datasets cleaned, we merge them into one dataset. The first two rows of the merged dataset are shown below:
Store | DayofWeek | Date | Sales | Customers | Promo | StateHoliday | SchoolHoliday | StoreType | Assortment | CompetitionDistance | CompetitionOpenSinceMonth | CompetitionOpenSinceYear | Promo2 | Promo2SinceWeek | Promo2SinceYear |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 5 | 2015-07-31 | 5263 | 555 | 1 | 0 | 1 | c | a | 1270.0 | 9.0 | 2008.0 | 0 | 0.0 | 0.0 |
1 | 4 | 2015-07-30 | 5020 | 546 | 1 | 0 | 1 | c | a | 1270.0 | 9.0 | 2008.0 | 0 | 0.0 | 0.0 |
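A merge like the one shown above can be done with `pandas.merge` on the shared `Store` column. This sketch assumes an inner join and uses a reduced set of columns. The store 1 values come from the table above, while the store 2 attributes are illustrative.

```python
import pandas as pd

sales_df = pd.DataFrame(
    {
        "Store": [1, 1, 2],
        "Date": pd.to_datetime(["2015-07-31", "2015-07-30", "2015-07-31"]),
        "Sales": [5263, 5020, 6064],
    }
)

store_df = pd.DataFrame(
    {
        "Store": [1, 2],
        "StoreType": ["c", "a"],                # store 2 value is illustrative
        "CompetitionDistance": [1270.0, 570.0], # store 2 value is illustrative
    }
)

# Every sales row picks up the attributes of its store.
merged_df = pd.merge(sales_df, store_df, how="inner", on="Store")

print(merged_df)
```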
- Customers, Promo2, and Sales are strongly correlated
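One way to inspect such correlations is to rank the numeric columns by their Pearson correlation with `Sales`. The frame below holds invented values, so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

# Toy frame with invented values, only to show the ranking step.
toy_df = pd.DataFrame(
    {
        "Customers": [500, 600, 700, 800],
        "Sales": [5000, 6000, 7000, 8000],
        "Promo": [0, 1, 0, 1],
        "StoreType": ["a", "b", "a", "b"],
    }
)

# Restrict to numeric columns before computing the correlation matrix,
# then keep the Sales column and rank the remaining features.
numeric_df = toy_df.select_dtypes(include="number")
sales_corr = (
    numeric_df.corr()["Sales"]
    .drop("Sales")
    .sort_values(ascending=False)
)

print(sales_corr)
```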
Before further visualization, we split the date into separate month, day, and year columns.
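Assuming `Date` has already been parsed as a datetime column, the split can be sketched with the `.dt` accessor. The new column names `Year`, `Month`, and `Day` are choices made for this example.

```python
import pandas as pd

# The two dates and sales values come from the merged table above.
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2015-07-31", "2015-07-30"]),
        "Sales": [5263, 5020],
    }
)

# Derive calendar parts from the timestamp column.
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["Day"] = df["Date"].dt.day

print(df)
```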
Sales and the number of customers appear to peak around Christmas.
- The number of customers is generally lowest around the 24th of the month
- Customers and sales are highest around the 30th and the 1st of the month
Sales and the number of customers appear to peak around Saturday and Sunday.
- Store type b has the highest average sales
- Store type a has the lowest average sales
Promos can increase both sales and the number of customers.
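Observations like these usually come from grouped averages. The sketch below uses a hypothetical helper `mean_by` and a toy frame with invented values. The same call works for `Month`, `Day`, or the day-of-week column.

```python
import pandas as pd


def mean_by(df: pd.DataFrame, key: str) -> pd.DataFrame:
    """Average Sales and Customers for each value of `key`."""
    return df.groupby(key)[["Sales", "Customers"]].mean()


# Toy frame with invented values, only to show the grouping step.
toy_df = pd.DataFrame(
    {
        "Promo": [0, 0, 1, 1],
        "StoreType": ["a", "b", "a", "b"],
        "Sales": [4000, 5000, 7000, 8000],
        "Customers": [400, 500, 700, 800],
    }
)

by_promo = mean_by(toy_df, "Promo")
by_store_type = mean_by(toy_df, "StoreType")

print(by_promo)
print(by_store_type)
```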
We used Facebook Prophet for our predictive model and trained it on the historical sales data of each store. The following is the 60-day sales forecast for store number 10:
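A per-store forecast along these lines can be sketched as below. The helper names `to_prophet_frame` and `forecast_store` are hypothetical, the toy history is invented, and the default `Prophet()` settings are an assumption since the project's actual configuration is not shown here. Prophet itself requires the input columns to be named `ds` and `y`.

```python
import pandas as pd


def to_prophet_frame(df: pd.DataFrame, store_id: int) -> pd.DataFrame:
    """Select one store and rename columns to the ds/y names Prophet expects."""
    store_rows = df[df["Store"] == store_id]
    frame = store_rows[["Date", "Sales"]].rename(columns={"Date": "ds", "Sales": "y"})
    return frame.sort_values("ds").reset_index(drop=True)


def forecast_store(df: pd.DataFrame, store_id: int, periods: int) -> pd.DataFrame:
    """Fit Prophet on one store's history and forecast `periods` days ahead."""
    # Imported lazily so the reshaping helper works without Prophet installed.
    # Older installations expose the same class from the `fbprophet` package.
    from prophet import Prophet

    model = Prophet()
    model.fit(to_prophet_frame(df, store_id))
    future = model.make_future_dataframe(periods=periods)
    return model.predict(future)


# Toy history with invented values, only to show the reshaping step.
history = pd.DataFrame(
    {
        "Store": [10, 10, 6],
        "Date": pd.to_datetime(["2015-07-31", "2015-07-30", "2015-07-31"]),
        "Sales": [7185, 6816, 5651],
    }
)

prophet_frame = to_prophet_frame(history, store_id=10)
print(prophet_frame)
```

With the full dataset loaded, `forecast_store(sales_df, store_id=10, periods=60)` would correspond to the 60-day horizon. The returned frame includes the `yhat`, `yhat_lower`, and `yhat_upper` columns.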
In this part, we incorporated holiday information into the model. The following is the 90-day sales forecast for store number 6:
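Prophet accepts holidays as a DataFrame with `ds` and `holiday` columns passed to the `holidays` argument. The sketch below builds such a frame from the `StateHoliday` and `SchoolHoliday` columns. The helper names, the holiday labels, and the toy rows are assumptions made for illustration, not the project's actual code.

```python
import pandas as pd


def build_holidays(df: pd.DataFrame, store_id: int) -> pd.DataFrame:
    """Collect state and school holiday dates for one store in Prophet's format."""
    store_rows = df[df["Store"] == store_id]

    state = (
        store_rows.loc[store_rows["StateHoliday"].isin(["a", "b", "c"]), ["Date"]]
        .rename(columns={"Date": "ds"})
        .assign(holiday="state_holiday")
    )
    school = (
        store_rows.loc[store_rows["SchoolHoliday"] == 1, ["Date"]]
        .rename(columns={"Date": "ds"})
        .assign(holiday="school_holiday")
    )
    return pd.concat([state, school], ignore_index=True)


def forecast_with_holidays(train: pd.DataFrame, holidays: pd.DataFrame, periods: int) -> pd.DataFrame:
    """Fit Prophet with holiday effects; `train` must have ds and y columns."""
    from prophet import Prophet  # imported lazily; requires Prophet installed

    model = Prophet(holidays=holidays)
    model.fit(train)
    future = model.make_future_dataframe(periods=periods)
    return model.predict(future)


# Toy rows with invented values, only to show the holiday frame.
history = pd.DataFrame(
    {
        "Store": [6, 6, 6, 10],
        "Date": pd.to_datetime(["2014-12-25", "2015-07-30", "2015-07-31", "2014-12-25"]),
        "StateHoliday": ["c", "0", "0", "c"],
        "SchoolHoliday": [0, 1, 0, 0],
    }
)

holidays = build_holidays(history, store_id=6)
print(holidays)
```

Holiday dates should be collected before closed days are removed, since stores are often closed on state holidays and those rows would otherwise be lost.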