zara-data-challenge-19

Can you forecast fashion sales over three weeks?

Methodology overview • Files included • Possible improvements • Credits • License

Methodology overview

This approach does not rely on sophisticated data analysis or feature extraction. Instead, it aggregates the sales of all colours and sizes per product and fits a powerful forecasting model to predict the revenue. Despite its simplicity, this method reached rank 31/138. The method is as follows:

1. Load stock_and_sales_day_0_day_<n>.csv, products.csv and product_blocks.csv. Create a denormalised table by summing the sales of all colours and sizes per product per day and appending the block identifier. Compute the revenue as sales × price.

	date_number	product_id	block_id	sales	price	revenue
0	0	310130	1726	11	12.95	142.45
1	0	1178388	592	0	49.95	0.00
2	0	1561460	1625	7	29.95	209.65
3	0	1874414	1135	4	25.95	103.80
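
As a rough illustration of step 1, the following pandas sketch builds the denormalised table from toy data. The three small data frames stand in for the CSV files, and every column name that does not appear in the table above (color_id, size_id) is an assumption, as is the idea that the price lives in products.csv. The actual notebook may proceed differently.

```python
import pandas as pd

# Toy stand-ins for the three CSV files. Column names such as
# color_id and size_id are assumptions made for illustration.
stock_and_sales = pd.DataFrame({
    "date_number": [0, 0, 0, 0],
    "product_id": [310130, 310130, 1178388, 1561460],
    "color_id": [1, 2, 1, 1],
    "size_id": [1, 1, 1, 2],
    "sales": [5, 6, 0, 7],
})
products = pd.DataFrame({
    "product_id": [310130, 1178388, 1561460],
    "price": [12.95, 49.95, 29.95],
})
product_blocks = pd.DataFrame({
    "product_id": [310130, 1178388, 1561460],
    "block_id": [1726, 592, 1625],
})

# Sum the sales of all colours and sizes per product per day.
daily = (
    stock_and_sales
    .groupby(["date_number", "product_id"], as_index=False)["sales"]
    .sum()
)

# Append the block identifier and the price, then compute the revenue.
table = (
    daily
    .merge(product_blocks, on="product_id", how="left")
    .merge(products, on="product_id", how="left")
)
table["revenue"] = (table["sales"] * table["price"]).round(2)
table = table[["date_number", "product_id", "block_id", "sales", "price", "revenue"]]
print(table)
```

With the real data, the three frames would presumably be loaded with pd.read_csv from data/raw.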

2. Pivot the previous table to get a new table with shape (nb_products, nb_days), or aggregate by date_number and block_id to obtain a table with shape (nb_blocks, nb_days). Each column X<d> holds the revenue for day d.

product_id	X0	X1	X2	X3	...	X81	X82	X83	X84
151926	NaN	NaN	NaN	NaN	...	129.75	103.80	51.90	51.90
213413	NaN	NaN	NaN	59.85	...	139.65	239.40	279.30	159.60
310130	142.45	168.35	181.30	194.25	...	25.90	77.70	77.70	64.75
455200	NaN	NaN	NaN	NaN	...	0.00	0.00	0.00	0.00

block_id	X0	X1	X2	X3	...	X81	X82	X83	X84
0	674.60	656.90	403.20	950.40	...	1827.65	709.05	389.45	888.85
1	29.95	149.75	89.85	179.70	...	1314.75	1316.60	1105.35	940.45
2	679.40	1228.90	789.25	1138.95	...	719.25	549.45	359.65	849.15
3	53.91	5.99	41.93	83.86	...	0.00	0.00	39.95	0.00
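
The two wide tables above can be produced with a pivot and a group-by. The sketch below uses a small toy version of the denormalised table. The X<d> column naming follows the tables above, while the toy values and the exact pandas calls are assumptions rather than the notebook's code.

```python
import pandas as pd

# Toy version of the denormalised table from step 1.
table = pd.DataFrame({
    "date_number": [0, 1, 1, 0, 1],
    "product_id": [310130, 310130, 213413, 1561460, 1561460],
    "block_id": [1726, 1726, 1726, 1625, 1625],
    "revenue": [142.45, 168.35, 59.85, 209.65, 179.70],
})

# One row per product, one column per day.
# Days on which a product has no record stay NaN.
by_product = table.pivot(index="product_id", columns="date_number", values="revenue")
by_product.columns = [f"X{d}" for d in by_product.columns]

# One row per block: sum the product revenues per block and day,
# then move the days into columns.
by_block = (
    table.groupby(["block_id", "date_number"])["revenue"]
    .sum()
    .unstack("date_number")
)
by_block.columns = [f"X{d}" for d in by_block.columns]

print(by_product)
print(by_block)
```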

3. Apply the method proposed by Montero-Manso et al. (2018) in their submission to the M4 competition. This approach combines several statistical forecasting methods. The combination weights are computed per series by a gradient tree boosting model that takes features extracted from the time series as input. Extend the tables from step 2 with the predictions of the model.

block_id	...	X84	Y85	Y86	Y87	Y88	Y89	Y90	Y91
0	...	888.850	885.58	866.28	858.38	854.33	851.82	850.09	848.85
1	...	940.450	919.58	891.67	867.34	846.10	827.53	811.28	797.04
2	...	849.150	681.57	679.75	683.60	680.29	679.35	680.66	678.77
3	...	0.001	9.86	9.75	9.64	9.53	9.42	9.31	9.20
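
To give an intuition for the combination step, here is a minimal Python sketch of a weighted forecast combination. It is not the M4metalearning code, which is written in R and uses a larger pool of methods. The three base methods (naive, mean, drift) are simple stand-ins, and the scores array is a placeholder for what the gradient boosting model would output for a given series. The sketch assumes the scores are turned into weights with a softmax.

```python
import numpy as np

def naive(y, h):
    # Repeat the last observed value.
    return np.repeat(y[-1], h)

def mean_forecast(y, h):
    # Repeat the historical mean.
    return np.repeat(y.mean(), h)

def drift(y, h):
    # Extrapolate the straight line through the first and last points.
    slope = (y[-1] - y[0]) / (len(y) - 1)
    return y[-1] + slope * np.arange(1, h + 1)

def combine(y, scores, h=7):
    # Stack the base forecasts into an array of shape (3, h).
    base = np.vstack([naive(y, h), mean_forecast(y, h), drift(y, h)])
    # Softmax turns the per-series scores into weights that sum to 1.
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    return weights @ base

y = np.array([10.0, 12.0, 11.0, 13.0, 14.0])  # toy revenue series
scores = np.array([0.0, 0.0, 0.0])            # placeholder model output
forecast = combine(y, scores)
print(forecast)
```

With equal scores the result is the plain average of the three base forecasts. A score that is much larger than the others makes the combination collapse onto a single method.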

4. For each block_id (or product_id) in the previous table, sum the resulting forecasts to estimate the total revenue for the following week. Then make a bet following a particular heuristic, for example by ranking the blocks according to their predicted revenue and picking from the top of the ranking until the chosen blocks contain 50 products. This greedy heuristic is not a very good one. A better heuristic would take into account the number of products in each block, giving preference to blocks with fewer items.
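
The betting step can be sketched as follows. The function pick_blocks and its parameters are hypothetical names, not taken from generate_bets.ipynb. The sketch assumes the bet may contain at most 50 products and skips any block that would exceed that budget. The per_product option is one possible reading of the better heuristic: it ranks blocks by predicted revenue per product, which favours blocks with fewer items.

```python
def pick_blocks(forecasts, sizes, budget=50, per_product=False):
    """Pick blocks in ranking order without exceeding the product budget.

    forecasts: block_id -> predicted revenue summed over the forecast week
    sizes:     block_id -> number of products in the block
    """
    def score(block):
        if per_product:
            return forecasts[block] / sizes[block]
        return forecasts[block]

    chosen, used = [], 0
    for block in sorted(forecasts, key=score, reverse=True):
        # Skip blocks that no longer fit in the remaining budget.
        if used + sizes[block] <= budget:
            chosen.append(block)
            used += sizes[block]
    return chosen

# Toy numbers, not competition data.
forecasts = {0: 6000.0, 1: 5900.0, 2: 4800.0, 3: 70.0}
sizes = {0: 40, 1: 20, 2: 25, 3: 5}

greedy = pick_blocks(forecasts, sizes)
density = pick_blocks(forecasts, sizes, per_product=True)
print(greedy, sum(forecasts[b] for b in greedy))
print(density, sum(forecasts[b] for b in density))
```

In this toy example the plain greedy ranking spends most of the budget on one large block, while the per-product ranking fits three blocks into the same 50 products and reaches a higher predicted revenue.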

Files included

generate_ts_datasets.ipynb preprocesses the original data to generate a time series dataset of shape (nb_blocks, nb_days). Each time series represents the revenue over time of a particular block (steps 1 and 2). It can easily be modified to generate the same dataset for products.
generate_forecasts.R applies the proposed forecasting model to generate predictions for the following week (step 3).
generate_bets.ipynb uses the previous forecasts to pick a bet (step 4).
data/preprocessed contains the output of step 2 for each day of the competition.
data/model contains the output of step 3 for each day of the competition, as well as the xgboost model trained for the last day.

I have not included the original data. If you want to reproduce my results, you must place this data (i.e., the .csv files) in data/raw.

Possible improvements

Compute additional features from the time series by using the information in the positions_day_0_day_<n>.csv file (e.g., the best position of a product across all colours and sizes for each day).
Improve the heuristic used in generate_bets.ipynb.
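
The first suggestion above (the best position of a product per day) could be computed along these lines. The layout of the positions file is not described here, so all column names in this sketch are assumptions, as is the convention that a lower position number is better.

```python
import pandas as pd

# Toy stand-in for positions_day_0_day_<n>.csv; column names are assumed.
positions = pd.DataFrame({
    "date_number": [0, 0, 0, 1, 1],
    "product_id": [310130, 310130, 1178388, 310130, 1178388],
    "color_id": [1, 2, 1, 1, 1],
    "size_id": [1, 1, 1, 1, 1],
    "position": [12, 3, 40, 7, 35],
})

# Best (assumed: lowest) position across colours and sizes,
# per product and day, laid out like the revenue tables.
best_position = (
    positions.groupby(["product_id", "date_number"])["position"]
    .min()
    .unstack("date_number")
)
best_position.columns = [f"P{d}" for d in best_position.columns]
print(best_position)
```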

Credits

I do not own the code used to generate the forecasts. As mentioned above, I used the method proposed by Montero-Manso et al. (2018) for the M4 competition. You can find the original code at robjhyndman/M4metalearning. I adapted that code to the dataset provided by ZARA.

License

This project is licensed under the MIT License - see the LICENSE.md file for details.