The Spark folder of this repository was written using Databricks. If you want to replicate or continue the work, you can use the free Databricks Community Edition.
The main goal of the repository is to use Spark on Databricks clusters to load and process the data from the Kaggle competition, and to train deep learning models in a distributed fashion.
- Brief EDA of the data set. [link]
- Creation and usage of custom Spark pipelines (see the pipeline sketch after this list). [link]
- Data preparation. [link]
- Model training (see the distributed-training sketch after this list). [link]
- Model prediction (test set). [link]
- Model evaluation (comparison of several different models). [link]
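
The custom pipeline stages subclass Spark ML's `Transformer`. A minimal sketch of what such a stage can look like is below; the class name `DateFeatureExtractor` and its column defaults are illustrative assumptions, not the repository's actual classes.

```python
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql import functions as F


class DateFeatureExtractor(Transformer, HasInputCol, HasOutputCol,
                           DefaultParamsReadable, DefaultParamsWritable):
    """Hypothetical stage: derives a calendar feature from a date column."""

    def __init__(self, inputCol="date", outputCol="month"):
        super().__init__()
        self._set(inputCol=inputCol, outputCol=outputCol)

    def _transform(self, df):
        # Add the month of the input date as a new column.
        return df.withColumn(self.getOutputCol(),
                             F.month(F.col(self.getInputCol())))
```

The `DefaultParamsReadable`/`DefaultParamsWritable` mixins are what make a custom stage saveable and reloadable as part of a `Pipeline`, which is relevant to the persistence TODO at the end of this README.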
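
For distributed deep learning on Databricks, one common approach is `HorovodRunner` from the Databricks ML runtime. Whether these notebooks use it is an assumption, and the toy network and random data below are placeholders, not the repository's actual model.

```python
from sparkdl import HorovodRunner  # available on Databricks ML runtimes


def train():
    import numpy as np
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()
    # Placeholder data; the real notebooks feed prepared store-item features.
    x = np.random.rand(1024, 8).astype("float32")
    y = np.random.rand(1024, 1).astype("float32")

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(8,)),
        tf.keras.layers.Dense(1),
    ])
    # Scale the learning rate by the number of workers and wrap the optimizer.
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(1e-3 * hvd.size()))
    model.compile(loss="mse", optimizer=opt)
    model.fit(x, y, batch_size=64, epochs=5,
              callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)])


hr = HorovodRunner(np=2)  # np: number of parallel worker processes
hr.run(train)
```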
Link to the Kaggle competition: https://www.kaggle.com/c/demand-forecasting-kernels-only
Datasets: https://www.kaggle.com/c/demand-forecasting-kernels-only/data
This competition is provided as a way to explore different time series techniques on a relatively simple and clean dataset.
You are given 5 years of store-item sales data, and asked to predict 3 months of sales for 50 different items at 10 different stores.
What's the best way to deal with seasonality? Should stores be modeled separately, or can you pool them together? Does deep learning work better than ARIMA? Can either beat xgboost?
This is a great competition to explore different models and improve your skills in forecasting.
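
As a starting point, the competition's `train.csv` (columns `date`, `store`, `item`, `sales`) can be read into a Spark DataFrame roughly as follows; the file path is an assumption about where the data was uploaded on Databricks.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed upload location on the Databricks file system.
train_df = (spark.read.csv("/FileStore/tables/train.csv",
                           header=True, inferSchema=True)
            .withColumn("date", F.to_date("date")))

train_df.printSchema()  # expected columns: date, store, item, sales
```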
- Persistence of the pipeline classes needs to be fixed (see the sketch below).
- The pipeline classes need to be revised.
- The data probably needs more feature extraction.
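
On the persistence issue: a custom stage typically fails `Pipeline` saving unless it mixes in `DefaultParamsReadable`/`DefaultParamsWritable`, as in the transformer sketch earlier in this README. A minimal round-trip check, with an assumed save path, could look like this:

```python
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import SQLTransformer

# A built-in stage is used here so the snippet stands alone; custom stages
# need the DefaultParamsReadable/DefaultParamsWritable mixins to round-trip.
stage = SQLTransformer(statement="SELECT *, month(date) AS month FROM __THIS__")
model = Pipeline(stages=[stage]).fit(train_df)  # train_df as loaded above

model.write().overwrite().save("/tmp/demand_pipeline")  # assumed path
reloaded = PipelineModel.load("/tmp/demand_pipeline")
```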