This is a time series forecasting project based on the Wikipedia Web Traffic Time Series Forecasting dataset from Kaggle. Two RNN architectures are implemented:
- A "Vanilla" RNN regressor.
- A Seq2seq regressor.
Both are implemented in TensorFlow 2, with custom training functions compiled with AutoGraph via `tf.function`.
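To illustrate the idea, here is a hedged sketch of a custom training step compiled with AutoGraph via `@tf.function`. The toy GRU model, loss, and data shapes below are illustrative only, not the repository's actual code:

```python
# Hedged sketch: custom training step traced by AutoGraph via @tf.function.
# Model, loss, and shapes are illustrative, not the repository's actual code.
import numpy as np
import tensorflow as tf

# Toy "vanilla" RNN regressor: 30 past timesteps in, next value out.
model = tf.keras.Sequential([
    tf.keras.layers.GRU(32, input_shape=(30, 1)),
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.Adam(1e-3)
loss_fn = tf.keras.losses.MeanAbsoluteError()

@tf.function  # traced once, then executed as a TensorFlow graph
def train_step(x, y):
    with tf.GradientTape() as tape:
        y_pred = model(x, training=True)
        loss = loss_fn(y, y_pred)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# One step on random data, just to show the call pattern.
x = np.random.rand(8, 30, 1).astype(np.float32)
y = np.random.rand(8, 1).astype(np.float32)
loss = train_step(tf.constant(x), tf.constant(y))
```

Wrapping the step in `tf.function` avoids Python overhead on every batch, which matters on a dataset of this size.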
Main files:
- `config.yaml`: config file for hyperparameters.
- `dataprep.py`: data preprocessing pipeline.
- `train.py`: training pipeline.
- `tools.py`: utility processing functions used by the main pipelines.
- `model.py`: builds the models.
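For orientation, a hyperparameter config for this kind of project might look like the sketch below. The keys and values are hypothetical, chosen only to show the pattern; check the repository's actual `config.yaml` for the real parameter names:

```yaml
# Hypothetical example - not the repository's actual keys or values
model: seq2seq        # or: vanilla
len_input: 30         # input window length (timesteps)
len_target: 7         # forecast horizon (timesteps)
batch_size: 128
learning_rate: 0.001
epochs: 10
```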
I also added a `visualize_performance.ipynb` Jupyter Notebook to visually inspect the models' performance on Test data.
Folders:
- `/data_raw/`: requires the unzipped `train_2.csv` file from Kaggle. An `imputed.csv` dataset is also available, containing imputed time series from my other repository on a GAN for imputation of missing data in time series.
- `/data_processed/`: divided into `/Train/` and `/Test/` directories.
- `/saved_models/`: contains all saved TensorFlow models, both regressors.
- `/utils/`: for pics and other secondary files.
After you clone the repository locally, download the raw dataset from Kaggle and place the unzipped `train_2.csv` file in the `/data_raw/` folder.
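For context on what a preprocessing pipeline like `dataprep.py` typically produces for RNN forecasting, here is a hedged numpy sketch of slicing one traffic series into (input, target) windows. The window lengths and function name are illustrative only, not the repository's actual parameters:

```python
# Hedged sketch of sliding-window slicing for RNN forecasting.
# Window lengths are illustrative, not the repository's actual settings.
import numpy as np

def make_windows(series, len_input=30, len_target=7):
    """Slice a 1-D series into overlapping (input, target) pairs."""
    X, Y = [], []
    for start in range(len(series) - len_input - len_target + 1):
        X.append(series[start : start + len_input])
        Y.append(series[start + len_input : start + len_input + len_target])
    return np.array(X), np.array(Y)

series = np.arange(100, dtype=np.float32)  # stand-in for one page's traffic
X, Y = make_windows(series)
# X.shape == (64, 30), Y.shape == (64, 7)
```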
Then, the time series forecast is executed in two steps. First, run the data preprocessing pipeline:

```
python -m dataprep
```

This will generate Training+Validation and Test files, stored in `/data_processed/` subdirectories. Second, launch the training pipeline with:

```
python -m train
```

This will either create, train, and save a new model, or load and train an already existing one, stored in the `/saved_models/` folder.
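The create-or-load behavior can be sketched as follows; the path, file name, and toy architecture are illustrative assumptions, not the repository's actual code:

```python
# Hedged sketch of create-or-load logic for saved models.
# Path, file name, and toy architecture are illustrative assumptions.
from pathlib import Path
import tensorflow as tf

model_path = Path("saved_models/vanilla.keras")  # hypothetical file name

if model_path.exists():
    model = tf.keras.models.load_model(str(model_path))  # resume training
else:
    # Fresh toy model; in the repo, model.py builds the real architecture.
    model = tf.keras.Sequential([
        tf.keras.layers.GRU(32, input_shape=(30, 1)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mae")

# ... training would run here ...
model_path.parent.mkdir(parents=True, exist_ok=True)
model.save(str(model_path))
```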
Finally, Test set performance can be evaluated with the `visualize_performance.ipynb` notebook.
Requirements:

```
numpy==1.18.3
pandas==1.0.3
scikit-learn==0.22.2.post1
scipy==1.4.1
tensorflow==2.1.0
tqdm==4.45.0
```
I used a fairly powerful laptop, with 64 GB of RAM and an NVIDIA RTX 2070 GPU. I highly recommend GPU training to avoid excessive computation times.