/Addiction-GP

Codebase for "Toward AI-Guided Smoking Cessation: Individualized Nicotine Addiction Modeling Using Gaussian Processes" (medRxiv)


Time series forecasting with Gaussian Processes

Related Publication

The theoretical description of the algorithm implemented in this software and empirical results can be found in:

"Time series forecasting with Gaussian Processes needs priors"
Giorgio Corani, Alessio Benavoli, Marco Zaffalon
Accepted at ECML-PKDD 2021
arXiv preprint: https://arxiv.org/abs/2009.08102

forgp package

The software includes a small package that builds the Gaussian process and uses it to produce predictions. The package relies heavily on GPy. A convenience script can be used to run the GP over collections of timeseries.

forecast.py

forecast.py is an executable Python script that produces forecasts and evaluates our GP over multiple timeseries. The script takes as input a CSV file containing training and test series and produces a CSV file with predictions and scores. The input and output files are described below. Each prediction includes the mean and the upper bound of the 95% confidence interval.

A number of command line arguments can be used to specify custom names for the columns in the input CSV file and to filter the timeseries to be processed. The most useful command line arguments are:

  • --frequency: include only timeseries with a specific frequency
  • --normalize: normalize timeseries using the specified mean and standard deviation
  • --log: verbosity level (100 = max verbosity, 0 = min verbosity). Default: 0
  • --default-priors: use default values for the priors in place of no prior values
  • --help: print a description of the various command line arguments

Input File format

Our tool uses a simple tabular data format serialized as a CSV file. The comma (",") is the only supported field separator. The first line contains the header, and each following line represents a timeseries. Required fields/columns include:

  • st: unique name of the timeseries
  • period: frequency of the timeseries. One of MONTHLY, QUARTERLY, YEARLY and WEEKLY
  • mean: mean for the normalization of the timeseries
  • std: standard deviation for the normalization of the timeseries
  • x: training values of the timeseries
  • xx: test values of the timeseries

Point values of the timeseries (i.e. x and xx) are to be provided as a semicolon (";") separated list of numeric values.
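
As an illustration of the layout described above, the following sketch writes a one-series input file and reads it back with Python's standard csv module. The column names and separators come from the description above; the series name, the numbers, and the helper names (write_input, read_input) are hypothetical. Computing std as the population standard deviation of the training values is also an assumption.

```python
import csv
import io

FIELDS = ["st", "period", "mean", "std", "x", "xx"]


def encode_values(values):
    """Join numeric values into the semicolon-separated form used for x and xx."""
    return ";".join(str(v) for v in values)


def decode_values(text):
    """Split a semicolon-separated field back into a list of floats."""
    return [float(v) for v in text.split(";") if v]


def write_input(handle, rows):
    """Write rows (dicts keyed by FIELDS) as a comma-separated file with a header."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        out = dict(row)
        out["x"] = encode_values(row["x"])
        out["xx"] = encode_values(row["xx"])
        writer.writerow(out)


def read_input(handle):
    """Read the file back, decoding the numeric fields of every series."""
    series = []
    for row in csv.DictReader(handle):
        series.append({
            "st": row["st"],
            "period": row["period"],
            "mean": float(row["mean"]),
            "std": float(row["std"]),
            "x": decode_values(row["x"]),
            "xx": decode_values(row["xx"]),
        })
    return series


# Hypothetical series: six training points and two test points.
train = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0]
mean = sum(train) / len(train)
# Assumption: population standard deviation of the training values.
std = (sum((v - mean) ** 2 for v in train) / len(train)) ** 0.5
example = {"st": "demo1", "period": "MONTHLY", "mean": mean, "std": std,
           "x": train, "xx": [13.0, 15.0]}

buffer = io.StringIO()
write_input(buffer, [example])
buffer.seek(0)
parsed = read_input(buffer)
print(parsed[0]["st"], parsed[0]["x"], parsed[0]["xx"])
```

Reading by header name rather than by position keeps the sketch independent of the column order, which is not specified above.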

Output file format

The output file follows a format similar to the input. It stores the predicted point forecasts and 95% upper bounds as semicolon-separated lists within a comma-separated file, where each line corresponds to a timeseries from the input file. The generated columns include:

  • st: the unique id of the series
  • mean: the mean of the training timeseries
  • std: the standard deviation of the training timeseries
  • center: the mean value of the prediction
  • upper: the upper bound of the 95% confidence prediction band
  • time: time required to fit and predict
  • mae: mean absolute error of the predicted values with respect to the test values (the xx values in the input file)
  • crps: continuous ranked probability score
  • ll: log-likelihood
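
For readers who want to recompute scores of this kind from the center and upper columns, here is a minimal sketch. It assumes a Gaussian predictive distribution whose 95% band is center plus or minus 1.96 standard deviations, so that the standard deviation can be recovered from upper. It also averages the scores over the forecast horizon. Neither assumption is confirmed here as what forecast.py does, and the function names and sample numbers are illustrative.

```python
import math


def sigma_from_upper(center, upper, z=1.96):
    """Recover the per-step predictive std, assuming upper = center + z * sigma."""
    return [(u - c) / z for c, u in zip(center, upper)]


def mae(actual, center):
    """Mean absolute error between test values and point forecasts."""
    return sum(abs(a - c) for a, c in zip(actual, center)) / len(actual)


def gaussian_crps(actual, center, sigma):
    """Mean closed-form CRPS of Gaussian forecasts N(center, sigma^2)."""
    total = 0.0
    for a, c, s in zip(actual, center, sigma):
        z = (a - c) / s
        pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
        total += s * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))
    return total / len(actual)


def gaussian_loglik(actual, center, sigma):
    """Mean Gaussian log-density of the test values under the forecasts."""
    total = 0.0
    for a, c, s in zip(actual, center, sigma):
        total += -0.5 * math.log(2.0 * math.pi * s * s) - 0.5 * ((a - c) / s) ** 2
    return total / len(actual)


# Hypothetical test values and forecasts for a two-step horizon.
actual = [13.0, 15.0]
center = [12.5, 14.0]
upper = [14.46, 15.96]

sigma = sigma_from_upper(center, upper)
print("mae:", mae(actual, center))
print("crps:", gaussian_crps(actual, center, sigma))
print("ll:", gaussian_loglik(actual, center, sigma))
```

Lower mae and crps indicate better forecasts, while a higher log-likelihood is better.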

Priors file

Prior values for the different kernel hyperparameters can be provided via a file. This file contains the priors as a newline-separated list of numbers. These are, in order:

  • standard deviation of variances
  • standard deviation of lengthscales
  • mean of variances
  • mean of rbf lengthscale
  • mean of periodic kernel's lengthscale
  • mean of first spectral kernel's exponential component lengthscale
  • mean of first spectral kernel's cosine component lengthscale
  • mean of second spectral kernel's exponential component lengthscale
  • mean of second spectral kernel's cosine component lengthscale

The spectral kernel entries are used only as required by the selected number of spectral components (Q parameter).
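
The ordering above can be made explicit with a small reader and writer for the priors file. This is a sketch under stated assumptions. The labels in PRIOR_NAMES are illustrative names for the nine positions and are not identifiers from the forgp package. The numbers are placeholders rather than recommended priors. The sketch also assumes all nine values are present in the file.

```python
import io

# Illustrative labels for the nine positions listed above (not forgp identifiers).
PRIOR_NAMES = [
    "std_variance",
    "std_lengthscale",
    "mean_variance",
    "mean_rbf_lengthscale",
    "mean_periodic_lengthscale",
    "mean_spectral1_exp_lengthscale",
    "mean_spectral1_cos_lengthscale",
    "mean_spectral2_exp_lengthscale",
    "mean_spectral2_cos_lengthscale",
]


def read_priors(handle):
    """Parse a newline-separated list of numbers into a name -> value mapping."""
    values = [float(line) for line in handle if line.strip()]
    if len(values) != len(PRIOR_NAMES):
        raise ValueError(f"expected {len(PRIOR_NAMES)} values, got {len(values)}")
    return dict(zip(PRIOR_NAMES, values))


def write_priors(handle, priors):
    """Write the values one per line, in the documented order."""
    for name in PRIOR_NAMES:
        handle.write(f"{priors[name]}\n")


# Placeholder numbers for illustration only; they are not recommended priors.
placeholder = {name: float(i + 1) for i, name in enumerate(PRIOR_NAMES)}

buffer = io.StringIO()
write_priors(buffer, placeholder)
buffer.seek(0)
priors = read_priors(buffer)
print(priors["mean_rbf_lengthscale"])
```

Keeping the order in a single list makes it harder to swap two positions by accident when editing the file by hand.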

Dependencies and setup

A requirements file is provided in the package to ease the installation of all the dependencies. On conda-based systems, a suitable environment can be created with:

conda create --name <env> --file requirements.txt

Example execution

The package includes a number of input files, including the standard M1 [1] and M3 [2] competition timeseries, a sample of the M4 competition [3], and a short example input. To run the script on the example input, run the following command from within the src folder:

./forecast.py --log 100 --default-priors --normalize ../data/example_input example_output

As described above, priors can also be provided via a file:

./forecast.py --log 100 --default-priors --normalize --priors ../data/example_priors ../data/example_input example_output
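
When the script has to be run over several input files, the same command can be assembled from Python. The sketch below uses only the flags shown in the commands above. The helper names build_command and run_forecast are hypothetical, and invoking the script through the current interpreter instead of ./forecast.py is an assumption about the environment.

```python
import subprocess
import sys


def build_command(input_path, output_path, log=0, default_priors=False,
                  normalize=False, priors=None):
    """Assemble the argument list for forecast.py from the flags shown above."""
    cmd = [sys.executable, "forecast.py", "--log", str(log)]
    if default_priors:
        cmd.append("--default-priors")
    if normalize:
        cmd.append("--normalize")
    if priors is not None:
        cmd += ["--priors", priors]
    cmd += [input_path, output_path]
    return cmd


def run_forecast(cmd, src_dir="."):
    """Run the assembled command from the src folder and raise on failure."""
    return subprocess.run(cmd, cwd=src_dir, check=True)


cmd = build_command("../data/example_input", "example_output",
                    log=100, default_priors=True, normalize=True,
                    priors="../data/example_priors")
print(" ".join(cmd))
```

Calling run_forecast(cmd, src_dir="src") would then execute the script, provided the dependencies are installed in the active environment.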

References

[1] Makridakis, S., A. Andersen, R. Carbone, R. Fildes, M. Hibon, R. Lewandowski, J. Newton, E. Parzen, and R. Winkler (1982) The accuracy of extrapolation (time series) methods: results of a forecasting competition. Journal of Forecasting, 1, 111-153.

[2] Makridakis, S. and M. Hibon (2000) The M3-competition: results, conclusions and implications. International Journal of Forecasting, 16, 451-476.

[3] Makridakis, S., E. Spiliotis and V. Assimakopoulos (2020) The M4 Competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting, 36(1), 54-74.