GParML

Distributed Variational Inference in Sparse Gaussian Process Regression and Latent Variable Models.


This package contains the Python code used for the paper ``Distributed Variational Inference in Sparse Gaussian Process Regression and Latent Variable Models''. The supplementary material is given in the file ``Variational Inference in the Gaussian Process Latent Variable Model and Sparse GP Regression - Supplementary Material.pdf''.

The core and most important part of the code is the file partial_terms.py, which implements the partial sums presented in the paper and the tutorial, with many references to the equations in the tutorial. (Please note that the equation numbers need updating, as additional equations were introduced to the tutorial for clarity during the review process.) This file contains only 583 lines of Python.
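
As a rough illustration of the idea behind the partial sums (the names and shapes below are invented for illustration and do not match partial_terms.py): each node computes statistics from its own shard of the data, and the global statistics are simple sums of the per-node results.

    import numpy as np

    # Toy setup: two nodes, each holding a shard of the data Y (N x D)
    # and of a cross-covariance K_mn (M x N). All names and shapes are
    # illustrative only.
    rng = np.random.RandomState(0)
    shards = [(rng.randn(50, 4), rng.randn(10, 50)) for _ in range(2)]

    def local_statistics(Y, K_mn):
        # Each node computes its partial terms from its local data only.
        return K_mn.dot(K_mn.T), K_mn.dot(Y)

    # Global statistics are plain sums of the per-node terms, which is
    # what makes the computation embarrassingly parallel.
    partials = [local_statistics(Y, K) for Y, K in shards]
    sum_KK = sum(p[0] for p in partials)   # M x M
    sum_KY = sum(p[1] for p in partials)   # M x D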

The inference can be run sequentially using the simple example files (each roughly 300 lines of code), where different optimisers are used: gd (gradient descent), scg (scaled conjugate gradient), and scg_adapted, which has been adapted to use fewer function evaluations. A generic sketch of what such an optimiser does is given after the list. The example files are:
gd-example.py
scg-example.py
scg_adapted-example.py
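
For readers unfamiliar with these optimisers, here is a minimal, generic sketch of a fixed-step gradient-descent loop (illustrative only; this is not the code in gd.py or the scg variants):

    import numpy as np

    def gradient_descent(f_and_grad, x0, step=0.1, iterations=100):
        # Minimal fixed-step gradient descent, for illustration only.
        # f_and_grad returns the objective value and its gradient at x.
        x = x0
        for _ in range(iterations):
            _, grad = f_and_grad(x)
            x = x - step * grad
        return x

    # Minimise f(x) = sum(x^2), whose gradient is 2x.
    x_opt = gradient_descent(lambda x: (np.sum(x ** 2), 2 * x), np.ones(3))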

The parallel inference is implemented across several files, as it was built to be modular and extensible. These are:
parallel_GPLVM.py
local_MapReduce.py
SGE_MapReduce.py (not up to date)
supporting_functions.py
These files also implement sanity checks for the different inputs, and were used to run the experiments. The underlying map-reduce pattern is sketched below.
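
Conceptually, the division of labour follows a map-reduce pattern: each node computes partial statistics from its own input file (map), and the partials are summed (reduce). A self-contained toy version of the pattern, with made-up per-node work, using Python's multiprocessing:

    from multiprocessing import Pool
    import numpy as np

    def map_step(seed):
        # Stand-in for per-node work: in the real code each node reads
        # its own CSV shard and computes its partial statistics.
        Y = np.random.RandomState(seed).randn(100, 4)
        return Y.T.dot(Y)

    if __name__ == '__main__':
        pool = Pool(3)
        partials = pool.map(map_step, [0, 1, 2])  # map: one task per node
        global_stat = sum(partials)               # reduce: sum the partials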

Unit tests are provided in test.py (although the generation of data for the tests often causes underflows and overflows). These implement finite-differencing tests for the different functions in partial_terms.py, as well as quantitative comparisons to GPy.
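
Finite differencing compares an analytic gradient with a numerical approximation. A generic central-difference version of such a check (a sketch of the technique, not the code in test.py) looks like this:

    import numpy as np

    def max_grad_error(f, grad, x, eps=1e-6):
        # Compare an analytic gradient with central finite differences.
        numerical = np.zeros_like(x)
        for i in range(x.size):
            step = np.zeros_like(x)
            step[i] = eps
            numerical[i] = (f(x + step) - f(x - step)) / (2 * eps)
        return np.max(np.abs(numerical - grad(x)))

    # Example: for f(x) = sum(x^2) the analytic gradient is 2x.
    x = np.array([0.3, -1.2, 2.0])
    assert max_grad_error(lambda v: np.sum(v ** 2), lambda v: 2 * v, x) < 1e-6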

To run the inference (for GPLVM), create a new folder ('test') containing the sub-folders 'inputs', 'embeddings', 'statistics', and 'tmp'. The 'inputs' folder contains (rather confusingly -- this will be changed in future versions) the observed outputs for the GPLVM, while 'embeddings' contains the embedding and variance files (which will be initialised by default using PCA). There are many options available for the inference, which can be inspected using the command:

python parallel_GPLVM.py --help

These options are also listed at the end of this document. To run a minimal inference for 5 iterations over a 4D dataset with 2 latent dimensions and 2 inducing points, the following line can be used:

python parallel_GPLVM.py -i ./test/inputs/ -e ./test/embeddings/ --statistics ./test/statistics/ --tmp ./test/tmp/ -k -T 5 -M 2 -Q 2 -D 4
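
For convenience, the folder layout described above can be created with a few lines of Python (this snippet is not part of the package):

    import os

    # Create the 'test' folder and its required sub-folders if missing.
    for sub in ('inputs', 'embeddings', 'statistics', 'tmp'):
        path = os.path.join('test', sub)
        if not os.path.exists(path):
            os.makedirs(path)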

To run the profiler, the following command can be used:
python -m cProfile -s cumtime test_parallel_gpLVM.py > profiler_test_parallel_gpLVM.txt

The line counts for the provided files are as follows:
    353 gd-example.py
    105 gd_local_MapReduce.py
    127 gd.py
    205 kernel_exp.py
    191 kernels.py
    409 local_MapReduce.py
    113 nputil.py
    510 parallel_GPLVM.py
    583 partial_terms.py
    178 predict.py
     30 pre_process.py
    352 scg_adapted-example.py
    243 scg_adapted_local_MapReduce.py
    314 scg_adapted.py
    363 scg-example.py
    146 scg.py
    453 SGE_MapReduce.py
    169 supporting_functions.py
    301 test.py
   6102 total

The documentation for the code is as follows:

parallel_GPLVM.py
Main script to run; implements parallel inference for the GPLVM on SGE (Sun Grid
Engine), on Hadoop (MapReduce framework), and in a local parallel implementation.

Arguments:
-i, --input
    Folder containing files to be processed. One file will be processed per node. Files are assumed to be in comma-separated-value (CSV) format. (required)
-e, --embeddings
    Existing folder to store embeddings in. One file will be created for each input file. (required)
-p, --parallel
    Which parallel architecture to use (local (default), Hadoop, SGE)
-T, --iterations
    Number of iterations to run; default value is 100
-s, --statistics
    Folder to store statistics files in (default is /tmp)
-k, --keep
    Whether to keep statistics files or to delete them
-l, --load
    Whether to load statistics and embeddings from previous run or initialise new ones
-t, --tmp
    Shared folder to store tmp files in (default is /scratch/tmp)
--init
    Which initialisation to use (PCA (default), PPCA (probabilistic PCA), FA (factor analysis), random)
--optimiser
    Which optimiser to use (SCG_adapted (adapted scaled conjugate gradient - default), GD (gradient descent))
--drop_out_fraction
    Fraction of nodes to drop out (default: 0)

Sparse GP specific options
-M, --inducing_points
    Number of inducing points (default: 10)
-Q, --latent_dimensions
    Number of latent dimensions (default: 10)
-D, --output_dimensions
    Number of output dimensions given in Y (default: 10)
--fixed_embeddings
    If given, embeddings (X) are treated as fixed. This only makes sense when embeddings are given in the embeddings folder in advance.
--fixed_beta
    If given, beta is treated as fixed.

SGE specific options
--simplejson
    SGE simplejson location

Hadoop specific options
--hadoop
    Hadoop folder
--jar
    Jar file for Hadoop streaming
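
As an illustration of combining the options above (an untested example command; it assumes embedding files have already been placed in ./test/embeddings/), inference with fixed embeddings and fixed beta might be run as:

python parallel_GPLVM.py -i ./test/inputs/ -e ./test/embeddings/ --statistics ./test/statistics/ --tmp ./test/tmp/ -k -T 10 -M 10 -Q 2 -D 4 --fixed_embeddings --fixed_beta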