This is a private fork intended for my own personal use. Please visit https://github.com/JamesYang007/autoppl for the most up-to-date information on this project.
AutoPPL is a C++ template library providing high-level support for probabilistic programming. Using operator overloading and expression templates, AutoPPL provides a generic framework for specifying probabilistic models and applying inference algorithms.
See the Examples section for more detail on how to use these sampling algorithms.
In Bayesian statistics, the first step in performing probabilistic inference is to specify a generative model for the data of interest. For example, in mathematical notation, a Gaussian model could look like the following:
```
X ~ N(mu, I)
mu ~ U(-1, 1)
```
Given its simplicity and expressiveness, we wanted to mimic this compact notation as much as possible for users of our library.
For the model specified above, the code in AutoPPL could look like the following:
```cpp
ppl::Data<double> X;
ppl::Param<double> mu;
auto model = (
    mu |= ppl::uniform(-1., 1.),
    X |= ppl::normal(mu, 1.)
);
```
A ppl::Param<T> stores a univariate parameter of the model, while ppl::Data<T> holds samples observed from the distribution. The goal of Bayesian inference is then to sample from the posterior distribution given the observed data, or to maximize the likelihood of the data over a set of possible parameters. AutoPPL supports both approaches through its model expressions and sampling algorithms.
We make the assumption that users are able to specify the probabilistic model at compile-time. As a result, AutoPPL can construct all model expressions at compile-time simply from the type information using expression templates. A model expression is simply a binary tree that relates variables with distributions, which can potentially depend on other variables.
As an example, the model expression model from the previous section is a binary tree constructed at compile-time that stores the relationship between mu, X, and their (conditional) distributions.
A model object makes no heap allocations and is extremely minimal in size: it is simply a small, contiguous slab of memory representing the binary tree. The model object in the previous section is about 48 bytes on x86_64-apple-darwin17.7.0 using clang-10.
For a more complicated model such as the following:
```cpp
ppl::Data<double> X;
std::array<ppl::Param<double>, 6> theta;
auto model = (
    theta[0] |= ppl::uniform(-1., 1.),
    theta[1] |= ppl::uniform(theta[0], theta[0] + 2.),
    theta[2] |= ppl::normal(theta[1], theta[0] * theta[0]),
    theta[3] |= ppl::normal(-2., 1.),
    theta[4] |= ppl::uniform(-0.5, 0.5),
    theta[5] |= ppl::normal(theta[2] + theta[3], theta[4]),
    X |= ppl::normal(theta[5], 1.)
);
```
The size of this model is 192 bytes on the same machine and compiler. A model expression only references the variables used in the expression (theta[0], theta[1], ..., theta[5], X); it does not store any data itself, but serves solely to relate variables.
Users interface with the inference methods via model expressions and other configuration parameters for that particular method. Hence, the inference algorithms are completely general and work with any model, so long as the model expression is properly constructed. Moreover, because the model specification is known statically, the algorithms can make compile-time optimizations (see Examples for more detail). Drawing 100 samples from the posterior of the following hierarchical Gaussian model using a NUTS sampler, we see significant gains over STAN and PyMC3 in both compilation time and runtime:
```
Y ~ N(l1 + l2, sigma)
l1 ~ N(0, 10)
l2 ~ N(0, 10)
sigma ~ Uniform(0, 20)
```
|                  | AutoPPL   | STAN                                                          | PyMC3 |
|------------------|-----------|---------------------------------------------------------------|-------|
| Compilation time | 3.39s     | 40.48s                                                        | -     |
| Run time         | 0.987008s | 0.613672s (warm-up) + 0.704271s (sampling) = 1.31794s (total) | 38s   |
First, clone the repository:
```shell
git clone --recurse-submodules https://github.com/JamesYang007/autoppl
```
To build and run tests, run the following:
```shell
./setup.sh
./clean-build.sh debug
cd build/debug
ctest
```
Although AutoPPL was designed to perform inference on posterior distributions, it can certainly be used to sample from any joint distribution defined by the priors and conditional distributions. As an example, we can sample 1000 points from a standard normal distribution using Metropolis-Hastings in the following way:
```cpp
std::array<double, 1000> samples;
ppl::Param<double> theta {samples.data()};
auto model = (
    theta |= ppl::normal(0., 1.)
);
ppl::mh(model, 1000);
```
Note that the parameter theta is bound to the storage samples at construction. After calling ppl::mh, the 1000 samples are placed into samples.
In general, one can sample from a distribution so long as its joint PDF is known, or, equivalently and more commonly, if the conditional and prior PDFs are known.
As another example, we may sample from a more complicated joint distribution:
```cpp
std::array<double, 1000> theta1_samples;
std::array<double, 1000> theta2_samples;
ppl::Param<double> theta1 {theta1_samples.data()};
ppl::Param<double> theta2 {theta2_samples.data()};
auto model = (
    theta1 |= ppl::uniform(-1., 1.),
    theta2 |= ppl::normal(theta1, 1.)
);
ppl::mh(model, 1000);
```
The following example fits a Gaussian model to some data. We place a Normal(0, 3) prior on the mean and a Uniform(0, 2) prior on the standard deviation. We recommend the state-of-the-art NUTS sampler for sampling from posterior distributions.
```cpp
std::array<double, 1000> mu_samples, sigma_samples;
ppl::Data<double> x {1.0, 1.5, 1.7, 1.2, 1.5};
ppl::Param<double> mu {mu_samples.data()};
ppl::Param<double> sigma {sigma_samples.data()};
auto model = (
    mu |= ppl::normal(0., 3.),
    sigma |= ppl::uniform(0., 2.),
    x |= ppl::normal(mu, sigma)
);
size_t warmup = 1000;
size_t n_samples = 1000;
size_t n_adapt = 1000;
ppl::nuts(model, warmup, n_samples, n_adapt);
```
This example covers ordinary linear regression in a Bayesian setting. We created a fictitious dataset consisting of (x, y) coordinates, where the true relationship is y = x + 1. By specifying two parameters for the weight and bias, we propose the following probabilistic model:
```
y ~ N(wx + b, 0.5)
w ~ U(0, 2)
b ~ U(0, 2)
```
In AutoPPL, we can write the following code and sample from the posterior:
```cpp
std::array<double, 1000> w_storage;
std::array<double, 1000> b_storage;
ppl::Data<double> x {2.4, 3.1, 3.6, 4., 4.5, 5.};
ppl::Data<double> y {3.5, 4., 4.4, 5.01, 5.46, 6.1};
ppl::Param<double> w {w_storage.data()};
ppl::Param<double> b {b_storage.data()};
auto model = (
    w |= ppl::uniform(0., 2.),
    b |= ppl::uniform(0., 2.),
    y |= ppl::normal(x * w + b, 0.5)
);
size_t warmup = 1000;
size_t n_samples = 1000;
size_t n_adapt = 1000;
ppl::nuts(model, warmup, n_samples, n_adapt);
| James Yang | Jacob Austin | Jenny Chen | Lucie Le Blanc |
|---|---|---|---|
| github.com/JamesYang007 | github.com/jacobaustin123 | github.com/jenchen1398 | github.com/lucieleblanc |
Many thanks to the following third-party tools used in this project:
- Armadillo: matrix library used in the inference algorithms.
- Clang: one of the main compilers used.
- CMake: build system.
- Coveralls: tracks test coverage.
- Cpp Coveralls: tracks test coverage specifically for C++ code.
- FastAD: automatic differentiation library.
- GCC: one of the main compilers used.
- Google Benchmark: library used to benchmark algorithms.
- GoogleTest: unit/integration testing.
- Travis CI: continuous integration for Linux builds using GCC.
- Valgrind: checks for memory leaks and errors.