manoharan-lab/holopy

Make new API for HoloPy (version 4)

vnmanoharan opened this issue · 1 comment

Proposal

HoloPy version 4 should rely on pymc for inference.

Rationale

When we first included Bayesian inference in HoloPy, the options for efficient gradient-free samplers were limited, and we chose emcee because it was simple to implement and worked well on many of our inference problems. However, we had to write a lot of code to handle priors, likelihoods, and models, and that code is difficult to maintain and to adapt to other kinds of samplers.

pymc has a nice model-specification syntax that includes a variety of distributions that can be used for priors or likelihoods, and it now includes several gradient-free samplers. Previously it had only a Metropolis sampler, a Gibbs sampler, and a NUTS sampler, but we couldn't take advantage of NUTS because we don't have analytic gradients for most of our scattering modules, and the Fortran codes are not auto-differentiable in Python. Metropolis and Gibbs are too inefficient for our purposes. But pymc now has a Differential Evolution Metropolis sampler that, like emcee, is ensemble-based (it uses multiple walkers, though its MC moves differ from those of the affine-invariant ensemble sampler used in emcee). pymc also has a Sequential Monte Carlo sampler that is tempered, making it useful for multi-modal posteriors and for calculating marginal likelihoods and Bayes factors for model comparison. These two samplers might suffice for most of our use cases. They seem to work reasonably well in moderate-dimensional spaces (10 parameters or so), which covers many of our models.

In addition, we would get a full model-specification syntax similar to that used in other modern probabilistic programming packages, like Stan and NumPyro. We would no longer have to maintain lots of code to handle objects and methods for priors and models.
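To make this concrete, here is a minimal sketch of what the specification syntax and the two samplers might look like for one of our problems. The parameter names, priors, and toy likelihood are placeholders for illustration, not HoloPy or final API:

```python
import numpy as np
import pymc as pm

# Toy stand-in for hologram data; a real workflow would load an image
data = np.random.default_rng(0).normal(1.27, 0.05, size=500)

with pm.Model() as model:
    # Priors on (hypothetical) scatterer parameters
    radius = pm.Uniform("radius", lower=0.3, upper=1.5)  # microns
    index = pm.Normal("index", mu=1.59, sigma=0.02)      # refractive index
    noise = pm.HalfNormal("noise", sigma=0.1)

    # Toy likelihood; the real one would compare a computed hologram to
    # the data through the black-box Op discussed under "Implementation
    # challenges" below
    pm.Normal("obs", mu=radius * index, sigma=noise, observed=data)

    # Differential Evolution Metropolis: gradient-free and population-based,
    # so run more chains than free parameters
    idata = pm.sample(step=pm.DEMetropolis(), chains=8, draws=2000)

    # Tempered Sequential Monte Carlo, which also yields the log marginal
    # likelihood needed for Bayes factors
    idata_smc = pm.sample_smc(draws=2000)
```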

Downsides

This change will break all existing workflows, and there's no guarantee that the new samplers will work well on our problems. Also, a lot of code will have to be removed, refactored, and (re)written.

Implementation challenges

  • The main challenge is that pymc models are based on PyTensor, which allows for auto-differentiation. Even though we won't make use of auto-differentiation, we still need to define our model in a way that PyTensor can deal with. There is a nice guide on how to do this for "black-box" likelihoods (models, like ours, that cannot be inspected and differentiated by PyTensor). It involves creating a custom PyTensor Op that runs the black-box code (in our case, the forward model for hologram formation) and returns the log-likelihood; see the first sketch after this list. Ideally we would have a function that outputs a pymc model, with the appropriate Op set up to calculate the log-likelihood for a forward model defined in HoloPy. The user could then augment the model with priors and run samplers using the usual pymc syntax.
  • Optimization is a separate problem from sampling. We have lots of code in HoloPy that attempts to find good initial guesses for sampling, or tries to find the maximum of the posterior to provide a point estimate for parameters. We use CMA-ES and scipy.optimize, for example. I think these algorithms can still be used, but it will require connecting them to the logp (log of the posterior probability) output from a pymc model. There appears to be a way to calculate the logp after the model is specified; see the second sketch after this list. I'm not sure how much overhead this will add to the optimization, but in almost all cases our cost is dominated by the scattering calculation, not the optimizer. Note that pm.find_MAP() can also be used and has several options for optimizers.
  • Parallelization may be tricky. Here I'm not talking about parallelizing MC chains, but fitting models to multiple frames in a time series in parallel (on, for example, a computing cluster). In the past, we've had many issues with making sure objects can be serialized for multiprocessing, and I don't know what will need to be serialized in order to run multiple pymc samplers in parallel.
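On the first point, here is a rough sketch of such a black-box log-likelihood Op, loosely following the pymc guide. Here `forward_model` stands in for a HoloPy scattering calculation and is not real HoloPy API:

```python
import numpy as np
import pymc as pm
import pytensor.tensor as pt
from pytensor.graph.op import Op

data = np.ones(100)  # placeholder for a recorded hologram
noise_sd = 0.1       # assumed known noise level

def forward_model(theta):
    # Stand-in for HoloPy's forward model for hologram formation;
    # takes a parameter vector and returns a simulated hologram
    radius, index = theta
    return radius * index * np.ones_like(data)

class HologramLogLike(Op):
    """PyTensor Op that wraps the black-box forward model.

    PyTensor only ever calls perform(); no gradient is defined,
    which is fine for gradient-free samplers.
    """
    itypes = [pt.dvector]  # input: parameter vector
    otypes = [pt.dscalar]  # output: scalar log-likelihood

    def perform(self, node, inputs, outputs):
        (theta,) = inputs
        residual = data - forward_model(theta)
        outputs[0][0] = np.array(-0.5 * np.sum((residual / noise_sd) ** 2))

loglike = HologramLogLike()

with pm.Model() as model:
    radius = pm.Uniform("radius", lower=0.3, upper=1.5)
    index = pm.Normal("index", mu=1.59, sigma=0.02)
    theta = pt.as_tensor_variable([radius, index])
    # Potential adds an arbitrary term to the model's log-probability
    pm.Potential("likelihood", loglike(theta))
```

A HoloPy function could build and return such a model (or just the Op) from a scatterer definition and imaging metadata, leaving the priors and the choice of sampler to the user.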

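On the second point, here is a sketch of how an external optimizer could consume the model's logp via compile_logp(). One wrinkle: pymc works in transformed space by default, so this sketch disables the transform on the bounded prior to keep the array-to-point mapping simple; a production version would handle transforms properly.

```python
import numpy as np
import pymc as pm
from scipy import optimize

with pm.Model() as model:
    # transform=None keeps the value variable in its natural space, so we
    # can pass plain parameter values to the compiled logp below
    radius = pm.Uniform("radius", lower=0.3, upper=1.5, transform=None)
    index = pm.Normal("index", mu=1.59, sigma=0.02)
    # ... the black-box likelihood Op from the sketch above goes here ...

# Compiled function mapping a point dict to the log posterior density
logp_fn = model.compile_logp()

def neg_logp(x):
    return -logp_fn({"radius": x[0], "index": x[1]})

# Any external optimizer (scipy.optimize, CMA-ES, ...) can drive neg_logp
result = optimize.minimize(neg_logp, x0=np.array([0.8, 1.59]),
                           method="Nelder-Mead")

# pymc's built-in MAP finder, which wraps scipy optimizers
map_estimate = pm.find_MAP(model=model)
```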
Comments are welcome.

Regarding the last point (parallelization), a new package called UM-Bridge has been released that creates a bridge between numerical forward models (like ours) and pymc. It enables the use of pymc on computing clusters. This type of bridge might be the key to completely separating forward modeling (what HoloPy does) from sampling (what pymc and other packages do). A minimal client sketch follows the links below.

Documentation: https://um-bridge-benchmarks.readthedocs.io/en/docs/
Use in HPC/slurm: https://um-bridge-benchmarks.readthedocs.io/en/docs/umbridge/hpc.html
Example of use with pymc: https://um-bridge-benchmarks.readthedocs.io/en/docs/umbridge/clients.html#pymc-client
Announcement on pymc discourse: https://discourse.pymc.io/t/new-pymcon-talk-enabling-uncertainty-quantification-by-anne-reinarz-linus-seelinger/13583
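
For what it's worth, the client side of UM-Bridge looks quite small. This is a hedged sketch based on the client docs above, with the URL and model name as placeholder assumptions (the pymc-client page linked above shows how to wrap such a model for pymc):

```python
import numpy as np
import umbridge

# Connect to an UM-Bridge server that would expose the HoloPy forward
# model; URL and model name here are placeholders
model = umbridge.HTTPModel("http://localhost:4242", "forward")

# UM-Bridge models take and return lists of lists (one list per vector),
# so a single parameter vector goes in as [[...]]
hologram = np.array(model([[0.8, 1.59]])[0])
```

The sampler process would then need only this lightweight client and a network connection to the model server, which could sidestep the serialization issues mentioned above.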