Priesemann-Group/covid19_inference

Add Stein variational gradient descent as a sampling option, for increased performance


In PyMC3, Stein variational gradient descent is already implemented, but it remains to be tested how well it works and how biased it is for a small number of particles. In addition, the optimal type of optimizer and learning rate have to be determined. Possibly the model could also be reparametrized to obtain a simpler posterior.
Reference for the bias: https://ferrine.github.io/blog/2017/06/04/asvgd-sanity-check/

This is a short, six-page summary of Stein Variational Gradient Descent:
http://www.cs.utexas.edu/~lqiang/PDF/svgd_aabi2016.pdf

This short paper gives a brief overview of the key idea of SVGD and outlines several directions for future work, including a new theoretical framework that interprets SVGD as a natural gradient descent of the KL divergence functional on a Riemannian-like metric structure on the space of distributions, and extensions of SVGD that allow us to train neural networks to draw approximate samples from given distributions and to develop new adaptive importance sampling methods without assuming parametric forms on the proposals.

In principle one can use the ASVGD inference; however, one might have to play with the priors a bit to help the algorithm converge. Also, it is not yet recommended for general use. What helped in my case to make it converge more robustly (a sketch follows the list below):

  • a mean of about 0.6 for the lambda_0 prior
  • a prior for I_0 of around 50
  • Adam with a learning rate of 0.02
  • at least 300 steps, which takes around 6 minutes in a Google Colab notebook
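
To make these settings concrete, here is a minimal sketch of what such a fit could look like in PyMC3. The toy model is only a stand-in for the actual covid19_inference model (the variable names lambda_0 and I_begin, the distributions, and the made-up data are assumptions for illustration); the ASVGD call with Adam at learning rate 0.02 and 300 steps follows the settings listed above.

```python
import numpy as np
import pymc3 as pm

# Hypothetical stand-in model; in the repository the model would be built
# by the covid19_inference model code instead.
new_cases = np.array([30, 40, 55, 70, 90, 120, 150])  # made-up data

with pm.Model() as model:
    # Priors tweaked as suggested above: lambda_0 mean around 0.6, I_0 around 50
    lambda_0 = pm.Lognormal("lambda_0", mu=np.log(0.6), sigma=0.5)
    I_begin = pm.HalfCauchy("I_begin", beta=50)

    t = np.arange(len(new_cases))
    expected = I_begin * pm.math.exp(lambda_0 * t)
    pm.Poisson("obs", mu=expected, observed=new_cases)

    # ASVGD fit: Adam with learning rate 0.02, at least 300 steps
    approx = pm.fit(
        n=300,
        method="asvgd",
        obj_optimizer=pm.adam(learning_rate=0.02),
    )
    trace = approx.sample(1000)  # draw approximate posterior samples
```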

If you know how to apply SVGD or a similar variational inference method to speed up computation of the posteriors, let us know here.

General comment: if you are hell-bent on using variational methods, consider looking at Pyro, since that framework was originally conceived for exactly that. SVGD exists there, but the implementation claims to be 'basic'.

https://pyro.ai/
http://pyro.ai/examples/svi_part_i.html
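
For reference, a minimal sketch of how Pyro's SVGD implementation is used, to the best of my understanding of its API. The toy model, variable names, and data below are made up for illustration and are not the project's model.

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.infer.svgd import SVGD, RBFSteinKernel
from pyro.optim import Adam


def model(obs):
    # Hypothetical 1-D stand-in: a single growth rate with a log-normal prior
    lambda_0 = pyro.sample("lambda_0", dist.LogNormal(torch.tensor(0.6).log(), 0.5))
    pyro.sample("obs", dist.Poisson(100.0 * lambda_0), obs=obs)


obs = torch.tensor(70.0)
svgd = SVGD(model, RBFSteinKernel(), Adam({"lr": 0.01}),
            num_particles=100, max_plate_nesting=0)
for _ in range(300):
    svgd.step(obs)

particles = svgd.get_named_particles()  # dict: site name -> tensor of particles
print(particles["lambda_0"].mean())
```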

Any particular reason why variational methods should be better? People use them to train Bayesian neural networks, where sampling becomes infeasible with that number of parameters. But given the few parameters of the current model, that should not be a problem.

We are not hell-bent on it. My thought was that variational methods could eventually be parallelized once we look at the level of Landkreise.
But I find this interesting in itself: do Monte Carlo methods intrinsically scale worse than variational methods with the number of parameters? My impression was that advanced methods like Hamiltonian MC scale pretty well with the number of parameters, but my knowledge is pretty superficial.

On this issue, no one is actively working at the moment. So if someone wants to have a look...
Things learned so far:

  • pymc3.ASVGD needs a very high temperature, around 2, to give a more or less correct posterior on the 1-dimensional problem (see the sketch after this list).
  • It takes time until ASVGD and SVGD converge, probably mainly because of the wide distribution of I_begin. Learning rates around 0.01 seem to work.
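
As an illustration of the first point, a sketch of such a 1-D sanity check in PyMC3, comparing the ASVGD approximation against NUTS. The toy model and data are made up, and it is an assumption that `temperature` is accepted as a keyword and forwarded to the KSD estimator in the PyMC3 version in use.

```python
import numpy as np
import pymc3 as pm

# 1-D toy check: does ASVGD reproduce the posterior that NUTS gives?
data = np.random.default_rng(0).normal(loc=1.0, scale=2.0, size=50)

with pm.Model() as toy:
    mu = pm.Normal("mu", mu=0.0, sigma=5.0)
    pm.Normal("obs", mu=mu, sigma=2.0, observed=data)

    # reference posterior from NUTS
    trace_nuts = pm.sample(2000, tune=1000)

    approx = pm.fit(
        n=3000,
        # temperature ~2 was needed here; assumed to be forwarded to the KSD estimator
        method=pm.ASVGD(temperature=2),
        obj_optimizer=pm.adam(learning_rate=0.01),
    )
    trace_asvgd = approx.sample(2000)

print("NUTS :", trace_nuts["mu"].mean(), trace_nuts["mu"].std())
print("ASVGD:", trace_asvgd["mu"].mean(), trace_asvgd["mu"].std())
```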

The next step would be to use pymc3.SVGD with about 100 particles and to apply it to example_bundeslaender, to see whether one gets to approximately good results faster. There one could also test whether Theano uses multiprocessing.
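
A rough sketch of what that could look like, assuming `model` is the hierarchical PyMC3 model built in the example_bundeslaender script; the variable name, the number of steps, and the convergence callback are assumptions for illustration.

```python
import pymc3 as pm

# `model` is assumed to be the hierarchical PyMC3 model built in
# example_bundeslaender; the name is a placeholder here.
with model:
    approx = pm.fit(
        n=10000,                                    # more steps than the 1-D toy problem
        method=pm.SVGD(n_particles=100, jitter=1),  # ~100 particles as suggested above
        obj_optimizer=pm.adam(learning_rate=0.01),
        callbacks=[pm.callbacks.CheckParametersConvergence(diff="absolute")],
    )
    trace = approx.sample(1000)

# Whether Theano parallelizes the computation can be checked by watching CPU
# usage while varying the OMP_NUM_THREADS environment variable before the run.
```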

So basically, how can people help, @jdehning?

A different question: how do we infer these rates for different countries?
Are they universal?