This is a set of tools for evaluating Bayesian models, together with benchmark implementations and results.
Motivations:
- There is a lack of standardized tasks that meaningfully assess the quality of uncertainty quantification for Bayesian black-box models.
- Variations between tasks in the literature make a direct comparison between methods difficult.
- Implementing competing methods takes considerable effort, and there is little incentive to do a good job.
- Published papers may not always provide complete details of implementations due to space considerations.
Aims:
- Curate a set of benchmarks that meaningfully compare the efficacy of Bayesian models in real-world tasks.
- Maintain a fair assessment of benchmark methods, with full implementations and results.
Tasks:
- Classification and regression
- Density estimation (real-world and synthetic) (TODO)
- Active learning
- Adversarial robustness (TODO)
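
As an illustration of what assessing uncertainty quality can mean for the regression task, here is a minimal sketch that scores Gaussian predictions by average test log-likelihood alongside RMSE. The function names and the toy numbers are assumptions made for this example and are not taken from this repository's code.

```python
import math


def gaussian_log_density(y, mean, var):
    """Log density of y under a Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)


def mean_test_log_likelihood(ys, means, variances):
    """Average per-point predictive log density over a test set."""
    if not (len(ys) == len(means) == len(variances)) or len(ys) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(gaussian_log_density(y, m, v) for y, m, v in zip(ys, means, variances))
    return total / len(ys)


def rmse(ys, means):
    """Root mean squared error of the predictive means."""
    return math.sqrt(sum((y - m) ** 2 for y, m in zip(ys, means)) / len(ys))


# Toy data (hypothetical): both "models" share the same predictive means,
# so RMSE cannot tell them apart, but their predictive variances differ.
ys = [0.0, 1.0, 2.0]
means = [0.5, 0.5, 2.5]
calibrated = mean_test_log_likelihood(ys, means, [0.25, 0.25, 0.25])
overconfident = mean_test_log_likelihood(ys, means, [0.01, 0.01, 0.01])

print("RMSE:", rmse(ys, means))
print("calibrated log-likelihood:", calibrated)
print("overconfident log-likelihood:", overconfident)
```

The two sets of predictions have identical RMSE, yet the overconfident one receives a much lower log-likelihood. This is the kind of difference that a point-error metric alone would miss.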
Current implementations:
- Sparse variational GP, for Gaussian and non-Gaussian likelihoods
- Sparse variational GP, with minibatches
- 2 layer Deep Gaussian process, with doubly-stochastic variational inference
- A variety of sklearn models
Coming soon: