Portfolio Optimization with Deep Bayesian Bandits
AI Portfolio Manager - optimizing asset allocation by means of reinforcement learning.
Implementation of Linear Full Posterior Bandits for portfolio optimization.
The approach follows the paper Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling, published at ICLR 2018.
@article{riquelme2018deep,
  title={Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling},
  author={Riquelme, Carlos and Tucker, George and Snoek, Jasper},
  journal={International Conference on Learning Representations, ICLR},
  year={2018}
}
Installation
WIP
Usage
WIP
Contextual Bandits
Contextual bandits are a rich decision-making framework in which an algorithm has to choose among a set of k actions at every time step t, after observing a context (or side information) denoted by X_t. The general pseudocode for the process, when using algorithm A, is as follows:
At time t = 1, ..., T:
1. Observe new context: X_t
2. Choose action: a_t = A.action(X_t)
3. Observe reward: r_t
4. Update internal state of the algorithm: A.update((X_t, a_t, r_t))
The goal is to maximize the total sum of rewards: ∑_t r_t
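For concreteness, the loop above might look as follows in Python. This is only a sketch: the run_contextual_bandit helper, the action/update interface, and the reward_fn callback are illustrative assumptions, not this repository's actual API.

```python
def run_contextual_bandit(algorithm, contexts, reward_fn):
    """Generic contextual bandit loop (illustrative sketch, not the repo's API).

    `algorithm` is assumed to expose `action(context)` and
    `update(context, action, reward)`; `contexts` is an iterable of context
    vectors and `reward_fn(t, action)` returns the observed reward.
    """
    total_reward = 0.0
    for t, context in enumerate(contexts):
        action = algorithm.action(context)          # a_t = A.action(X_t)
        reward = reward_fn(t, action)               # observe r_t
        algorithm.update(context, action, reward)   # A.update((X_t, a_t, r_t))
        total_reward += reward                      # accumulate ∑_t r_t
    return total_reward
```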
Thompson Sampling
Thompson Sampling is a meta-algorithm that chooses an action for the contextual bandit in a statistically efficient manner, simultaneously finding the best arm while attempting to incur low cost. Informally speaking, we assume the expected reward is given by some function E[r_t | X_t, a_t] = f(X_t, a_t). Unfortunately, f is unknown, as otherwise we could simply choose the action with the highest expected value: a_t^* = arg max_i f(X_t, a_i).
The idea behind Thompson Sampling is to keep a posterior distribution π_t over functions in some family f ∈ F after observing the first t-1 datapoints. Then, at time t, we sample one potential explanation of the underlying process, f_t ∼ π_t, and act optimally (i.e., greedily) according to f_t. In other words, we choose a_t = arg max_i f_t(X_t, a_i). Finally, we update our posterior distribution with the newly collected datapoint (X_t, a_t, r_t).
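A minimal skeleton of this sample-then-act-greedily procedure is sketched below. The ThompsonSamplingAgent class and the posterior interface (a sample() method returning a function from context to per-action predicted rewards, plus an update() method) are assumptions made for illustration, not names from this repository.

```python
import numpy as np

class ThompsonSamplingAgent:
    """Illustrative Thompson Sampling skeleton (not this repository's class)."""

    def __init__(self, posterior):
        # `posterior` is assumed to support `sample()`, which draws f_t ~ pi_t
        # as a callable mapping a context to predicted rewards for every
        # action, and `update(context, action, reward)` to add new data.
        self.posterior = posterior

    def action(self, context):
        f_t = self.posterior.sample()           # sample f_t ~ pi_t
        return int(np.argmax(f_t(context)))     # a_t = arg max_i f_t(X_t, a_i)

    def update(self, context, action, reward):
        # Incorporate the new datapoint (X_t, a_t, r_t) into the posterior.
        self.posterior.update(context, action, reward)
```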
The main issue is that keeping an updated posterior π_t (or even sampling from it) is often intractable for highly parameterized models like deep neural networks. The algorithms we list in the next section provide tractable approximations that can be used in combination with Thompson Sampling to solve the contextual bandit problem.
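One case where the posterior stays exactly tractable is a Bayesian linear-regression model per action, which is the spirit of the Linear Full Posterior approach mentioned above. The sketch below is a simplified illustration under assumed choices: unit observation noise and a Gaussian prior with precision prior_lambda (the paper additionally places an inverse-gamma prior on the noise); the class and parameter names are hypothetical.

```python
import numpy as np

class LinearFullPosteriorSampling:
    """Sketch of Thompson Sampling with an exact linear-Gaussian posterior
    per action (simplified; not this repository's implementation)."""

    def __init__(self, num_actions, context_dim, prior_lambda=1.0):
        self.num_actions = num_actions
        # Sufficient statistics per action: precision matrix and X^T y vector.
        self.precision = [prior_lambda * np.eye(context_dim)
                          for _ in range(num_actions)]
        self.xty = [np.zeros(context_dim) for _ in range(num_actions)]

    def action(self, context):
        sampled_rewards = []
        for a in range(self.num_actions):
            cov = np.linalg.inv(self.precision[a])        # posterior covariance
            mean = cov @ self.xty[a]                      # posterior mean of weights
            w = np.random.multivariate_normal(mean, cov)  # draw w_a from the posterior
            sampled_rewards.append(context @ w)           # predicted reward for arm a
        return int(np.argmax(sampled_rewards))            # act greedily on the sample

    def update(self, context, action, reward):
        # Rank-one update of the sufficient statistics with (X_t, a_t, r_t).
        self.precision[action] += np.outer(context, context)
        self.xty[action] += reward * context
```

Because this agent exposes the same action/update interface, it would plug directly into the bandit loop sketched earlier.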