Posterior Sampling for Reinforcement Learning

Replication of the results from the paper that introduces the Posterior Sampling for Reinforcement Learning (PSRL) algorithm:

Osband, I., Russo, D., & Van Roy, B. (2013). (More) efficient reinforcement learning via posterior sampling. Advances in Neural Information Processing Systems, 26.

The current codebase supports the following RL environments: riverswim, tworoom, and fourroom.

Installation

Create conda environment

cd psrl/
conda create --name psrl python=3.9
conda activate psrl

Install requirements

pip install -r requirements.txt
pip install -e .

Running experiments

To replicate all plots, first run the optimization process for each agent and environment:

python scripts/generate_data.py --config configs/riverswim_psrl.yaml --seed 0

This script produces two files, agent.pkl and trajectories.pkl, which store the trained parameters of the agent and the trajectories taken in the environment during the run. Use any of the configuration files in the configs folder to generate the data for a specific experiment.
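
As a minimal sketch of how these outputs could be inspected, the snippet below loads a pickle file and reports the type and length of the object inside. It assumes the files are ordinary `pickle` dumps, as the `.pkl` extension suggests. The `load_pickle` and `describe` helpers are hypothetical and not part of this codebase, and unpickling agent.pkl will likely require the `psrl` package to be installed so that the agent class can be imported.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load a single pickled object from `path` (hypothetical helper)."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


def describe(path):
    """Return (type name, length or None) for the object stored in a pickle file."""
    obj = load_pickle(path)
    size = len(obj) if hasattr(obj, "__len__") else None
    return type(obj).__name__, size
```

Calling `describe` on a trajectories.pkl file is a quick way to check that a run finished before plotting. Only unpickle files you produced yourself, since `pickle` can execute arbitrary code on load.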

The most straightforward way to obtain all the data needed for the plots is to run the following script:

. run_parallel.sh

which launches all combinations of environments (riverswim, tworoom, fourroom), agents (psrl, ucrl, kl_ucrl), and seeds (10 in total, starting at 0) using screen.
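
To make the size of this sweep concrete, the sketch below enumerates the same grid in Python and builds one `generate_data.py` command per run. It is not the contents of run_parallel.sh. The `configs/{env}_{stem}.yaml` naming pattern is inferred from the config files mentioned in this README, and the `ucrl` stem in particular is an assumption.

```python
from itertools import product

ENVS = ["riverswim", "tworoom", "fourroom"]
# Agent name -> config file stem. "psrl" and "klucrl" match config names
# that appear in this README; the "ucrl" stem is an assumption.
AGENT_STEMS = {"psrl": "psrl", "ucrl": "ucrl", "kl_ucrl": "klucrl"}
SEEDS = range(10)  # 10 seeds, starting at 0


def build_commands():
    """Return one generate_data.py command (as an argv list) per run."""
    commands = []
    for env, stem, seed in product(ENVS, AGENT_STEMS.values(), SEEDS):
        config = f"configs/{env}_{stem}.yaml"  # assumed naming pattern
        commands.append(
            ["python", "scripts/generate_data.py",
             "--config", config, "--seed", str(seed)]
        )
    return commands
```

With 3 environments, 3 agents, and 10 seeds, `build_commands()` returns 90 commands, so expect the full sweep to take a while even when runs execute in parallel.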

After all runs come to an end, you can obtain regret plots by running

python scripts/plot_regret.py --config configs/regret_riverswim.yaml

Switch between the following configs to obtain a regret plot for each environment:

configs/regret_riverswim.yaml
configs/regret_tworoom.yaml
configs/regret_fourroom.yaml
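
For readers unfamiliar with regret curves, here is a rough sketch of one common way to compute them: the running sum of the gap between the optimal per-step reward and the reward actually obtained, averaged over seeds. This is an illustration only and not necessarily what scripts/plot_regret.py does. Both function names are hypothetical, and `optimal_reward` (the expected per-step reward of an optimal policy) is assumed to be known.

```python
from itertools import accumulate


def cumulative_regret(rewards, optimal_reward):
    """Running sum of (optimal per-step reward - obtained reward)."""
    return list(accumulate(optimal_reward - r for r in rewards))


def mean_over_seeds(curves):
    """Average several equal-length regret curves point by point."""
    n = len(curves)
    return [sum(step) / n for step in zip(*curves)]
```

A curve that flattens over time indicates that the agent is approaching the optimal reward rate, while a curve that keeps growing linearly indicates that it is not.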

With configs/regret_riverswim.yaml you should expect the following plot

Likewise, with a single run you can obtain agent-specific plots for gridworld environments by running

python scripts/plot_agent.py --config configs/tworoom_klucrl.yaml

Choose the configuration that matches the run you want to inspect. The script produces the following plots:

Action-value function
Empirical state visitation
Empirical total reward
Expected reward
Policy
State-value function

For configs/tworoom_klucrl.yaml (after setting no_goal=False) you should expect the following:

[Figures: action-value function, empirical state visitations, empirical total reward, expected reward, policy, state-value function]

For configs/fourroom_klucrl.yaml (after setting no_goal=False) you should expect the following:

[Figures: action-value function, empirical state visitations, empirical total reward, expected reward, policy, state-value function]

Disclaimer

This project includes several other scripts that are undocumented. They were written for a research project that was left unfinished, so they do not directly relate to the original paper, and there is no guarantee that they produce meaningful results.

csalcedo001/psrl
