This repository contains the Python implementation for the paper "Statistical Inference of the Value Function for Reinforcement Learning in Infinite Horizon Settings".
Reinforcement learning is a general technique that allows an agent to learn an optimal policy while interacting with an environment in sequential decision-making problems. The goodness of a policy is measured by its value function starting from some initial state. The focus of this paper is to construct confidence intervals (CIs) for a policy's value in infinite horizon settings, where the number of decision points diverges to infinity. We provide inferential tools for:
1. the value under a fixed policy in off-policy settings (Section 3.1);
2. the value under an unknown optimal policy in off-policy settings (Section 3.2);
3. the value under an unknown optimal policy in on-policy settings (Section 4);
4. the difference between the value under an unknown optimal policy and that under the behavior policy (Appendix B.2).
For 1, we compare with the double reinforcement learning (DRL) method (Kallus and Uehara, 2019) and find that the proposed SAVE method achieves better finite-sample performance (see below) in settings where parametric-rate estimation of the value is feasible.
For 2–4, we allow nonregular settings in which the optimal policy is not unique. We apply our method to a dataset from mobile health applications and find that reinforcement learning algorithms could help improve patients' health status. See the figure below that depicts the CI of the value difference.
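To make the estimand concrete, below is a toy sketch (not the paper's SAVE estimator): a truncated Monte Carlo estimate of a fixed policy's discounted value, with a normal-approximation CI that reflects simulation error only. The gym-style `env`, the `policy` callable, and all default arguments are illustrative assumptions.

```python
# Toy illustration only (not the SAVE estimator): Monte Carlo estimate of a
# fixed policy's discounted value, with a normal-approximation CI.
# `env` is assumed to follow the pre-0.26 gym interface (step returns
# obs, reward, done, info); `policy` maps a state to an action; the horizon
# is truncated here, unlike the infinite-horizon setting of the paper.
import numpy as np
from scipy.stats import norm

def mc_value_ci(env, policy, gamma=0.9, horizon=200, n_rollouts=500, level=0.95):
    returns = []
    for _ in range(n_rollouts):
        state = env.reset()
        total, discount = 0.0, 1.0
        for _ in range(horizon):                  # truncate the infinite horizon
            state, reward, done, _ = env.step(policy(state))
            total += discount * reward
            discount *= gamma
            if done:
                break
        returns.append(total)
    returns = np.asarray(returns)
    est = returns.mean()
    se = returns.std(ddof=1) / np.sqrt(len(returns))
    z = norm.ppf(1 - (1 - level) / 2.0)           # e.g. 1.96 for a 95% CI
    return est, (est - z * se, est + z * se)
```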
Requirements:
- Python 3
- sklearn==0.21.2
- gym (for the cliffwalking experiment)
Repository layout:
- src
  - AGENT.py: main object for the SAVE model
  - utility.py: utility functions
  - exp.py: experiment for value inference for a fixed initial state under the behavior policy (not included in the paper)
  - exp_int.py: experiment for value inference for an integrated initial state under the behavior policy
  - exp_est_pol.py: experiment for value inference for a fixed initial state under the estimated policy
  - exp_est_pol_int.py: experiment for value inference for an integrated initial state under the estimated policy
- Ohio_data: source: http://smarthealth.cs.ohio.edu/OhioT1DM-dataset.html (a toy loading sketch follows this list)
- cliffwalking_exp: experiment on cliffwalking
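As a convenience, the sketch below shows one way the OhioT1DM CGM readings could be loaded, assuming the dataset's published XML layout (a `<glucose_level>` element containing `<event ts="..." value="..."/>` entries); the file name is hypothetical, and the preprocessing used in this repository may differ.

```python
# Minimal sketch (assumed XML layout): read CGM readings from one OhioT1DM file.
import xml.etree.ElementTree as ET

def load_glucose(xml_path):
    """Return (timestamp, glucose value) pairs from the <glucose_level> block."""
    root = ET.parse(xml_path).getroot()
    block = root.find("glucose_level")
    return [(e.get("ts"), float(e.get("value"))) for e in block.findall("event")]

# Hypothetical path; use the files actually shipped in Ohio_data.
# readings = load_glucose("Ohio_data/559-ws-training.xml")
```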
Usage: see test.ipynb; a minimal real-data example is

```python
from src.exp_est_pol import *

# Patient index 0, reward discount 0.5, initial state taken at time index 396
# (argument names as defined in src/exp_est_pol.py).
main_realdata(patient=0, reward_dicount=0.5, S_init_time=396)
```
The DRL comparison is modified from the original code in cliffwalking_exp/cw_notebook_ver_splitting.ipynb.
Our experiment is conducted in cliffwalking_exp/SAVE_cliffwalking.ipynb.
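For orientation, a minimal interaction sketch with a cliff-walking environment follows; it assumes the installed gym release registers "CliffWalking-v0" and uses the pre-0.26 step API. The notebooks above may construct the environment differently.

```python
# Minimal sketch: build gym's cliff-walking grid world and roll out one episode
# under a random behavior policy. Assumes "CliffWalking-v0" is registered in the
# installed gym release and that env.step returns (obs, reward, done, info).
import gym

env = gym.make("CliffWalking-v0")
print(env.observation_space.n, "states,", env.action_space.n, "actions")

state = env.reset()
transitions = []                                  # (state, action, reward, next_state)
for _ in range(500):                              # cap the rollout length
    action = env.action_space.sample()            # random behavior policy
    next_state, reward, done, _ = env.step(action)
    transitions.append((state, action, reward, next_state))
    state = next_state
    if done:
        break
print("steps:", len(transitions), "return:", sum(t[2] for t in transitions))
```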