
P3O: Policy-on Policy-off Policy Optimization

On-policy reinforcement learning (RL) algorithms have high sample complexity, while off-policy algorithms are difficult to tune. Merging the two holds the promise of developing efficient algorithms that generalize across diverse environments. In practice, however, it is challenging to find suitable hyper-parameters that govern this trade-off. This paper develops a simple algorithm named P3O that interleaves off-policy updates with on-policy updates. P3O uses the effective sample size between the behavior policy and the target policy to control how far they can be from each other and does not introduce any additional hyper-parameters. Extensive experiments on the Atari-2600 and MuJoCo benchmark suites show that this simple technique is highly effective in reducing the sample complexity of state-of-the-art algorithms.
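As a rough illustration of the quantity that drives this interleaving (a minimal sketch, not code from this repository), the normalized effective sample size can be computed from per-sample importance ratios between the target policy and the behavior policy. The sketch below assumes NumPy arrays logp_target and logp_behavior holding the log-probabilities of the sampled actions under the two policies.

import numpy as np

def normalized_ess(logp_target, logp_behavior):
    # Per-sample importance ratios rho_i = pi_theta(a_i | s_i) / mu(a_i | s_i)
    rho = np.exp(logp_target - logp_behavior)
    n = rho.size
    # ESS = ||rho||_1^2 / ||rho||_2^2, divided by n so the value lies in (0, 1];
    # it equals 1 when the target and behavior policies agree on the sampled actions.
    return (rho.sum() ** 2) / (n * np.square(rho).sum())

A value close to 1 indicates the stored data is nearly on-policy, while a value close to 0 indicates the two policies have drifted apart; in the paper this quantity controls how far the target policy may move from the behavior policy without introducing an extra hyper-parameter.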

This repository provides the MXNet implementation of P3O: Policy-on Policy-off Policy Optimization. If you use this code, please cite the paper using the BibTeX reference below.

@inproceedings{fakoorp3o,
  author    = {Rasool Fakoor and
               Pratik Chaudhari and
               Alexander J. Smola},
  title     = {{P3O:} Policy-on Policy-off Policy Optimization},
  booktitle = {Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial
               Intelligence, {UAI} 2019},
  pages     = {371},
  year      = {2019},
  crossref  = {DBLP:conf/uai/2019},
}

Getting Started

On Ubuntu 16.04, run the following script to install the environment dependencies:

sudo apt-get install -y libsm6 libxrender1 libfontconfig1

wget https://repo.continuum.io/archive/Anaconda3-2018.12-Linux-x86_64.sh && \
bash Anaconda3-2018.12-Linux-x86_64.sh && \
source ~/.bashrc && conda update -y conda && conda update -y anaconda

conda create -n gluonrl python=3.7.1 anaconda && conda activate gluonrl && \
conda install -y -n gluonrl -c conda-forge pyhamcrest

pip install gym && conda install -y -n gluonrl -c ska pygtk && \
pip install pyopengl opencv-python gym[atari] mxnet

Set the following environmental variables for reproducible experiments:

export MXNET_CUDNN_AUTOTUNE_DEFAULT=0
export MXNET_ENFORCE_DETERMINISM=1
export OMP_NUM_THREADS=1

Usage

python -u main.py --use_linear_lr_decay --use_ess_is_clipping --frames_waits 15000 --sample_mult 6 --num_steps 16 --num_env 16 --save_freq 500 --log_interval 40 --replay_ratio 2 --replay_size 50000 --log_id log_0 --ent_coef 0.01 --seed 0 --env=BreakoutNoFrameskip-v4 --alg_name p3o --use_gae 

The --env argument can be set to any of the 49 Atari games. The code runs on either GPU or CPU machines. For the experiments in the paper, we used c5.18xlarge instances.
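For example, to train on a different game, only the --env argument needs to change (a sketch assuming the remaining flags above carry over unchanged; see the appendix for the full hyper-parameter list):

python -u main.py --use_linear_lr_decay --use_ess_is_clipping --frames_waits 15000 --sample_mult 6 --num_steps 16 --num_env 16 --save_freq 500 --log_interval 40 --replay_ratio 2 --replay_size 50000 --log_id log_0 --ent_coef 0.01 --seed 0 --env=PongNoFrameskip-v4 --alg_name p3o --use_gae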

For the complete list of hyper-parameters, please refer to the appendix of the paper.

Contact

Please open an issue on the issue tracker to report problems or ask questions, or send an email to Rasool Fakoor (my first name followed by a dot followed by my last name at mavs dot uta dot edu).

Acknowledgement

  • Special thanks to Hang Zhang and Tong He for their help and tireless efforts with the MXNet implementation.
  • Vectorized environment generation for Atari and MuJoCo, environment wrappers, monitoring, logging, etc. are based on or copied from OpenAI Baselines. The p3o/oailibs directory contains the code adapted from OpenAI Baselines.