
DEHRL

This repository provides the implementation of the domains, DEHRL, and the results visualization for the paper Diversity-Driven Extensible Hierarchical Reinforcement Learning.

The paper was published in the Proceedings of the 33rd National Conference on Artificial Intelligence (AAAI 2019) as an oral presentation. The work was conducted by the Intelligent Systems Lab in the Department of Computer Science at the University of Oxford.

Specifically, this repository includes simple guidelines for setting up the environment, running DEHRL on each domain, and visualizing the results.

If you find our code or paper useful for your research, please cite:

@inproceedings{SWLXX-AAAI-2019,
  title = "Diversity-Driven Extensible Hierarchical Reinforcement Learning",
  author = "Yuhang Song and Jianyi Wang and Thomas Lukasiewicz and Zhenghua Xu and Mai Xu",
  booktitle = "Proceedings of the 33rd National Conference on Artificial Intelligence (AAAI 2019)",
  year = "2019",
  month = "January",
  publisher = "AAAI Press",
}

Note that the project is based on an RL implementation by Ilya Kostrikov; we thank him a lot for his contribution to the community.

Domains

All domains share the same interface as OpenAI Gym; see overcooked.py, minecraft.py, and gridworld.py for more details. Note that MineCraft is based on the implementation by Michael Fogleman; we thank him a lot for his contribution to the community. Also be aware that MineCraft only supports single-thread training and requires an X server.
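
As a minimal sketch of that shared interface (the construction of the environment is left abstract here, since the actual class names and constructor arguments live in the files above), a random rollout looks like the following:

# Minimal sketch of the shared Gym-style interface. `env` stands for any of the
# domains above; see overcooked.py / minecraft.py / gridworld.py for how to
# construct them.
def random_rollout(env, num_steps=100):
    obs = env.reset()
    for _ in range(num_steps):
        action = env.action_space.sample()           # Gym-style action space
        obs, reward, done, info = env.step(action)   # Gym-style step signature
        if done:
            obs = env.reset()
    return obs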

Set up the environment and run DEHRL

To install the requirements, follow these steps:

# create env
conda create -n dehrl python=3.6.7 -y

# activate the env
source ~/.bashrc
source activate dehrl

pip install --upgrade torch torchvision
pip install --upgrade tensorflow
pip install opencv-contrib-python
conda install scikit-image -y
pip install --upgrade imutils
pip install gym[atari]

mkdir dehrl
cd dehrl
git clone https://github.com/YuhangSong/DEHRL.git
mkdir results
cd DEHRL
pip install -r requirements.txt
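
Optionally, as a quick sanity check that the main dependencies installed correctly (this is just a convenience snippet, not part of the repository), the following imports should succeed inside the dehrl environment:

# optional sanity check: these imports should all succeed inside the dehrl env
import cv2
import gym
import imutils
import skimage
import tensorflow as tf
import torch
import torchvision

print("torch:", torch.__version__, "| gym:", gym.__version__, "| CUDA available:", torch.cuda.is_available())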

Run the following commands to enter the virtual environment before proceeding to the commands below:

source ~/.bashrc
source activate dehrl

Run OverCooked

The domain is loosely inspired by the game of the same name (though it is significantly different and implemented entirely by ourselves). We thank the creators of the original game a lot for the inspiration.

Level 1

Goal-type: any

CUDA_VISIBLE_DEVICES=0 python main.py --algo ppo --use-gae --lr 2.5e-4 --clip-param 0.1 --value-loss-coef 1 --num-processes 8 --actor-critic-mini-batch-size 256 --actor-critic-epoch 4 --exp code_release --obs-type 'image' --env-name "OverCooked" --reward-level 1 --setup-goal any --new-overcooked --num-hierarchy 2 --num-subpolicy 5 --hierarchy-interval 4 --num-steps 128 128 --reward-bounty 0.1875 --distance mass_center --transition-model-mini-batch-size 64 --train-mode together --clip-reward-bounty --clip-reward-bounty-active-function linear --summarize-behavior-interval 5 --aux r_0

Goal-type: fix

CUDA_VISIBLE_DEVICES=0 python main.py --algo ppo --use-gae --lr 2.5e-4 --clip-param 0.1 --value-loss-coef 1 --num-processes 8 --actor-critic-mini-batch-size 256 --actor-critic-epoch 4 --exp code_release --obs-type 'image' --env-name "OverCooked" --reward-level 1 --setup-goal fix --new-overcooked --num-hierarchy 2 --num-subpolicy 5 --hierarchy-interval 4 --num-steps 128 128 --reward-bounty 0.1875 --distance mass_center --transition-model-mini-batch-size 64 --train-mode together --clip-reward-bounty --clip-reward-bounty-active-function linear --summarize-behavior-interval 5 --aux r_0

Goal-type: random

CUDA_VISIBLE_DEVICES=0 python main.py --algo ppo --use-gae --lr 2.5e-4 --clip-param 0.1 --value-loss-coef 1 --num-processes 8 --actor-critic-mini-batch-size 256 --actor-critic-epoch 4 --exp code_release --obs-type 'image' --env-name "OverCooked" --reward-level 1 --setup-goal random --new-overcooked --num-hierarchy 2 --num-subpolicy 5 --hierarchy-interval 4 --num-steps 128 128 --reward-bounty 0.1875 --distance mass_center --transition-model-mini-batch-size 64 --train-mode together --clip-reward-bounty --clip-reward-bounty-active-function linear --summarize-behavior-interval 5 --aux r_0

Level 2

Goal-type: any

CUDA_VISIBLE_DEVICES=0 python main.py --algo ppo --use-gae --lr 2.5e-4 --clip-param 0.1 --value-loss-coef 1 --num-processes 8 --actor-critic-mini-batch-size 256 --actor-critic-epoch 4 --exp code_release --obs-type 'image' --env-name "OverCooked" --reward-level 2 --setup-goal any --new-overcooked --num-hierarchy 3 --num-subpolicy 5 5 --hierarchy-interval 4 12 --num-steps 128 128 128 --reward-bounty 1 --distance mass_center --transition-model-mini-batch-size 64 64 --train-mode together --clip-reward-bounty --clip-reward-bounty-active-function linear --summarize-behavior-interval 5 --aux r_0

Goal-type: fix

CUDA_VISIBLE_DEVICES=0 python main.py --algo ppo --use-gae --lr 2.5e-4 --clip-param 0.1 --value-loss-coef 1 --num-processes 8 --actor-critic-mini-batch-size 256 --actor-critic-epoch 4 --exp code_release --obs-type 'image' --env-name "OverCooked" --reward-level 2 --setup-goal fix --new-overcooked --num-hierarchy 3 --num-subpolicy 5 5 --hierarchy-interval 4 12 --num-steps 128 128 128 --reward-bounty 1 --distance mass_center --transition-model-mini-batch-size 64 64 --train-mode together --clip-reward-bounty --clip-reward-bounty-active-function linear --summarize-behavior-interval 5 --aux r_0

Goal-type: random

CUDA_VISIBLE_DEVICES=0 python main.py --algo ppo --use-gae --lr 2.5e-4 --clip-param 0.1 --value-loss-coef 1 --num-processes 8 --actor-critic-mini-batch-size 256 --actor-critic-epoch 4 --exp code_release --obs-type 'image' --env-name "OverCooked" --reward-level 2 --setup-goal random --new-overcooked --num-hierarchy 3 --num-subpolicy 5 5 --hierarchy-interval 4 12 --num-steps 128 128 128 --reward-bounty 1 --distance mass_center --transition-model-mini-batch-size 64 64 --train-mode together --clip-reward-bounty --clip-reward-bounty-active-function linear --summarize-behavior-interval 5 --aux r_0

The following curves were produced at commit 7ea3aec9eabcd421d5660042d3e50333454f928e.

[Result curves: Level 1 (Any / Fix / Random); Level 2 (Any / Fix / Random)]

If you cannot reproduce the above curves, check out the above commit and run the commands again. In the meantime, please open an issue immediately, since this likely means we introduced unintentional changes to the code while adding more domains.

Reload Model and Data from a Checkpoint

The code automatically checks for and reloads everything from the log dir you set, and everything (model, curves, videos, etc.) is saved every 10 minutes. Thus, feel free to continue your experiment from where it stopped by simply running the same command again.
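
For reference only, the general pattern behind this kind of auto-resume looks like the sketch below; the file name and saved fields here are hypothetical and are not the ones main.py actually uses.

# Illustrative checkpoint pattern only (hypothetical file name and fields),
# not the repository's actual implementation: periodically save the training
# state into the log dir, and reload it on start-up if a checkpoint exists.
import os
import torch

def save_checkpoint(state, log_dir, name="checkpoint.pt"):
    torch.save(state, os.path.join(log_dir, name))

def try_resume(log_dir, name="checkpoint.pt"):
    path = os.path.join(log_dir, name)
    return torch.load(path) if os.path.exists(path) else None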

Run MineCraft

CUDA_VISIBLE_DEVICES=0 python main.py --algo ppo --use-gae --lr 2.5e-4 --clip-param 0.1 --value-loss-coef 1 --num-processes 1 --actor-critic-mini-batch-size 256 --actor-critic-epoch 4 --exp code_release --obs-type 'image' --env-name "MineCraft" --num-hierarchy 4 --num-subpolicy 8 8 8 --hierarchy-interval 4 4 4 --num-steps 128 128 128 128 --reward-bounty 1 --distance l1 --transition-model-mini-batch-size 64 64 64 --train-mode together --clip-reward-bounty --clip-reward-bounty-active-function linear --summarize-behavior-interval 10 --aux r_0

We also save the world built by the agent. Set

window.loadWorld('<path-to-the-save-file>.sav')

to the path of the .sav file and run

python replay.py

to get a third-person view of the built world.

Run Atari

Game: MontezumaRevengeNoFrameskip-v4

The agent can reach the key when the skull is not present; a video is here.

CUDA_VISIBLE_DEVICES=0 python main.py --algo ppo --use-gae --lr 2.5e-4 --clip-param 0.1 --value-loss-coef 1 --num-processes 8 --actor-critic-mini-batch-size 256 --actor-critic-epoch 4 --exp code_release --obs-type 'image' --env-name "MontezumaRevengeNoFrameskip-v4" --num-hierarchy 3 --num-subpolicy 5 5 --hierarchy-interval 8 4 --num-steps 128 128 128 --reward-bounty 1 --distance mass_center --transition-model-mini-batch-size 64 64 --train-mode together --clip-reward-bounty --clip-reward-bounty-active-function linear --summarize-behavior-interval 5 --aux r_0

However, because the skull significantly affects the mass centre of the state, our method fails when the skull is moving. A necessary solution is to generate a mask that locates the part of the state controlled by the agent. Fortunately, a recent paper under review at ICLR 2019 solves this problem. Since their code has not been released yet, we implemented their idea to generate the mask.
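
As a rough illustration of that idea (this is only a conceptual sketch, not the ICLR submission's exact architecture nor the implementation shipped in this repository), an inverse model can be trained to predict the executed action from two consecutive frames through a spatial attention map; the cells that help predict the action tend to be the ones controlled by the agent, so the attention map can serve as the mask:

# Conceptual sketch of an attention-based inverse model for mask estimation
# (hypothetical architecture and sizes, for illustration only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class InverseMask(nn.Module):
    def __init__(self, in_channels=3, num_actions=5, hidden=32):
        super().__init__()
        # shared convolutional encoder applied to each frame
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.attention = nn.Conv2d(2 * hidden, 1, 1)        # per-cell attention logits
        self.action_head = nn.Linear(2 * hidden, num_actions)

    def forward(self, obs_t, obs_tp1):
        f = torch.cat([self.encoder(obs_t), self.encoder(obs_tp1)], dim=1)
        logits = self.attention(f)                           # (B, 1, H, W)
        mask = torch.softmax(logits.flatten(1), dim=1).view_as(logits)
        pooled = (f * mask).sum(dim=(2, 3))                  # attention-weighted features
        return self.action_head(pooled), mask.squeeze(1)     # action logits, spatial mask

# training signal: cross-entropy between predicted and actually executed actions
model = InverseMask()
obs_t, obs_tp1 = torch.rand(8, 3, 84, 84), torch.rand(8, 3, 84, 84)
action_logits, mask = model(obs_t, obs_tp1)
loss = F.cross_entropy(action_logits, torch.randint(0, 5, (8,)))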

However, their idea works well only on simple domains like GridWorld. Run the following command to see the learnt masks on GridWorld:

CUDA_VISIBLE_DEVICES=0 python main.py --algo ppo --use-gae --lr 2.5e-4 --clip-param 0.1 --value-loss-coef 1 --num-processes 8 --actor-critic-mini-batch-size 256 --actor-critic-epoch 4 --exp code_release --obs-type 'image' --env-name "GridWorld" --num-hierarchy 2 --num-subpolicy 5 --hierarchy-interval 4 --num-steps 128 128 --reward-bounty 1 --distance mass_center --transition-model-mini-batch-size 64 --inverse-mask --num-grid 7 --train-mode together --clip-reward-bounty --clip-reward-bounty-active-function linear --summarize-behavior-interval 5 --aux r_0

The following video was produced at commit 04baecee316234a7f3fdcd51d1908c971211fdce.

If you cannot reproduce the above result, check out the above commit and run the command again. In the meantime, please open an issue immediately, since this likely means we introduced unintentional changes to the code while adding more domains.

However, it performs poorly on MontezumaRevenge. We are waiting for their official release to further advance DEHRL on domains with uncontrollable objects.

Continuous Control

Explore2DContinuous

Number of subpolicies: 4

CUDA_VISIBLE_DEVICES=0 python main.py --algo ppo --use-gae --lr 3e-4 --clip-param 0.1 --actor-critic-epoch 10 --entropy-coef 0 --value-loss-coef 1 --gamma 0.99 --tau 0.95 --num-processes 8 --actor-critic-mini-batch-size 256 --actor-critic-epoch 4 --exp code_release --obs-type 'image' --env-name "Explore2DContinuous" --episode-length-limit 32 --num-hierarchy 2 --num-subpolicy 4 --hierarchy-interval 4 --num-steps 128 128 --reward-bounty 1 --distance l2 --transition-model-mini-batch-size 64 --train-mode together --clip-reward-bounty --clip-reward-bounty-active-function linear --summarize-behavior-interval 5 --aux r_0 --log-behavior

Number of subpolicies: 8

CUDA_VISIBLE_DEVICES=0 python main.py --algo ppo --use-gae --lr 3e-4 --clip-param 0.1 --actor-critic-epoch 10 --entropy-coef 0 --value-loss-coef 1 --gamma 0.99 --tau 0.95 --num-processes 8 --actor-critic-mini-batch-size 256 --actor-critic-epoch 4 --exp code_release --obs-type 'image' --env-name "Explore2DContinuous" --episode-length-limit 32 --num-hierarchy 2 --num-subpolicy 8 --hierarchy-interval 4 --num-steps 128 128 --reward-bounty 1 --distance l2 --transition-model-mini-batch-size 64 --train-mode together --clip-reward-bounty --clip-reward-bounty-active-function linear --summarize-behavior-interval 5 --aux r_0 --log-behavior
[Resulting states with 4 sub-policies (left) and 8 sub-policies (right)]

where the dot is the current position and the crosses are the resulting states of the different sub-policies.

PyBullet

PyBullet is a free alternative to MuJoCo, with even better / more complex continuous control tasks. Install it with pip install -U pybullet.
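
Importing pybullet_envs registers the Bullet tasks with Gym, so they can then be created through the usual Gym interface; a minimal check (assuming gym and pybullet are installed as above):

# importing pybullet_envs registers AntBulletEnv-v0 and the other Bullet tasks with gym
import gym
import pybullet_envs  # noqa: F401  (imported only for its env-registration side effect)

env = gym.make("AntBulletEnv-v0")
obs = env.reset()
print(env.observation_space.shape, env.action_space.shape)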

AntBulletEnv-v0.

Number of subpolicies: 4

CUDA_VISIBLE_DEVICES=0 python main.py --algo ppo --use-gae --lr 3e-4 --clip-param 0.1 --actor-critic-epoch 10 --entropy-coef 0 --value-loss-coef 1 --gamma 0.99 --tau 0.95 --num-processes 8 --actor-critic-mini-batch-size 256 --actor-critic-epoch 10 --exp AntBulletEnv-0 --obs-type 'image' --env-name "AntBulletEnv-v1" --num-hierarchy 2 --num-subpolicy 4 --hierarchy-interval 128 --extend-drive 1 --num-steps 2048 2048 --reward-bounty 1 --distance l2 --transition-model-mini-batch-size 64 --train-mode together --unmask-value-function --summarize-behavior-interval 15 --aux r_3 --summarize-rendered-behavior --summarize-state-prediction
[Rendered observations (left); resulting states of the different sub-policies (right)]

where the dot is the current position and the crosses are the resulting states of the different sub-policies.

Note that for the above PyBullet domains:

  • We manually extract the position information from the observation and use it for the diversity drive. This is because we leave extracting useful information from raw observations as future work and mainly focus on verifying our diversity-driven solution.
  • We remove the reward returned by the original environment, so the experiment aims at letting DEHRL discover diverse sub-policies in an unsupervised manner (a sketch of such a wrapper follows this list). Extrinsic reward could of course be included, but the goal set by the current environment is too simple to show the effectiveness of diverse sub-policies.
  • Thus, this part of the experiments only shows that our method scales well to continuous control tasks; further investigation is a promising direction for future work.
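
The sketch below illustrates the first two points as a Gym wrapper. It is purely hypothetical: the position indices and the info key are made up for illustration, and the actual extraction in this repository may differ.

# Hypothetical wrapper illustrating the two modifications above: drop the extrinsic
# reward and expose the body position (here assumed to sit in the first two
# observation entries) for the diversity-driven distance.
import gym
import numpy as np

class UnsupervisedPositionWrapper(gym.Wrapper):
    def __init__(self, env, position_slice=slice(0, 2)):
        super().__init__(env)
        self.position_slice = position_slice

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        info["position"] = np.asarray(obs)[self.position_slice]  # fed to the diversity drive
        return obs, 0.0, done, info                              # extrinsic reward removed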

Other available environments in PyBullet can be found here.

Run Explore2D

The following results on Explore2D were produced at commit 9c2c8bbe5a8e18df21fbda1183e3e2a229a340d1.

Random policy baseline

CUDA_VISIBLE_DEVICES=0 python main.py --algo ppo --use-gae --lr 2.5e-4 --clip-param 0.1 --value-loss-coef 1 --num-processes 8 --actor-critic-mini-batch-size 256 --actor-critic-epoch 4 --exp 2d_explore --obs-type 'image' --env-name "Explore2D" --episode-length-limit 1024 --num-hierarchy 2 --num-subpolicy 5 --hierarchy-interval 4 --num-steps 128 128 --reward-bounty 0 --distance l2 --transition-model-mini-batch-size 64 --train-mode together --clip-reward-bounty --clip-reward-bounty-active-function linear --summarize-behavior-interval 5 --aux r_0

Run with 5 subpolicies

Number of levels: 2

CUDA_VISIBLE_DEVICES=0 python main.py --algo ppo --use-gae --lr 2.5e-4 --clip-param 0.1 --value-loss-coef 1 --num-processes 8 --actor-critic-mini-batch-size 256 --actor-critic-epoch 4 --exp 2d_explore --obs-type 'image' --env-name "Explore2D" --episode-length-limit 1024 --num-hierarchy 2 --num-subpolicy 5 --hierarchy-interval 4 --num-steps 128 128 --reward-bounty 1 --distance l2 --transition-model-mini-batch-size 64 --train-mode together --clip-reward-bounty --clip-reward-bounty-active-function linear --summarize-behavior-interval 5 --aux r_0

Number of levels: 3

CUDA_VISIBLE_DEVICES=0 python main.py --algo ppo --use-gae --lr 2.5e-4 --clip-param 0.1 --value-loss-coef 1 --num-processes 8 --actor-critic-mini-batch-size 256 --actor-critic-epoch 4 --exp 2d_explore --obs-type 'image' --env-name "Explore2D" --episode-length-limit 1024 --num-hierarchy 3 --num-subpolicy 5 5 --hierarchy-interval 4 4 --num-steps 128 128 128 --reward-bounty 1 --distance l2 --transition-model-mini-batch-size 64 64 --train-mode together --clip-reward-bounty --clip-reward-bounty-active-function linear --summarize-behavior-interval 5 --aux r_0

Number of levels: 4

CUDA_VISIBLE_DEVICES=0 python main.py --algo ppo --use-gae --lr 2.5e-4 --clip-param 0.1 --value-loss-coef 1 --num-processes 8 --actor-critic-mini-batch-size 256 --actor-critic-epoch 4 --exp 2d_explore --obs-type 'image' --env-name "Explore2D" --episode-length-limit 1024 --num-hierarchy 4 --num-subpolicy 5 5 5 --hierarchy-interval 4 4 4 --num-steps 128 128 128 128 --reward-bounty 1 --distance l2 --transition-model-mini-batch-size 64 64 64 --train-mode together --clip-reward-bounty --clip-reward-bounty-active-function linear --summarize-behavior-interval 5 --aux r_0

Visualize Explore2D: replace main.py with vis_explore2d.py in the corresponding command.

[Visualizations: Level 1 | Level 2 | Level 3 | Level 4]

Run with 9 subpolicies

Number of levels: 2

CUDA_VISIBLE_DEVICES=0 python main.py --algo ppo --use-gae --lr 2.5e-4 --clip-param 0.1 --value-loss-coef 1 --num-processes 8 --actor-critic-mini-batch-size 256 --actor-critic-epoch 4 --exp 2d_explore --obs-type 'image' --env-name "Explore2D" --episode-length-limit 1024 --num-hierarchy 2 --num-subpolicy 9 --hierarchy-interval 4 --num-steps 128 128 --reward-bounty 1 --distance l2 --transition-model-mini-batch-size 64 --train-mode together --clip-reward-bounty --clip-reward-bounty-active-function linear --summarize-behavior-interval 5 --aux r_0

Number of levels: 3

CUDA_VISIBLE_DEVICES=0 python main.py --algo ppo --use-gae --lr 2.5e-4 --clip-param 0.1 --value-loss-coef 1 --num-processes 8 --actor-critic-mini-batch-size 256 --actor-critic-epoch 4 --exp 2d_explore --obs-type 'image' --env-name "Explore2D" --episode-length-limit 1024 --num-hierarchy 3 --num-subpolicy 9 9 --hierarchy-interval 4 4 --num-steps 128 128 128 --reward-bounty 1 --distance l2 --transition-model-mini-batch-size 64 64 --train-mode together --clip-reward-bounty --clip-reward-bounty-active-function linear --summarize-behavior-interval 5 --aux r_0

Number of levels: 4

CUDA_VISIBLE_DEVICES=0 python main.py --algo ppo --use-gae --lr 2.5e-4 --clip-param 0.1 --value-loss-coef 1 --num-processes 8 --actor-critic-mini-batch-size 256 --actor-critic-epoch 4 --exp 2d_explore --obs-type 'image' --env-name "Explore2D" --episode-length-limit 1024 --num-hierarchy 4 --num-subpolicy 9 9 9 --hierarchy-interval 4 4 4 --num-steps 128 128 128 128 --reward-bounty 1 --distance l2 --transition-model-mini-batch-size 64 64 64 --train-mode together --clip-reward-bounty --clip-reward-bounty-active-function linear --summarize-behavior-interval 5 --aux r_0
[Visualizations: Level 1 | Level 2 | Level 3 | Level 4]

Results Visualization

The code logs multiple curves to help analyze the training process. Run:

tensorboard --logdir ../results/code_release/ --port 6009

and visit http://localhost:6009 to view them in TensorBoard.

Additionally, if you add the argument --log-behavior, multiple videos are saved in the log dir. However, this costs extra memory, so be careful when using it.

Run into some problems?

Please feel free to contact us or open an issue (or a pull request if you have already solved it, cheers!). We will keep track of the issues and keep updating the code.

Hope you have fun with our work!