This is a PyTorch implementation of Generative Adversarial Imitation Learning (GAIL) [1] and Adversarial Inverse Reinforcement Learning (AIRL) [2], based on PPO [3]. I tried to make it easy for readers to understand the algorithms. Please let me know if you have any questions.
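To give a rough idea of the core mechanism, below is a minimal, schematic sketch of how a GAIL-style discriminator can turn classification of (state, action) pairs into a surrogate reward for the policy update. The class and method names are illustrative only and do not correspond to the actual modules in this repository.

```python
import torch
from torch import nn
from torch.nn import functional as F

class Discriminator(nn.Module):
    """Classifies (state, action) pairs as expert (label 1) or policy (label 0)."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),  # outputs a logit; D(s, a) = sigmoid(logit)
        )

    def forward(self, states, actions):
        return self.net(torch.cat([states, actions], dim=-1))

    def reward(self, states, actions):
        # Surrogate reward -log(1 - D(s, a)), computed from the logit in a
        # numerically stable way; the policy is then trained (e.g. with PPO)
        # to maximize this reward instead of the environment reward.
        with torch.no_grad():
            logits = self.forward(states, actions)
        return -F.logsigmoid(-logits)
```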
You can install the Python libraries using `pip install -r requirements.txt`. Note that you need a MuJoCo license. Please follow the instructions in mujoco-py for help.
You can train experts using Soft Actor-Critic (SAC) [4, 5]. We set `num_steps` to 100000 for InvertedPendulum-v2 and 1000000 for Hopper-v3. Also, I've prepared the expert's weights here. Please use them if you're only interested in the experiments ahead.
```bash
python train_expert.py --cuda --env_id InvertedPendulum-v2 --num_steps 100000 --seed 0
```
You need to collect demonstrations using the trained expert's weights. Note that `--std` specifies the standard deviation of the Gaussian noise added to the action, and `--p_rand` specifies the probability that the expert acts randomly. We set `std` to 0.01 so as not to collect overly similar trajectories.
```bash
python collect_demo.py \
--cuda --env_id InvertedPendulum-v2 \
--weight weights/InvertedPendulum-v2.pth \
--buffer_size 1000000 --std 0.01 --p_rand 0.0 --seed 0
```
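As a rough illustration of what `--std` and `--p_rand` do during collection, here is a hypothetical sketch of the action-selection step; the function and variable names are assumptions and not the actual code in collect_demo.py.

```python
import numpy as np

def noisy_expert_action(expert_policy, env, state, std=0.01, p_rand=0.0):
    """Return the expert's action, perturbed as controlled by --std and --p_rand."""
    if np.random.rand() < p_rand:
        # With probability p_rand, ignore the expert and act uniformly at random.
        return env.action_space.sample()
    action = expert_policy(state)
    # Otherwise, add zero-mean Gaussian noise with standard deviation std.
    action = action + np.random.randn(*action.shape) * std
    return np.clip(action, env.action_space.low, env.action_space.high)
```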
Mean returns of the experts we use in the experiments are listed below.
| Weight (Env) | std | p_rand | Mean Return (without noise) |
| --- | --- | --- | --- |
| InvertedPendulum-v2.pth | 0.01 | 0.0 | 1000 (1000) |
| Hopper-v3.pth | 0.01 | 0.0 | 2534 (2791) |
You can train IL using the collected demonstrations. We set `rollout_length` to 2000 for InvertedPendulum-v2 and 50000 for Hopper-v3.
```bash
python train_imitation.py \
--algo gail --cuda --env_id InvertedPendulum-v2 \
--buffer buffers/InvertedPendulum-v2/size1000000_std0.01_prand0.0.pth \
--num_steps 100000 --eval_interval 5000 --rollout_length 2000 --seed 0
```
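For intuition, imitation training roughly alternates between collecting `rollout_length` environment steps with the current policy, updating the discriminator on policy vs. expert samples, and running a PPO update on rewards derived from the discriminator. The sketch below is a schematic outline under those assumptions; the callables it takes are placeholders and do not map one-to-one onto this repository's code.

```python
def train_imitation(env, select_action, update_disc, update_ppo, disc_reward,
                    sample_expert, num_steps=100_000, rollout_length=2_000):
    """Alternate data collection, discriminator updates, and PPO updates.

    select_action, update_disc, update_ppo, disc_reward, and sample_expert are
    placeholder callables standing in for the real policy/discriminator components.
    """
    state = env.reset()
    rollout = []
    for step in range(num_steps):
        action = select_action(state)
        next_state, _, done, _ = env.step(action)  # environment reward is ignored
        rollout.append((state, action, next_state, done))
        state = env.reset() if done else next_state

        if len(rollout) == rollout_length:
            # Train the discriminator to tell expert transitions from policy ones.
            update_disc(policy_batch=rollout, expert_batch=sample_expert())
            # Relabel the rollout with discriminator-derived rewards and
            # improve the policy with PPO on the relabeled data.
            rewards = [disc_reward(s, a) for (s, a, _, _) in rollout]
            update_ppo(rollout, rewards)
            rollout = []
```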
[1] Ho, Jonathan, and Stefano Ermon. "Generative adversarial imitation learning." Advances in neural information processing systems. 2016.
[2] Fu, Justin, Katie Luo, and Sergey Levine. "Learning robust rewards with adversarial inverse reinforcement learning." arXiv preprint arXiv:1710.11248 (2017).
[3] Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
[4] Haarnoja, Tuomas, et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor." arXiv preprint arXiv:1801.01290 (2018).
[5] Haarnoja, Tuomas, et al. "Soft actor-critic algorithms and applications." arXiv preprint arXiv:1812.05905 (2018).