robust_gaifo

Implementation of robust GAIL under transition dynamics mismatch

Credits

Our code is based on this open-source GAIL implementation: https://github.com/Khrylx/PyTorch-RL

Installation

You can replicate the virtual environment we used in our experiments from the file requirements.txt.
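
For example, assuming pip and a standard Python virtual environment:

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt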

Run the experiments

The commands we present here must be executed from the folder code.

The first step is to train an expert in the nominal environment. This can be done with PPO on any environment that exposes a Gym interface:

python run_ppo.py --env-name <environment name> --max-iter-num 3000 --save-model-interval 500

The model is saved in the folder code/assets/learned_model by default; you can change the destination by editing the variable subfolder in examples/ppo_gym.py. You can then save expert trajectories with the command:

python gail/save_expert_traj.py --env-name <environment name> --max-expert-state-num 1000 --model-path <path-to-the-PPO-model>

For gridworld environments, you can save the trajectories computed by Value Iteration with the command:

python gail/value_iteration_gym.py --max-expert-state-num 3000 --noiseE 0.0 --grid-type 1

The trajectories are saved by default in the folder assets/env<environment-name>/expert_traj; you can change that location by editing gail/save_expert_traj.py.

Run robust GAIFO with SLURM:

python run_experiment.py --env-name <environment-name> --algo gaifo --learning-rate 1e-4 --alpha 1.0 0.999 0.99 0.98 0.97 0.96 0.95 0.9 --num-threads 1 --min-batch-size 3000 --max-iter-num 1000 --log-interval 1 --save-model-interval 1 --expert-traj-path <path-to-trajectories> --seed 0 1 2

As a concrete example:

 python run_experiment.py --env-name InvertedDoublePendulum-v2 --algo gaifo --learning-rate 1e-4 
--alpha 1.0 0.999 0.99 0.98 0.97 0.96 0.95 0.9 --num-threads 1 --min-batch-size 3000 --max-iter-num 500 
--log-interval 1 --save-model-interval 1 --mass-mulL 0.5 0.75 1.0 1.25 1.5 2.0  --mass-mulE 1.0 
--expert-traj-path assets/envInvertedDoublePendulum-v2mass1.0/expert_traj/InvertedDoublePendulum-v2_state_only_expert_traj.p 
--seed 2 3 4 --reward-type positive --exp-type friction

As the concrete example shows, you can pass several "mismatch" values through --mass-mulL. In our experiments, these values are multipliers applied to the mass or friction of the learner's environment.

To incorporate a mismatch relevant to your own experiment, edit gail/algo_gym.py in the same way we change the mass of the MuJoCo agents.
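
For reference, here is a minimal sketch of how such a mismatch can be applied to a MuJoCo Gym environment. This is a hypothetical illustration, not the actual code in gail/algo_gym.py, and attribute names may vary across gym/MuJoCo versions:

import gym

def make_mismatched_env(env_name, mass_mul=1.0, friction_mul=1.0):
    # Hypothetical helper: build the environment and rescale its dynamics in place.
    env = gym.make(env_name)
    model = env.unwrapped.model  # underlying MuJoCo model
    model.body_mass[:] = model.body_mass * mass_mul              # scale every body mass
    model.geom_friction[:] = model.geom_friction * friction_mul  # scale friction coefficients
    return env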

Please also edit run_experiment.py to set an appropriate saving path; if the MuJoCo path works for your setup, simply add your environment name to the if condition at line 90.

Run AIRL:

Similarly, you can run the AIRL baseline:

 python run_experiment.py --env-name InvertedDoublePendulum-v2 --algo airl --learning-rate 1e-4 
--alpha 1.0 --num-threads 1 --min-batch-size 3000 --max-iter-num 500 
--log-interval 1 --save-model-interval 1 --mass-mulL 0.5 0.75 1.0 1.25 1.5 2.0  --mass-mulE 1.0 
--expert-traj-path assets/envInvertedDoublePendulum-v2mass1.0/expert_traj/InvertedDoublePendulum-v2_state_only_expert_traj.p 
--seed 2 3 4 --reward-type positive --exp-type friction

Note that when you use gaifo or airl as the algorithm, you must pass state-only trajectories, which are saved by default as <environment-name>_state_only_expert_traj.p.
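
To sanity-check a trajectory file before training, you can load it directly. This is a minimal sketch that assumes the file holds a pickled array of expert states; the exact on-disk layout may differ:

import pickle

# Path taken from the concrete example above
path = 'assets/envInvertedDoublePendulum-v2mass1.0/expert_traj/InvertedDoublePendulum-v2_state_only_expert_traj.p'
with open(path, 'rb') as f:
    expert_traj = pickle.load(f)
# Report what was stored, e.g. the number of expert states and their dimension
print(type(expert_traj), getattr(expert_traj, 'shape', None))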

Finally, note the --reward-type argument. It is a well-known issue in imitation learning that, for some environments, -log(D) works better than log(1-D) as a reward. Select negative to use the former or positive to use the latter.
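
As a minimal sketch of the two formulations (hypothetical code, not taken from the repository; d is assumed to be the discriminator output in (0, 1)):

import torch

def imitation_reward(d, reward_type='positive'):
    # 'negative' selects -log(D), 'positive' selects log(1 - D), following the flag naming above
    if reward_type == 'negative':
        return -torch.log(d)
    return torch.log(1.0 - d)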

Evaluation:

In this work, we use two different types of evaluation.

Learning under mismatch

To evaluate the expert and gaifo for different values of alpha:

python analysis/compare_mujoco_performance.py --env-name <environment-name> --alg expert gaifo --alpha 1.0 0.999 0.99 0.98 0.97 0.96 0.95 0.9 --mass-muls 0.5 0.75 1.0 1.5 2.0 --seed 2 3 4 --friction

The argument --friction can be removed to evaluate the mass mismatch rather than the friction mismatch.
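
For example, to evaluate under mass mismatch:

python analysis/compare_mujoco_performance.py --env-name <environment-name> --alg expert gaifo --alpha 1.0 0.999 0.99 0.98 0.97 0.96 0.95 0.9 --mass-muls 0.5 0.75 1.0 1.5 2.0 --seed 2 3 4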

To evaluate airl:

python analysis/compare_mujoco_performance.py --env-name <environment-name> --alg airl --alpha 1.0 0.999 0.99 0.98 0.97 0.96 0.95 0.9 --mass-muls 0.5 0.75 1.0 1.5 2.0 --seed 2 3 4 --friction

Evaluating the robustness of the learned policies

To evaluate the expert and gaifo for different values of alpha:

python analysis/compare_mujoco_robustness.py --env-name <environment-name> --alg gaifo --alpha 1.0 0.999 0.99 0.98 0.97 0.96 0.95 0.9 --mass-muls 0.5 0.75 1.0 1.5 2.0 --seed 0 1 2 --mismatch 1.5 --friction

To evaluate airl:

python analysis/compare_mujoco_robustness.py --env-name <environment-name> --alg airl --alpha 1.0 0.999 0.99 0.98 0.97 0.96 0.95 0.9 --mass-muls 0.5 0.75 1.0 1.5 2.0 --seed 2 3 4 --friction

The argument --mismatch denotes the mismatch used during learning, while --mass-muls specifies the range of masses to evaluate at test time.

Reacher Experiment

First, train a PPO expert on the Reacher environment:

python run_ppo.py --env-name gym_reacher:reachNoisy-v0 --max-iter-num 3000 --save-model-interval 500 --mass-mul 0.0

The model is saved in the folder code/assets/envgym_reacher:reachNoisy-v0noise_var0.0/learned_model/

Collect trajectories

python gail/save_expert_traj.py --env-name gym_reacher:reachNoisy-v0 --max-expert-state-num 1000 --model-path assets/envgym_reacher:reachNoisy-v0noise_var0.0/learned_model/gym_reacher:reachNoisy-v0_ppo.p --mass-mul 0.0

The trajectories are saved at assets/envgym_reacher:reachNoisy-v0noise_var0.0/expert_traj/gym_reacher:reachNoisy-v0_state_only_expert_traj.p

Finally, run the GAIFO experiments with:

 python run_experiment.py --env-name gym_reacher:reachNoisy-v0 --algo gaifo --learning-rate 1e-4 
--alpha 1.0 0.999 0.99 0.98 0.97 0.96 0.95 0.9 --num-threads 1 --min-batch-size 3000 --max-iter-num 500 
--log-interval 1 --save-model-interval 1 --mass-mulL 0.0 0.5 0.75 1.0 1.25 1.5 2.0  --mass-mulE 0.0 
--expert-traj-path assets/envgym_reacher:reachNoisy-v0noise_var0.0/expert_traj/gym_reacher:reachNoisy-v0_state_only_expert_traj.p 
--seed 2 3 4 --reward-type positive

If the results are poor, try --reward-type negative.

--mass-mulL 0.0 0.5 0.75 1.0 1.25 1.5 2.0 is only an example; for this environment these values set the noise variance, and different variances induce different degrees of mismatch.

Running the experiments without SLURM

The commands given above assume that SLURM is installed on your cluster to handle the parallelization of the different simulations.

You can still run the code without SLURM as follows:

Train a PPO agent with:

python examples/ppo_gym.py --env-name gym_reacher:reachNoisy-v0 --max-iter-num 3000 --save-model-interval 500 --mass-mul 0.0 --num-threads 1

Save the resulting trajectories with:

python gail/save_expert_traj.py --env-name gym_reacher:reachNoisy-v0 --max-expert-state-num 1000 --model-path assets/envgym_reacher:reachNoisy-v0noise_var0.0/learned_model/gym_reacher:reachNoisy-v0_ppo.p --mass-mul 0.0

Finally, you can run robust GAIFO:

 python gail/algo_gym.py --env-name gym_reacher:reachNoisy-v0 --alg gaifo --learning-rate 1e-4 
--alpha 1.0 --num-threads 1 --min-batch-size 3000 --max-iter-num 500 
--log-interval 1 --save-model-interval 1 --mass-mulL 0.1  --mass-mulE 0.0 
--expert-traj-path assets/envgym_reacher:reachNoisy-v0noise_var0.0/expert_traj/gym_reacher:reachNoisy-v0_state_only_expert_traj.p 
--seed 0 --reward-type positive