Defending against adversarial policies in YouShallNotPass by adversarial fine-tuning. Policies are trained in an alternating fashion: the adversary is trained for t1 time-steps, then the victim for t2 time-steps, then the adversary again for t3 time-steps, and so on. The training times ti increase exponentially.
Figure: bursts training. Left: opponents ('normal' pre-trained, adversary trained from scratch, victim policy) trained in an alternating way; middle: 'burst' size; right: win rate.
Figure: bursts training. Left: mean reward for the agents; right: value loss for the agents.
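The alternation scheme can be summarized with a short sketch (illustrative only; the function name and default values below are not taken from the repository's code):

```python
# Illustrative sketch of the 'bursts' schedule described above (not the code
# used in the repository): the side being trained alternates between the
# adversary and the victim, and each burst is longer than the previous one.

def burst_schedule(initial_steps=1000, growth=2.0, n_bursts=8):
    """Yield (side_to_train, n_timesteps) pairs with exponentially growing bursts."""
    steps = initial_steps
    for i in range(n_bursts):
        yield ("adversary" if i % 2 == 0 else "victim"), int(steps)
        steps *= growth

for side, steps in burst_schedule():
    print(f"train {side} for {steps} time-steps")
```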
In this repository:
- The YouShallNotPass environment is exported into rllib as a multi-agent environment (a schematic sketch of such an adapter follows this list)
- Training in 'bursts' is implemented: the victim and the adversary are trained against each other, the policy being trained switches every ti time-steps, and ti increases exponentially
- The victim is trained against multiple adversaries as well as the normal opponent ('population-based training')
- Stable Baselines is connected to rllib: samples are collected with rllib and optimization is done with Stable Baselines
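As a schematic illustration of the first point, a two-player environment can be wrapped as an rllib MultiAgentEnv roughly as follows. The class name, agent IDs, and the assumed tuple-based API of the wrapped environment are made up for this sketch; the real adapter is gym_compete_rllib/gym_compete_to_rllib.py.

```python
# Hypothetical sketch of exposing a two-player environment to rllib as a
# MultiAgentEnv; the real adapter is gym_compete_rllib/gym_compete_to_rllib.py.
# The wrapped environment is assumed to take/return per-player tuples.
from ray.rllib.env.multi_agent_env import MultiAgentEnv


class TwoPlayerToRLLib(MultiAgentEnv):
    AGENT_IDS = ("player_1", "player_2")  # assumed agent names

    def __init__(self, make_env):
        self._env = make_env()

    def reset(self):
        observations = self._env.reset()
        return dict(zip(self.AGENT_IDS, observations))

    def step(self, action_dict):
        actions = tuple(action_dict[agent] for agent in self.AGENT_IDS)
        observations, rewards, dones, infos = self._env.step(actions)
        return (
            dict(zip(self.AGENT_IDS, observations)),
            dict(zip(self.AGENT_IDS, rewards)),
            {"__all__": all(dones)},  # rllib requires the '__all__' done flag
            dict(zip(self.AGENT_IDS, infos)),
        )
```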
Very simple: pull a Docker image
- First, pull the image:
$ docker pull humancompatibleai/better-adversarial-defenses
- To run tests (will ask for a MuJoCo license):
$ docker run -it humancompatibleai/better-adversarial-defenses
- To run the terminal:
$ docker run -it humancompatibleai/better-adversarial-defenses /bin/bash
Alternatively, build the Docker image yourself:
- Install Docker and git
- Clone the repository:
$ git clone https://github.com/HumanCompatibleAI/better-adversarial-defenses.git
- Build the Docker image:
$ docker build -t ap_rllib better-adversarial-defenses
- Run tests:
$ docker container run -it ap_rllib
- Run shell:
$ docker container run -it ap_rllib /bin/bash
Full manual installation, assuming Ubuntu Linux or a compatible distribution.
Tested on Ubuntu 18.04.5 LTS and in WSL. A GPU is not required for the project.
The full installation procedure can also be found in the Dockerfile.
- Install miniconda
- Clone the repository with its submodules:
$ git clone --recursive https://github.com/HumanCompatibleAI/better-adversarial-defenses.git
- Create the environments from the files adv-tf1.yml and adv-tf2.yml (tf1 is used for Stable Baselines, and tf2 is used for rllib):
$ conda env create -f adv-tf1.yml
$ conda env create -f adv-tf2.yml
- Install MuJoCo 1.31. On headless setups, also install Xvfb
- Install MongoDB and create a database named chai
- Install gym_compete and aprl via setup.py (included in the repository as submodules):
$ pip install -e multiagent-competition
$ pip install -e adversarial-policies
- With ray 0.8.6 installed, patch your ray installation:
$ python ray/python/ray/setup-dev.py
- Install fonts for rendering:
$ conda install -c conda-forge mscorefonts; mkdir ~/.fonts; cp $CONDA_PREFIX/fonts/*.ttf ~/.fonts; fc-cache -f -v
- Install the project:
$ pip install -e .
- To test the setup with the rllib PPO trainer, run:
(adv-tf2) $ python -m ap_rllib.train --tune test
- The script automatically logs results to Sacred and Tune.
- By default, the script asks which configuration to run, but it can also be set manually with the --tune argument.
- Log files will appear in ~/ray_results/run_type/run_name. Use TensorBoard in this folder.
- Checkpoints will be in ~/ray_results/xxx/checkpoint_n/, where xxx and n are stored in the log files, with one entry for every iteration. See the example notebook or the script for obtaining the last checkpoint for details on how to do that (a minimal sketch is also given at the end of this section).
- Some specific configurations:
  - --tune external_cartpole runs training in InvertedPendulum, using the Stable Baselines PPO implementation.
    - Before running, launch the Stable Baselines server:
      (adv-tf1) $ python -m frankenstein.stable_baselines_server
    - By default, each policy is trained in a separate thread, so that environment data collection resumes as soon as possible. However, this increases the number of threads significantly with PBT and many parallel tune trials.
    - If the number of threads is too high, the --serial option disables multi-threaded training in the Stable Baselines server. The overhead is not significant, as training finishes very quickly compared to data collection.
  - --tune bursts_exp_withnormal_pbt_sb runs training with Stable Baselines + bursts + the normal opponent included + PBT (multiple adversaries).
    - Before running, launch the Stable Baselines server as above.
- --verbose enables some additional output.
- --show_config only shows the configuration and exits.
- --resume restarts trials if there are already trials with this name in the results directory.
  - The notebook tune_pre_restart.ipynb converts ray 0.8.6 checkpoints to ray 1.0.1 checkpoints.
- If you want to iterate quickly with your config (smaller batch size and no remote workers), pass an option to the trainer:
  --config_override='{"train_batch_size": 1000, "sgd_minibatch_size": 1000, "num_workers": 0, "_run_inline": 1}'
- A large number of processes might run into the open-files limit. This might help:
$ ulimit -n 999999
- To make a video:
  - (only on headless setups) start a virtual display:
    $ Xvfb -screen 0 1024x768x24 & export DISPLAY=:0
  - Run:
    (adv-tf2) $ python -m ap_rllib.make_video --checkpoint path/to/checkpoint/checkpoint-xxx --config your-config-at-training --display $DISPLAY
  - --steps n sets the number of steps to run (1 corresponds to 256 steps, which is approximately 1 episode).
  - --load_normal True evaluates against the normal opponent instead of the trained one.
  - --no_video True disables the video; use this to evaluate performance on more episodes faster.
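As mentioned in the checkpoints note above, the most recent checkpoint can be located with a short script. The sketch below is illustrative only and assumes the standard Tune layout (~/ray_results/.../checkpoint_<n>/checkpoint-<n>); the repository's own notebook and script are the reference.

```python
# Minimal sketch for finding the newest checkpoint file under ~/ray_results,
# assuming the standard Tune layout .../checkpoint_<n>/checkpoint-<n>.
import glob
import os
import re


def last_checkpoint(results_dir="~/ray_results"):
    """Return the most recently modified checkpoint file path."""
    pattern = os.path.join(os.path.expanduser(results_dir),
                           "**", "checkpoint_*", "checkpoint-*")
    candidates = [path for path in glob.glob(pattern, recursive=True)
                  if re.search(r"checkpoint-\d+$", path)]  # skip .tune_metadata files
    if not candidates:
        raise FileNotFoundError(f"no checkpoints found under {results_dir}")
    return max(candidates, key=os.path.getmtime)


print(last_checkpoint())
```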
- We use ray because of its multi-agent support, and thus we have to use TensorFlow 2.0.
- We use Stable Baselines for training because we were unable to replicate the results with rllib, even with an independent hyperparameter search.
- We checkpoint the ray trainer and restore it, and run the whole thing in a separate process, to circumvent the ray memory leak issue (a rough sketch of this pattern follows).
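A rough sketch of that last workaround, assuming a hypothetical make_trainer factory that builds an rllib trainer; this is not the repository's actual code, only an illustration of the checkpoint-and-restart-in-a-subprocess pattern.

```python
# Hypothetical sketch of the checkpoint-and-restart-in-a-subprocess pattern:
# each child process trains for a few iterations, saves a checkpoint and exits,
# so any memory leaked inside it is released; the next child restores from the
# checkpoint. `make_trainer` is an assumed factory that builds an rllib trainer.
import multiprocessing as mp


def _train_chunk(make_trainer, checkpoint_in, queue, iterations_per_chunk=10):
    trainer = make_trainer()
    if checkpoint_in is not None:
        trainer.restore(checkpoint_in)
    for _ in range(iterations_per_chunk):
        trainer.train()
    queue.put(trainer.save())  # send back the path of the new checkpoint


def train_in_chunks(make_trainer, n_chunks=100):
    checkpoint = None
    for _ in range(n_chunks):
        queue = mp.Queue()
        worker = mp.Process(target=_train_chunk,
                            args=(make_trainer, checkpoint, queue))
        worker.start()
        checkpoint = queue.get()  # blocks until the chunk finishes
        worker.join()             # the child exits, freeing leaked memory
    return checkpoint
```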
Files:
- ap_rllib/train.py: the main training script
- ap_rllib/config.py: configurations for the training script
- ap_rllib/helpers.py: helper functions for the whole project
- ap_rllib/make_video.py: creates videos of the policies
- frankenstein/remote_trainer.py: implements an RLLib trainer that pickles the data and sends the filename via HTTP
- frankenstein/stable_baselines_server.py: implements an HTTP server that waits for weights and samples, then trains the policy and returns the updated weights
- frankenstein/stable_baselines_external_data.py: implements the 'fake' Runner that allows training with the Stable Baselines ppo2 algorithm on existing data
- gym_compete_rllib/gym_compete_to_rllib.py: implements the adapter from multicomp to rllib environments, and the rllib policy that loads pre-trained weights from multicomp
- gym_compete_rllib/load_gym_compete_policy.py: loads the multicomp weights into a Keras policy
- gym_compete_rllib/layers.py: implements the observation/value function normalization code from MlpPolicyValue (multiagent-competition/gym_compete/policy.py)
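To make the frankenstein split more concrete, the sketch below shows the general shape of a weights-and-samples exchange over HTTP using only the standard library. The endpoint, request/response fields, and the update_policy placeholder are invented for this illustration and do not reflect the repository's actual protocol.

```python
# Schematic sketch of a weights-and-samples exchange over HTTP (stdlib only);
# the real protocol lives in frankenstein/remote_trainer.py and
# frankenstein/stable_baselines_server.py. Request/response fields are invented.
import json
import pickle
from http.server import BaseHTTPRequestHandler, HTTPServer


def update_policy(weights, samples):
    """Placeholder for an optimizer step (e.g. a ppo2 update) on the samples."""
    return weights


class TrainHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        request = json.loads(self.rfile.read(length))
        with open(request["data_path"], "rb") as f:  # pickled weights + samples
            payload = pickle.load(f)
        new_weights = update_policy(payload["weights"], payload["samples"])
        out_path = request["data_path"] + ".updated"
        with open(out_path, "wb") as f:
            pickle.dump(new_weights, f)
        body = json.dumps({"weights_path": out_path}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("localhost", 8000), TrainHandler).serve_forever()
```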
Folders:
- ap_rllib_experiment_analysis/notebooks: notebooks that analyze runs
- ap_rllib_experiment_analysis: scripts that help with analyzing runs
- frankenstein: the code for integrating Stable Baselines and RLLib
- gym_compete_rllib: connects rllib to the multicomp environment
Submodules:
- adversarial-policies: the original project by Adam Gleave
- multiagent-competition: the environments used in the original project, as well as saved weights
- ray: a copy of the ray repository with patches to make the project work
Other files and folders:
- memory_profile, oom_dummy: files and data used to analyze the memory leak
- rock_paper_scissors: sketch implementations of ideas on the Rock-Paper-Scissors game
- tf_agents_ysp.py: training in YouShallNotPass with tf-agents
- rlpyt_run.py: training in YouShallNotPass with rlpyt
- rs.ipynb: random search with a constant-output policy in YouShallNotPass
- evolve.ipynb and evolve.py: training in YouShallNotPass with neat-python