better-adversarial-defenses

Training in bursts for defending against adversarial policies

Better defenses against Adversarial Policies in Reinforcement Learning

This repository defends against adversarial policies in YouShallNotPass through adversarial fine-tuning. The policies are trained in alternation: the adversary is trained for t1 time-steps, then the victim for t2 time-steps, then the adversary again for t3 time-steps, and so on. The training times ti increase exponentially.
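
The alternating schedule can be sketched as follows. This is a minimal illustration and not the repository's implementation: the function name burst_schedule, the initial burst size, and the growth factor are assumptions chosen for the example.

```python
from typing import Iterator, Tuple


def burst_schedule(first_burst: int, growth: float,
                   total_steps: int) -> Iterator[Tuple[str, int]]:
    """Yield (policy_to_train, burst_length) pairs until total_steps is used up.

    The adversary trains first, then the victim, and so on. Each burst is
    `growth` times longer than the previous one (hypothetical parameters).
    """
    if first_burst < 1 or growth < 1.0:
        raise ValueError("first_burst must be >= 1 and growth must be >= 1")
    policies = ("adversary", "victim")
    used, burst, index = 0, float(first_burst), 0
    while used < total_steps:
        length = min(int(burst), total_steps - used)  # clip the final burst
        yield policies[index % 2], length
        used += length
        burst *= growth  # exponential increase of the burst size
        index += 1


for policy, steps in burst_schedule(first_burst=1000, growth=2.0, total_steps=10000):
    print(f"train {policy} for {steps} steps")
```

With these example values the bursts are 1000, 2000, 4000 and a final clipped burst of 3000 time-steps, alternating between the adversary and the victim.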

Bursts training: (left) training opponents ('normal' pre-trained, adversary trained from scratch, victim policy) in an alternating way, (middle) 'burst' size, (right) win rate

Bursts training: (left) mean reward for agents, (right) value loss for agents

In this repository:

  1. The YouShallNotPass environment is exported into rllib as a multiagent environment
  2. Training in 'bursts' is implemented: the victim and the adversary are trained against each other, the policy being trained switches every ti time-steps, and ti increases exponentially
  3. The victim is trained against multiple adversaries as well as the normal opponent ('population-based training')
  4. Stable Baselines is connected to rllib: samples are collected with rllib and the optimization is done with Stable Baselines
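
As an illustration of item 3, the sketch below picks the victim's opponent for one episode from a pool made of several adversaries plus the normal opponent. The uniform sampling rule, the function name sample_opponent, and the policy names are assumptions made for this example; how the repository actually assigns opponents is defined in its configuration.

```python
import random
from typing import List, Optional


def sample_opponent(num_adversaries: int, include_normal: bool = True,
                    rng: Optional[random.Random] = None) -> str:
    """Pick the victim's opponent for one episode, uniformly at random.

    Uniform sampling is an assumption of this sketch, not a documented rule.
    """
    chooser = rng if rng is not None else random.Random()
    pool: List[str] = [f"adversary_{i}" for i in range(num_adversaries)]
    if include_normal:
        pool.append("normal")  # the pre-trained 'normal' opponent
    if not pool:
        raise ValueError("the opponent pool is empty")
    return chooser.choice(pool)


print(sample_opponent(num_adversaries=3, rng=random.Random(0)))
```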

Setup

Very simple: pull a Docker image

  1. First, pull the image:

    $ docker pull humancompatibleai/better-adversarial-defenses

  2. To run tests (this will ask for a MuJoCo license):

    $ docker run -it humancompatibleai/better-adversarial-defenses

  3. To open a shell in the container:

    $ docker run -it humancompatibleai/better-adversarial-defenses /bin/bash

A bit harder: build a Docker image

  1. Install Docker and git
  2. Clone the repository: $ git clone https://github.com/HumanCompatibleAI/better-adversarial-defenses.git
  3. Build the Docker image: $ docker build -t ap_rllib better-adversarial-defenses
  4. Run tests: $ docker container run -it ap_rllib
  5. Run shell: $ docker container run -it ap_rllib /bin/bash

Hard: set up the environment manually

These steps assume Ubuntu Linux or a compatible distribution.

Tested on Ubuntu 18.04.5 LTS and WSL. A GPU is not required for the project.

The full installation procedure can be found in the Dockerfile.

  1. Install miniconda
  2. $ git clone --recursive https://github.com/HumanCompatibleAI/better-adversarial-defenses.git
  3. Create environments from files adv-tf1.yml and adv-tf2.yml (tf1 is used for stable baselines, and tf2 is used for rllib):
    • $ conda env create -f adv-tf1.yml
    • $ conda env create -f adv-tf2.yml
  4. Install MuJoCo 1.13. On headless setups, install Xvfb
  5. Install MongoDB and create a database chai
  6. Install gym_compete and aprl via setup.py (both are included in the repository as submodules):
    • $ pip install -e multiagent-competition
    • $ pip install -e adversarial-policies
  7. Having ray 0.8.6 installed, run $ python ray/python/ray/setup-dev.py to patch your ray installation
  8. Install fonts for rendering: $ conda install -c conda-forge mscorefonts; mkdir ~/.fonts; cp $CONDA_PREFIX/fonts/*.ttf ~/.fonts; fc-cache -f -v
  9. Install the project: $ pip install -e .
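
After a manual setup, a quick way to see whether the packages named in the steps above are importable is sketched below. The helper missing_packages is hypothetical and not part of the repository. Which package belongs to which conda environment (adv-tf1 or adv-tf2) is not checked here, so run it inside the environment you want to verify.

```python
import importlib.util
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level packages from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Package names are taken from the installation steps; adjust the list
# to the conda environment that is currently active.
missing = missing_packages(["ray", "gym_compete", "aprl", "ap_rllib"])
print("missing:", ", ".join(missing) if missing else "none")
```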

How to train

  1. To test the setup with the rllib PPO trainer, run:

    (adv-tf2) $ python -m ap_rllib.train --tune test

    • The script automatically logs results to Sacred and Tune.

    • By default, the script asks which configuration to run, but the configuration can be set directly with the --tune argument.

    • Log files will appear in ~/ray_results/run_type/run_name. Use TensorBoard in this folder.

      • run_type is determined by the configuration (config['_call']['name'] attribute). See config.py.
      • run_name is determined by tune -- see output of the train script.
    • Checkpoints will be in ~/ray_results/xxx/checkpoint_n/ where xxx and n are stored in the log files, one entry for every iteration. See an example notebook or a script obtaining the last checkpoint for details on how to do that.

    • Some specific configurations:

      • --tune external_cartpole runs training in InvertedPendulum, using Stable Baselines PPO implementation.
        • Before running, launch the Stable Baselines server (adv-tf1) $ python -m frankenstein.stable_baselines_server
          • By default, each policy is trained in a separate thread, so that environment data collection resumes as soon as possible
          • However, this increases the number of threads significantly when PBT is combined with many parallel tune trials.
          • If the number of threads is too high, the --serial option disables multi-threaded training in the Stable Baselines server
          • The overhead of serial training is small, because training finishes extremely quickly compared to data collection
      • --tune bursts_exp_withnormal_pbt_sb will run training with Stable Baselines + Bursts + Normal opponent included + PBT (multiple adversaries)
    • --verbose enables some additional output

    • --show_config only shows configuration and exits

    • --resume will re-start trials if there are already trials in the results directory with this name

    • To iterate quickly on your config (with a smaller batch size and no remote workers), pass this option to the trainer:

      --config_override='{"train_batch_size": 1000, "sgd_minibatch_size": 1000, "num_workers": 0, "_run_inline": 1}'

    • A large number of processes might hit the open-files limit. Raising the limit might help: ulimit -n 999999

  2. To make a video:

    • (only on headless setups): $ Xvfb -screen 0 1024x768x24 & export DISPLAY=:0

    • Run (adv-tf2) $ python -m ap_rllib.make_video --checkpoint path/to/checkpoint/checkpoint-xxx --config your-config-at-training --display $DISPLAY

      • --steps n sets the number of steps to record (1 corresponds to 256 steps, which is approximately 1 episode)
      • --load_normal True evaluates against the normal opponent instead of the trained one
      • --no_video True disables video recording. Use this to evaluate performance over more episodes faster
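
The --checkpoint argument needs the path of a saved checkpoint. A minimal sketch for locating the most recent checkpoint_n directory inside one trial folder is shown below. It assumes only the checkpoint_n naming described above. The helper latest_checkpoint_dir is hypothetical and is not the repository's own script for obtaining the last checkpoint.

```python
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint_dir(trial_dir: Path) -> Optional[Path]:
    """Return the checkpoint_n subdirectory with the largest n, or None."""
    best, best_n = None, -1
    for child in trial_dir.iterdir():
        match = re.fullmatch(r"checkpoint_(\d+)", child.name)
        # Compare numerically so that checkpoint_10 ranks above checkpoint_2.
        if match and child.is_dir() and int(match.group(1)) > best_n:
            best, best_n = child, int(match.group(1))
    return best
```

Call it with the trial folder reported by the train script, then pass the checkpoint file found inside the returned directory to make_video.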

Design choices

  1. We use ray because of its multi-agent support, and thus we have to use TensorFlow 2.0
  2. We use stable baselines for training because we were unable to replicate results with rllib, even with an independent search for hyperparameters.
  3. We checkpoint the ray trainer and restore it, and run the whole thing in a separate process to circumvent the ray memory leak issue
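
The idea behind design choice 3 can be sketched with the standard library alone: each unit of work runs in a fresh interpreter, only the checkpoint path is handed from one process to the next, and any leaked memory is released when the process exits. The worker below is a hypothetical stand-in that performs no training, and restarting once per iteration is an assumption of this sketch; the repository may restart at a different granularity.

```python
import subprocess
import sys
from typing import Optional

# Hypothetical worker: a real one would restore the trainer from the checkpoint
# given in argv[1], train, save, and print the path of the new checkpoint.
WORKER_CODE = """
import sys
previous, iteration = sys.argv[1], int(sys.argv[2])
print(f"checkpoint_{iteration}")
"""


def train_in_fresh_processes(num_iterations: int) -> Optional[str]:
    """Run each iteration in a new interpreter; its memory is freed on exit."""
    checkpoint = None
    for iteration in range(1, num_iterations + 1):
        result = subprocess.run(
            [sys.executable, "-c", WORKER_CODE, checkpoint or "", str(iteration)],
            capture_output=True, text=True, check=True,
        )
        checkpoint = result.stdout.strip()  # path reported by the worker
    return checkpoint


if __name__ == "__main__":
    print(train_in_fresh_processes(3))
```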

Files and folders structure

Files:

Folders:

Submodules:

Additional files (see folder other)