Defending against adversarial policies in YouShallNotPass by adversarial fine-tuning. Policies are trained in an alternating fashion: the adversary is trained for t1 time-steps, then the victim for t2 time-steps, then the adversary again for t3 time-steps, and so on. The training times ti increase exponentially.
Figure: bursts training. Left: opponents ('normal' pre-trained, adversary trained from scratch, victim policy) trained in an alternating way; middle: 'burst' size; right: win rate.
Figure: bursts training. Left: mean reward for the agents; right: value loss for the agents.
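The alternation scheme can be summarized with a short sketch (illustrative only; the function name and default values below are not taken from the repository's code):

```python
# Illustrative sketch of the 'bursts' schedule described above (not the code
# used in the repository): the side being trained alternates between the
# adversary and the victim, and each burst is longer than the previous one.

def burst_schedule(initial_steps=1000, growth=2.0, n_bursts=8):
    """Yield (side_to_train, n_timesteps) pairs with exponentially growing bursts."""
    steps = initial_steps
    for i in range(n_bursts):
        yield ("adversary" if i % 2 == 0 else "victim"), int(steps)
        steps *= growth

for side, steps in burst_schedule():
    print(f"train {side} for {steps} time-steps")
```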
In this repository:
- The YouShallNotPass environment is exported into rllib as a multi-agent environment (a schematic sketch of such an adapter follows this list)
- Training in 'bursts' is implemented: the victim and the adversary are trained against each other, the policy being trained switches every ti time-steps, and ti increases exponentially
- The victim is trained against multiple adversaries as well as the normal opponent ('population-based training')
- Stable Baselines is connected to rllib: samples are collected with rllib and optimization is done with Stable Baselines
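As a schematic illustration of the first point, a two-player environment can be wrapped as an rllib MultiAgentEnv roughly as follows. The class name, agent IDs, and the assumed tuple-based API of the wrapped environment are made up for this sketch; the real adapter is gym_compete_rllib/gym_compete_to_rllib.py.

```python
# Hypothetical sketch of exposing a two-player environment to rllib as a
# MultiAgentEnv; the real adapter is gym_compete_rllib/gym_compete_to_rllib.py.
# The wrapped environment is assumed to take/return per-player tuples.
from ray.rllib.env.multi_agent_env import MultiAgentEnv


class TwoPlayerToRLLib(MultiAgentEnv):
    AGENT_IDS = ("player_1", "player_2")  # assumed agent names

    def __init__(self, make_env):
        self._env = make_env()

    def reset(self):
        observations = self._env.reset()
        return dict(zip(self.AGENT_IDS, observations))

    def step(self, action_dict):
        actions = tuple(action_dict[agent] for agent in self.AGENT_IDS)
        observations, rewards, dones, infos = self._env.step(actions)
        return (
            dict(zip(self.AGENT_IDS, observations)),
            dict(zip(self.AGENT_IDS, rewards)),
            {"__all__": all(dones)},  # rllib requires the '__all__' done flag
            dict(zip(self.AGENT_IDS, infos)),
        )
```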
Very simple: pull a Docker image
- First, pull the image:
$ docker pull humancompatibleai/better-adversarial-defenses
- To run tests (will ask for a MuJoCo license):
$ docker run -it humancompatibleai/better-adversarial-defenses
- To run the terminal:
$ docker run -it humancompatibleai/better-adversarial-defenses /bin/bash
Alternatively, build the Docker image yourself:
- Install Docker and git
- Clone the repository:
$ git clone https://github.com/HumanCompatibleAI/better-adversarial-defenses.git
- Build the Docker image:
$ docker build -t ap_rllib better-adversarial-defenses
- Run tests:
$ docker container run -it ap_rllib
- Run shell:
$ docker container run -it ap_rllib /bin/bash
Full manual installation, assuming Ubuntu Linux or a compatible distribution.
Tested on Ubuntu 18.04.5 LTS and in WSL. A GPU is not required for the project.
The full installation procedure can also be found in the Dockerfile.
- Install miniconda
- Clone the repository with its submodules:
$ git clone --recursive https://github.com/HumanCompatibleAI/better-adversarial-defenses.git
- Create the environments from the files adv-tf1.yml and adv-tf2.yml (tf1 is used for Stable Baselines, and tf2 is used for rllib):
$ conda env create -f adv-tf1.yml
$ conda env create -f adv-tf2.yml
- Install MuJoCo 1.31. On headless setups, also install Xvfb
- Install MongoDB and create a database named chai
- Install gym_compete and aprl via setup.py (included in the repository as submodules):
$ pip install -e multiagent-competition
$ pip install -e adversarial-policies
- With ray 0.8.6 installed, patch your ray installation:
$ python ray/python/ray/setup-dev.py
- Install fonts for rendering:
$ conda install -c conda-forge mscorefonts; mkdir ~/.fonts; cp $CONDA_PREFIX/fonts/*.ttf ~/.fonts; fc-cache -f -v
- Install the project:
$ pip install -e .
- To test the setup with the rllib PPO trainer, run:
(adv-tf2) $ python -m ap_rllib.train --tune test
- The script automatically logs results to Sacred and Tune.
- By default, the script asks which configuration to run, but it can also be set manually with the --tune argument.
- Log files will appear in ~/ray_results/run_type/run_name. Use TensorBoard in this folder.
- Checkpoints will be in ~/ray_results/xxx/checkpoint_n/, where xxx and n are stored in the log files, with one entry for every iteration. See the example notebook or the script for obtaining the last checkpoint for details on how to do that (a minimal sketch is also given at the end of this section).
- Some specific configurations:
  - --tune external_cartpole runs training in InvertedPendulum, using the Stable Baselines PPO implementation.
    - Before running, launch the Stable Baselines server:
      (adv-tf1) $ python -m frankenstein.stable_baselines_server
    - By default, each policy is trained in a separate thread, so that environment data collection resumes as soon as possible. However, this increases the number of threads significantly with PBT and many parallel tune trials.
    - If the number of threads is too high, the --serial option disables multi-threaded training in the Stable Baselines server. The overhead is not significant, as training finishes very quickly compared to data collection.
  - --tune bursts_exp_withnormal_pbt_sb runs training with Stable Baselines + bursts + the normal opponent included + PBT (multiple adversaries).
    - Before running, launch the Stable Baselines server as above.
- --verbose enables some additional output.
- --show_config only shows the configuration and exits.
- --resume restarts trials if there are already trials with this name in the results directory.
  - The notebook tune_pre_restart.ipynb converts ray 0.8.6 checkpoints to ray 1.0.1 checkpoints.
- If you want to iterate quickly with your config (smaller batch size and no remote workers), pass an option to the trainer:
  --config_override='{"train_batch_size": 1000, "sgd_minibatch_size": 1000, "num_workers": 0, "_run_inline": 1}'
- A large number of processes might run into the open-files limit. This might help:
$ ulimit -n 999999
- To make a video:
  - (only on headless setups) start a virtual display:
    $ Xvfb -screen 0 1024x768x24 & export DISPLAY=:0
  - Run:
    (adv-tf2) $ python -m ap_rllib.make_video --checkpoint path/to/checkpoint/checkpoint-xxx --config your-config-at-training --display $DISPLAY
  - --steps n sets the number of steps to run (1 corresponds to 256 steps, which is approximately 1 episode).
  - --load_normal True evaluates against the normal opponent instead of the trained one.
  - --no_video True disables the video; use this to evaluate performance on more episodes faster.
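As mentioned in the checkpoints note above, the most recent checkpoint can be located with a short script. The sketch below is illustrative only and assumes the standard Tune layout (~/ray_results/.../checkpoint_<n>/checkpoint-<n>); the repository's own notebook and script are the reference.

```python
# Minimal sketch for finding the newest checkpoint file under ~/ray_results,
# assuming the standard Tune layout .../checkpoint_<n>/checkpoint-<n>.
import glob
import os
import re


def last_checkpoint(results_dir="~/ray_results"):
    """Return the most recently modified checkpoint file path."""
    pattern = os.path.join(os.path.expanduser(results_dir),
                           "**", "checkpoint_*", "checkpoint-*")
    candidates = [path for path in glob.glob(pattern, recursive=True)
                  if re.search(r"checkpoint-\d+$", path)]  # skip .tune_metadata files
    if not candidates:
        raise FileNotFoundError(f"no checkpoints found under {results_dir}")
    return max(candidates, key=os.path.getmtime)


print(last_checkpoint())
```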
- We use ray because of its multi-agent support, and thus we have to use TensorFlow 2.0.
- We use Stable Baselines for training because we were unable to replicate the results with rllib, even with an independent hyperparameter search.
- We checkpoint the ray trainer and restore it, and run the whole thing in a separate process, to circumvent the ray memory leak issue (a rough sketch of this pattern follows).
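A rough sketch of that last workaround, assuming a hypothetical make_trainer factory that builds an rllib trainer; this is not the repository's actual code, only an illustration of the checkpoint-and-restart-in-a-subprocess pattern.

```python
# Hypothetical sketch of the checkpoint-and-restart-in-a-subprocess pattern:
# each child process trains for a few iterations, saves a checkpoint and exits,
# so any memory leaked inside it is released; the next child restores from the
# checkpoint. `make_trainer` is an assumed factory that builds an rllib trainer.
import multiprocessing as mp


def _train_chunk(make_trainer, checkpoint_in, queue, iterations_per_chunk=10):
    trainer = make_trainer()
    if checkpoint_in is not None:
        trainer.restore(checkpoint_in)
    for _ in range(iterations_per_chunk):
        trainer.train()
    queue.put(trainer.save())  # send back the path of the new checkpoint


def train_in_chunks(make_trainer, n_chunks=100):
    checkpoint = None
    for _ in range(n_chunks):
        queue = mp.Queue()
        worker = mp.Process(target=_train_chunk,
                            args=(make_trainer, checkpoint, queue))
        worker.start()
        checkpoint = queue.get()  # blocks until the chunk finishes
        worker.join()             # the child exits, freeing leaked memory
    return checkpoint
```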
Files:
- ap_rllib/train.py: the main training script
- ap_rllib/config.py: configurations for the training script
- ap_rllib/helpers.py: helper functions for the whole project
- ap_rllib/make_video.py: creates videos of the policies
- frankenstein/remote_trainer.py: implements an RLLib trainer that pickles the data and sends the filename via HTTP
- frankenstein/stable_baselines_server.py: implements an HTTP server that waits for weights and samples, then trains the policy and returns the updated weights
- frankenstein/stable_baselines_external_data.py: implements the 'fake' Runner that allows training with the Stable Baselines ppo2 algorithm on existing data
- gym_compete_rllib/gym_compete_to_rllib.py: implements the adapter from multicomp to rllib environments, and the rllib policy that loads pre-trained weights from multicomp
- gym_compete_rllib/load_gym_compete_policy.py: loads the multicomp weights into a Keras policy
- gym_compete_rllib/layers.py: implements the observation/value function normalization code from MlpPolicyValue (multiagent-competition/gym_compete/policy.py)
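To make the frankenstein split more concrete, the sketch below shows the general shape of a weights-and-samples exchange over HTTP using only the standard library. The endpoint, request/response fields, and the update_policy placeholder are invented for this illustration and do not reflect the repository's actual protocol.

```python
# Schematic sketch of a weights-and-samples exchange over HTTP (stdlib only);
# the real protocol lives in frankenstein/remote_trainer.py and
# frankenstein/stable_baselines_server.py. Request/response fields are invented.
import json
import pickle
from http.server import BaseHTTPRequestHandler, HTTPServer


def update_policy(weights, samples):
    """Placeholder for an optimizer step (e.g. a ppo2 update) on the samples."""
    return weights


class TrainHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        request = json.loads(self.rfile.read(length))
        with open(request["data_path"], "rb") as f:  # pickled weights + samples
            payload = pickle.load(f)
        new_weights = update_policy(payload["weights"], payload["samples"])
        out_path = request["data_path"] + ".updated"
        with open(out_path, "wb") as f:
            pickle.dump(new_weights, f)
        body = json.dumps({"weights_path": out_path}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("localhost", 8000), TrainHandler).serve_forever()
```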
Folders:
- ap_rllib_experiment_analysis/notebooks: notebooks that analyze runs
- ap_rllib_experiment_analysis: scripts that help with analyzing runs
- frankenstein: the code for integrating Stable Baselines and RLLib
- gym_compete_rllib: connects rllib to the multicomp environment
Submodules:
- adversarial-policies: the original project by Adam Gleave
- multiagent-competition: the environments used in the original project, as well as saved weights
- ray: a copy of the ray repository with patches to make the project work
Other files and folders:
- memory_profile, oom_dummy: files and data used to analyze the memory leak
- rock_paper_scissors: sketch implementations of ideas on the Rock-Paper-Scissors game
- tf_agents_ysp.py: training in YouShallNotPass with tf-agents
- rlpyt_run.py: training in YouShallNotPass with rlpyt
- rs.ipynb: random search with a constant-output policy in YouShallNotPass
- evolve.ipynb and evolve.py: training in YouShallNotPass with neat-python