/fast-marl

FAST iteration of MARL research ideas: A starting point for Multi-Agent Reinforcement Learning


Algorithm implementations with an emphasis on FAST iteration of MARL research ideas. The algorithms are self-contained, and the implementations focus on simplicity and speed.

All algorithms are implemented in PyTorch and use the Gym interface.

Getting Started

Installation

We strongly suggest you use a virtual environment for the instructions below. A good starting point is Miniconda. Then, clone and install the repository using:

git clone https://github.com/semitable/fast-marl.git
cd fast-marl
pip install -r requirements.txt
pip install -e .

Running an algorithm

This project uses Hydra to structure its configuration. Algorithm implementations can be found under fastmarl/. The respective configs are found in fastmarl/configs/algorithm/.

You first need an environment that is registered in OpenAI's Gym. This repository uses the Gym API, with the only difference being that the rewards are a tuple (one for each agent).
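As a rough illustration of that interface, the toy environment below follows the usual reset/step convention but returns one reward per agent. TwoAgentToyEnv and its dynamics are hypothetical and not part of this repository or of Gym; treating done as a single episode-level flag is also an assumption.

```python
import random


class TwoAgentToyEnv:
    """Hypothetical two-agent environment, used only to illustrate the interface."""

    n_agents = 2

    def __init__(self, time_limit=25):
        self.time_limit = time_limit
        self._t = 0

    def reset(self):
        self._t = 0
        # One observation per agent.
        return tuple(0.0 for _ in range(self.n_agents))

    def step(self, actions):
        assert len(actions) == self.n_agents
        self._t += 1
        obs = tuple(float(self._t) for _ in range(self.n_agents))
        # Rewards are a tuple with one entry per agent (the difference from plain Gym).
        rewards = tuple(float(a == 1) for a in actions)
        done = self._t >= self.time_limit  # assumed: a single episode-level flag
        return obs, rewards, done, {}


def run_episode(env, seed=0):
    """Roll out random actions and accumulate the return of each agent."""
    rng = random.Random(seed)
    env.reset()
    returns = [0.0] * env.n_agents
    done = False
    while not done:
        actions = [rng.randint(0, 1) for _ in range(env.n_agents)]
        _, rewards, done, _ = env.step(actions)
        returns = [g + r for g, r in zip(returns, rewards)]
    return returns
```

Real environments such as Level-based Foraging define their own observations and rewards; only the per-agent shape of the reward is the point here.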

A good starting point would be Level-based Foraging and RWARE. You can install both using:

pip install -U lbforaging rware

Then, running an algorithm (e.g. A2C) looks like:

cd fastmarl
python run.py +algorithm=ac env.name="lbforaging:Foraging-8x8-2p-3f-v2" env.time_limit=25

Similarly, running DQN can be done using:

python run.py +algorithm=dqn env.name="lbforaging:Foraging-8x8-2p-3f-v2" env.time_limit=25

Overriding hyperparameters is easy and can be done in the command line. An example of overriding the batch_size in DQN:

python run.py +algorithm=dqn env.name="lbforaging:Foraging-8x8-2p-3f-v2" env.time_limit=25 algorithm.batch_size=256

Find other hyperparameters in the files under fastmarl/configs/algorithm.

(Optional) Use Hydra's tab completion

Hydra also supports tab completion for filling in the hyperparameters. Install it for bash using the command below, or see the Hydra documentation for other shells (zsh or fish).

eval "$(python run.py -sc install=bash)"

Running a hyperparameter search

A hyperparameter search can be run using Hydra's multirun option. An example of sweeping over batch sizes is:

python run.py -m +algorithm=dqn env.name="lbforaging:Foraging-8x8-2p-3f-v2" env.time_limit=25 algorithm.batch_size=64,128,256
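Comma-separated values in a multirun are expanded into one job per combination. The sketch below mimics that expansion in plain Python for illustration only; expand_sweep is a hypothetical helper, not Hydra's implementation, and it ignores Hydra's richer sweep syntax (for example list-valued overrides).

```python
from itertools import product


def expand_sweep(overrides):
    """Expand 'key=a,b,c' overrides into one list of overrides per job."""
    keys, choices = [], []
    for override in overrides:
        key, _, value = override.partition("=")
        keys.append(key)
        choices.append(value.split(","))
    # Cartesian product over the choices of every key.
    return [
        [f"{key}={value}" for key, value in zip(keys, combo)]
        for combo in product(*choices)
    ]


jobs = expand_sweep(["algorithm.batch_size=64,128,256", "env.time_limit=25"])
for job in jobs:
    print(" ".join(job))
```

Sweeping over a second hyperparameter multiplies the number of jobs, so the grid grows quickly.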

An advanced hyperparameter search using search.py

This section might get deprecated in the future if Hydra implements this feature.

We include a script named search.py which reads a search configuration file (e.g. the included configs/sweeps/dqn.lbf.yaml) and runs a hyperparameter search in one or more tasks. The script can be run using

python search.py run --config configs/sweeps/dqn.lbf.yaml --seeds 5 locally

In a cluster environment where one run should go to a single process, it can also be called in a batch script like:

python search.py run --config configs/sweeps/dqn.lbf.yaml --seeds 5 single $TASK_ID

Here, $TASK_ID is an index for the experiment (i.e. 1, ..., number of experiments).
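How search.py orders its experiments is not described here. Purely as an illustration of the idea, the sketch below decodes a 1-based task index into a (configuration, seed) pair, assuming the experiments are enumerated configuration by configuration. decode_task_id and that ordering are assumptions, not the script's actual behaviour.

```python
def decode_task_id(task_id, n_configs, n_seeds):
    """Map a 1-based task index to (config_index, seed), both 0-based."""
    total = n_configs * n_seeds
    if not 1 <= task_id <= total:
        raise ValueError(f"task_id must be in 1..{total}, got {task_id}")
    # Assumed ordering: all seeds of config 0, then all seeds of config 1, ...
    return divmod(task_id - 1, n_seeds)
```

With 3 configurations and 5 seeds there would be 15 experiments, so a batch scheduler would launch task indices 1 to 15.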

Logging

We implement two loggers: FileSystem Logger and WandB Logger.

File System Logger

The default logger is the FileSystemLogger, which saves experiment results in a results.csv file. You can find that file, the configuration that was used, and more under outputs/{env_name}/{alg_name}/{random_hash}, or under multirun/{date}/{time}/{experiment_id} for multiruns.

WandB Logger

By appending logger=wandb to the command line you can log to WandB. Do not forget to run wandb login first.

Example:

python run.py +algorithm=dqn env.name="lbforaging:Foraging-8x8-2p-3f-v2" env.time_limit=25 logger=wandb

You can override the project name using:

python run.py +algorithm=dqn env.name="lbforaging:Foraging-8x8-2p-3f-v2" env.time_limit=25 logger=wandb logger.project_name="my-project-name"

Implementing your own algorithm/ideas

The fastest way would be to create a new folder starting from the algorithm of your choice e.g.

cp -R ac ac_new_idea

and create a new configuration file:

cp configs/algorithm/ac.yaml configs/algorithm/ac_new_idea.yaml

With the editor of your choice, open ac_new_idea.yaml and change

...
algorithm:
  _target_: ac.train.main
  name: "ac"
  model:
    _target_: ac.model.Policy
...

to

...
algorithm:
  _target_: ac_new_idea.train.main
  name: "ac_new_idea"
  model:
    _target_: ac_new_idea.model.Policy
...

Make any changes you want to the files under ac_new_idea/ and run it using:

python run.py +algorithm=ac_new_idea env.name="lbforaging:Foraging-8x8-2p-3f-v2" env.time_limit=25
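The _target_ entries are dotted import paths, which is why copying the folder requires updating them. As a simplified illustration (not Hydra's code), a dotted path can be resolved to a Python object as follows. resolve_target is a hypothetical helper, and the standard-library math.sqrt stands in for a path such as ac_new_idea.train.main.

```python
import importlib


def resolve_target(path):
    """Import the module part of a dotted path and return the named attribute."""
    module_name, _, attribute = path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


sqrt = resolve_target("math.sqrt")
print(sqrt(16.0))
```

If a _target_ still points at ac.train.main, the original implementation is the one that gets loaded, so stale paths are worth checking first when a change appears to have no effect.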

You can now add new hyperparameters, change the training procedure, or anything else you want, and keep the old implementations for easy comparison. We hope that the way we have implemented these algorithms makes it easy to change any part of an algorithm without the hassle of reading through large code-bases and unnecessary layers of abstraction. RL research benefits from iterating over ideas quickly to see how they perform!

Interpreting your results

We have multiple tools to analyze the outputs of FileSystemLogger (for WandBLogger, just log in to their webpage). First, export the data of multiple runs using:

python utils/postprocessing/export_multirun.py --folder folder/containing/results --export-file myfile.hd5

The file will contain two pandas DataFrames: df, which contains all mean_episode_returns (by default summed across all agents), and configs, which contains information about the tested hyperparameters.
You can load both through Python using:

import pandas as pd

exported_file = "myfile.hd5"  # the file written by export_multirun.py
df = pd.read_hdf(exported_file, "df")
configs = pd.read_hdf(exported_file, "configs")

The imported DataFrames look like the ones below. df has a multi-index column indexing the environment name, the algorithm name, a hash unique to the parameter search, and a seed. configs maps the hash to the full configuration of the run.

In [1]: df
Out[1]: 
                       Foraging-20x20-9p-6f-v2             ...                       
                                       Algo1               ...     Algo2             
                                   f7c2ecb3ddf1            ... 5284ad99ce02          
                                         seed=0    seed=1  ...       seed=0    seed=1
environment_steps                                          ...                       
0                                      0.178373  0.000000  ...     0.089167  0.054286
100000                                 0.026786  0.066667  ...     0.054545  0.033333
200000                                 0.130278  0.084650  ...     0.043333  0.055833
300000                                 0.086111  0.109975  ...     0.182626  0.116768
...

In [2]: configs
Out[2]: 
             algorithm.name  algorithm.lr  algorithm.batch_size
f7c2ecb3ddf1       DQN-FuPS        0.0001                   256
ecaf120f572e       DQN-SePS        0.0001                   128
5a80fe220cfc       DQN-SePS        0.0003                   128
d16939a558b6       DQN-FuPS        0.0003                   256
...
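Because the seed is the innermost column level, averaging across seeds amounts to grouping the columns by the remaining levels. The sketch below does this on a small synthetic DataFrame. The level names (env, algorithm, hash, seed), the hashes and all values are made up for illustration; the exported file may name its levels differently.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the exported df: 2 hashes x 2 seeds, 3 checkpoints.
columns = pd.MultiIndex.from_product(
    [["Foraging-8x8-2p-3f-v2"], ["DQN"], ["hash_a", "hash_b"], ["seed=0", "seed=1"]],
    names=["env", "algorithm", "hash", "seed"],  # assumed level names
)
index = pd.Index([0, 100000, 200000], name="environment_steps")
df = pd.DataFrame(np.arange(12, dtype=float).reshape(3, 4), index=index, columns=columns)

# Average over seeds: transpose, group by every level except the seed, transpose back.
mean_over_seeds = df.T.groupby(level=["env", "algorithm", "hash"]).mean().T

# Configuration with the highest mean return at the last checkpoint.
final = mean_over_seeds.iloc[-1]
best = final.idxmax()
print(best, final.max())
```

The hash in best can then be looked up in configs to recover the hyperparameters of that run.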

You can easily find the best hyperparameter configuration per environment/algorithm using:

python utils/postprocessing/find_best_hyperparams.py --exported-file myfile.hd5

You can plot the best runs (average/std across seeds) using:

python utils/postprocessing/plot_best_runs.py --exported-file myfile.hd5

Finally, you can use HiPlot to interactively visualize the performance of various hyperparameter configurations using:

pip install -U hiplot
hiplot fastmarl.utils.postprocessing.hiplot_fetcher.experiment_fetcher

You will have to enter exp://myfile.hd5/env_name/alg_name in the browser's textbox.

Implemented Algorithms

                             A2C   DQN (Double Q)
Parameter Sharing            ✔️    ✔️
Selective Parameter Sharing  ✔️    ✔️
Centralized Critic           ✔️
Value Decomposition                ✔️
Return Standardization       ✔️    ✔️
Target Networks              ✔️    ✔️

Parameter Sharing

Parameter sharing across agents is optional and is handled behind the scenes in the torch model. There are three types of parameter sharing:

  • No Parameter Sharing (default)
  • Full Parameter Sharing
  • Selective Parameter Sharing (Christianos et al.)

In DQN you can enable any of these using:

python run.py +algorithm=dqn env.name="lbforaging:Foraging-8x8-4p-3f-v2" env.time_limit=25 algorithm.model.critic.parameter_sharing=False
python run.py +algorithm=dqn env.name="lbforaging:Foraging-8x8-4p-3f-v2" env.time_limit=25 algorithm.model.critic.parameter_sharing=True
python run.py +algorithm=dqn env.name="lbforaging:Foraging-8x8-4p-3f-v2" env.time_limit=25 algorithm.model.critic.parameter_sharing=[0,0,1,1]

for each of the methods respectively. For Selective Parameter Sharing, you need to supply a list of indices, one per agent, pointing to the network that each agent will use. Example: [0,0,1,1] as above makes agents 0 and 1 share network 0, and agents 2 and 3 share network 1. Similarly, [0,1,1,1] would make the first agent not share parameters with anyone, while the other three share parameters.
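The three options can be thought of as different agent-to-network index lists. The sketch below shows that mapping in plain Python; agent_to_network and build_networks are hypothetical helpers and not the repository's torch model, and make_network stands in for whatever constructs a network.

```python
def agent_to_network(parameter_sharing, n_agents):
    """Return, for each agent, the index of the network it uses."""
    if parameter_sharing is True:
        return [0] * n_agents          # full sharing: one network for everyone
    if parameter_sharing is False:
        return list(range(n_agents))   # no sharing: one network per agent
    indices = list(parameter_sharing)  # selective sharing: explicit indices
    if len(indices) != n_agents:
        raise ValueError("need exactly one network index per agent")
    return indices


def build_networks(parameter_sharing, n_agents, make_network):
    """Create each distinct network once and give every agent a reference to its own."""
    indices = agent_to_network(parameter_sharing, n_agents)
    networks = {i: make_network() for i in sorted(set(indices))}
    return [networks[i] for i in indices]


# dict() is used as a stand-in network so that object identity can be checked.
per_agent = build_networks([0, 0, 1, 1], n_agents=4, make_network=dict)
print(per_agent[0] is per_agent[1], per_agent[1] is per_agent[2])  # True False
```

Agents that share an index hold the very same object, so any update to it affects all of them.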

In Actor-Critic methods you need to separately define parameter sharing for the Actor and the Critic. The respective configs are algorithm.model.actor.parameter_sharing=... and algorithm.model.critic.parameter_sharing=...

Value Decomposition

We have implemented VDN on top of the DQN algorithm. To use it, you only have to load the respective algorithm config:

python run.py +algorithm=vdn env.name="lbforaging:Foraging-8x8-4p-3f-v2" env.time_limit=25

Note that for this to work we use the CooperativeReward wrapper that sums the rewards of all agents before feeding them to the training algorithm. If you have an environment that already has a cooperative reward, you still need it to return a list of rewards (e.g. reward = n_agents * [reward/n_agents]).
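The two reward conventions above fit together as in the sketch below. Both functions are hypothetical illustrations, not the repository's CooperativeReward wrapper; in particular, handing the summed reward back to every agent is an assumption about how a summed reward could be exposed.

```python
def cooperative_rewards(rewards):
    """Give every agent the team reward, i.e. the sum of all individual rewards."""
    team_reward = sum(rewards)
    return [team_reward] * len(rewards)


def split_shared_reward(reward, n_agents):
    """Turn a single shared reward into a per-agent list whose sum is unchanged."""
    return n_agents * [reward / n_agents]


print(cooperative_rewards((1.0, 0.0, 2.0)))
print(split_shared_reward(1.0, 4))
```

Splitting a shared reward evenly means that summing over agents recovers the original value, which is why an already cooperative environment can return reward/n_agents for each agent.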

Contact

Filippos Christianos - f.christianos {at} ed {dot} ac {dot} uk

Project Link: https://github.com/semitable/fast-marl