PyMARL2 (Updates)

New Features

Parallel Info Runner. Like the Parallel Runner, but it also collects info from the environment. There are two ways to send info from the environment. For pre-transition info (for example, an adjacency matrix needed by a model), implement the environment's get_info() function, which is called as info = get_info(). For post-transition info (for example, useful metrics), return it in the info output of step().
Info Mac. A modified version of n_mac which provides info to the model.
Callbacks. The user can implement their own callback class. So far, the only method is metrics(), which is called in the learner at an interval of args.learner_log_interval, but more methods can be added in the future. To use a callback, implement a custom class that inherits from Callback, and set callback=custom_callback in the config.
Sparse Support. It is now possible to include torch_geometric.data.Data objects in the environment's info dict. These sparse data objects are combined (over environments and timesteps) in the background, and can be accessed in the model or metrics callback, like any other info.
QGNN. A GNN-based value factorisation method.
Environments. Introduced two new environments: The Estimate Game, and the Set Partitioning Problem.
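
As a rough illustration of the Parallel Info Runner and Callbacks entries above, the sketch below shows a toy environment that exposes pre-transition info through get_info() and post-transition info through step(), plus a callback with a metrics() method. ToyEnv, the contents of the adjacency matrix, the (reward, terminated, info) layout of step(), the arguments of metrics(), and the stand-in Callback base class are all assumptions made for this example; the real base classes and signatures are defined in the repository and may differ.

```python
# Illustrative sketch only: class names, signatures and return layouts are
# assumptions, not the repository's actual API.
import random


class ToyEnv:
    """Hypothetical environment with n_agents agents placed on a line."""

    def __init__(self, n_agents=3, seed=0):
        self.n_agents = n_agents
        self.rng = random.Random(seed)
        self.positions = [self.rng.random() for _ in range(n_agents)]

    def get_info(self):
        # Pre-transition info: an adjacency matrix a graph-based model could use.
        # Two distinct agents are connected if they are closer than 0.5.
        adj = [
            [int(i != j and abs(pi - pj) < 0.5) for j, pj in enumerate(self.positions)]
            for i, pi in enumerate(self.positions)
        ]
        return {"adj": adj}

    def step(self, actions):
        # Move each agent by its action, then report a post-transition metric.
        self.positions = [p + a for p, a in zip(self.positions, actions)]
        spread = max(self.positions) - min(self.positions)
        reward = -spread
        terminated = False
        info = {"spread": spread}  # post-transition info
        return reward, terminated, info


class Callback:
    """Stand-in for the repository's Callback base class (assumed interface)."""

    def metrics(self, info):
        return {}


class SpreadCallback(Callback):
    def metrics(self, info):
        # Reduce the collected info to a scalar that could be logged.
        return {"mean_spread": sum(info["spread"]) / len(info["spread"])}


env = ToyEnv()
pre = env.get_info()
_, _, post = env.step([0.0] * env.n_agents)
logged = SpreadCallback().metrics({"spread": [post["spread"]]})
```

In this sketch the runner's job would be to gather the dictionaries returned by get_info() and step() across environments and timesteps, and hand the collected values to the model or to metrics().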

How to Run

python experiments.py --env [env_name] --config [method_name]

Use experiments.py to run any environment with any method. For example, to run qgnn on the set partitioning problem, run python experiments.py --env set --config qgnn. To specify more parameters, add arguments --params1, --params2, etc. For example, to run qmix on the 1o_10b_vs_1r StarCraft environment, run python experiments.py --env sc2 --config qmix --params1 env_args.map_name=1o_10b_vs_1r. If multiple values are given after --paramsi, then multiple runs will be executed sequentially, one for each combination of values. For example, the command python experiments.py --env estimate --config qgnn --params1 agent=qgnn agent=n_rnn --params2 mixer=qmix mixer=qgnn will execute 4 runs with architectures (agent=qgnn, mixer=qmix), (agent=qgnn, mixer=qgnn), (agent=n_rnn, mixer=qmix), and (agent=n_rnn, mixer=qgnn).
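
The way several --paramsi groups combine into runs can be pictured as a Cartesian product over the groups. The sketch below is only an illustration of that idea, not the code in experiments.py; the function name expand_runs and the dictionary representation of a run are assumptions.

```python
# Sketch of how --params1/--params2 values could expand into runs (assumed behaviour).
from itertools import product


def expand_runs(*param_groups):
    """Each group holds 'key=value' strings; return one override dict per combination."""
    runs = []
    for combo in product(*param_groups):
        overrides = dict(item.split("=", 1) for item in combo)
        runs.append(overrides)
    return runs


params1 = ["agent=qgnn", "agent=n_rnn"]
params2 = ["mixer=qmix", "mixer=qgnn"]
runs = expand_runs(params1, params2)
for run in runs:
    print(run)
```

With two values in each of two groups this yields four override sets, in the same order as the four runs listed above.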

QGNN

For more information, see our paper QGNN: Value Function Factorisation with Graph Neural Networks

If you use QGNN in your research, please cite:

@misc{kortvelesy2022qgnn,
      title={QGNN: Value Function Factorisation with Graph Neural Networks}, 
      author={Ryan Kortvelesy and Amanda Prorok},
      year={2022},
      eprint={2205.13005},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

PyMARL2 (Original Documentation)

Open-source code for Rethinking the Implementation Tricks and Monotonicity Constraint in Cooperative Multi-Agent Reinforcement Learning.

This repository is fine-tuned for StarCraft Multi-agent Challenge (SMAC). For other multi-agent tasks, we also recommend an optimized implementation of QMIX: https://github.com/marlbenchmark/off-policy.

StarCraft 2 version: SC2.4.10. Difficulty: 7.

2021.10.28 update: added Google Football environments (vdn_gfootball.yaml), using simple115 features.

2021.10.4 update: added QMIX with attention (qmix_att.yaml) as a baseline for communication tasks.

Finetuned-QMIX

There are many code-level tricks in multi-agent reinforcement learning (MARL), such as:

Value function clipping (clip max Q values for QMIX)
Value Normalization
Reward scaling
Orthogonal initialization and layer scaling
Adam
Neural network hidden size
Learning rate annealing
Reward Clipping
Observation Normalization
Gradient Clipping
Large Batch Size
N-step Returns (including GAE($\lambda$), Q($\lambda$), ...)
Rollout Process Number
$\epsilon$-greedy annealing steps
Death Agent Masking
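
To make one of the listed tricks concrete, here is a minimal sketch of the $\lambda$-return that underlies Q($\lambda$)-style targets, written in plain Python. It is not the repository's implementation: the function name lambda_returns, the argument layout, and the single-segment setting without termination handling are assumptions for illustration. It uses the standard recursion $G_t = r_t + \gamma[(1-\lambda) V_{t+1} + \lambda G_{t+1}]$, bootstrapped from the final value estimate.

```python
def lambda_returns(rewards, next_values, gamma=0.99, lam=0.6):
    """Compute lambda-returns for one trajectory segment.

    rewards[t]     : reward received at step t
    next_values[t] : value estimate of the state reached after step t
    """
    T = len(rewards)
    returns = [0.0] * T
    g = next_values[-1]  # bootstrap from the last value estimate
    for t in reversed(range(T)):
        # Blend the one-step bootstrap with the longer return from step t+1.
        g = rewards[t] + gamma * ((1 - lam) * next_values[t] + lam * g)
        returns[t] = g
    return returns


one_step = lambda_returns([1.0, 1.0, 1.0], [0.5, 0.5, 0.5], gamma=1.0, lam=0.0)
full = lambda_returns([1.0, 1.0, 1.0], [0.5, 0.5, 0.5], gamma=1.0, lam=1.0)
```

With lam=0 every target is the one-step TD target, and with lam=1 each target is the sum of the remaining rewards plus the final bootstrap value; intermediate values interpolate between the two.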

Related Works

Implementation Matters in Deep RL: A Case Study on PPO and TRPO
What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study
The Surprising Effectiveness of MAPPO in Cooperative, Multi-Agent Games

Using a few of the tricks above (bold text), we enabled QMIX (qmix.yaml) to solve almost all hard scenarios of SMAC (with hyperparameters fine-tuned for each scenario).

Scenarios	Difficulty	QMIX (batch_size=128)	Finetuned-QMIX
8m	Easy	-	100%
2c_vs_1sc	Easy	-	100%
2s3z	Easy	-	100%
1c3s5z	Easy	-	100%
3s5z	Easy	-	100%
8m_vs_9m	Hard	84%	100%
5m_vs_6m	Hard	84%	90%
3s_vs_5z	Hard	96%	100%
bane_vs_bane	Hard	100%	100%
2c_vs_64zg	Hard	100%	100%
corridor	Super Hard	0%	100%
MMM2	Super Hard	98%	100%
3s5z_vs_3s6z	Super Hard	3%	93% (hidden_size = 256, qmix_large.yaml)
27m_vs_30m	Super Hard	56%	100%
6h_vs_8z	Super Hard	0%	93% ($\lambda$ = 0.3)

Re-Evaluation

Afterwards, we re-evaluated numerous QMIX variants with the tricks normalized (a general set of hyperparameters), and found that QMIX achieves SOTA performance.

Scenarios	Difficulty	Value-based					Policy-based
		QMIX	VDNs	Qatten	QPLEX	WQMIX	LICA	VMIX	DOP	RIIT
2c_vs_64zg	Hard	100%	100%	100%	100%	100%	100%	98%	84%	100%
8m_vs_9m	Hard	100%	100%	100%	95%	95%	48%	75%	96%	95%
3s_vs_5z	Hard	100%	100%	100%	100%	100%	96%	96%	100%	96%
5m_vs_6m	Hard	90%	90%	90%	90%	90%	53%	9%	63%	67%
3s5z_vs_3s6z	S-Hard	75%	43%	62%	68%	56%	0%	56%	0%	75%
corridor	S-Hard	100%	98%	100%	96%	96%	0%	0%	0%	100%
6h_vs_8z	S-Hard	84%	87%	82%	78%	75%	4%	80%	0%	19%
MMM2	S-Hard	100%	96%	100%	100%	96%	0%	70%	3%	100%
27m_vs_30m	S-Hard	100%	100%	100%	100%	100%	9%	93%	0%	93%
Discrete PP	-	40	39	-	39	39	30	39	38	38
Avg. Score	Hard+	94.9%	91.2%	92.7%	92.5%	90.5%	29.2%	67.4%	44.1%	84.0%

Communication

We also tested our QMIX-with-attention (qmix_att.yaml, $\lambda=0.3$, attention_heads=4) on some maps (from NDQ) that require communication.

Scenarios (2M steps)	Difficulty	Finetuned-QMIX (No Communication)	QMIX-with-attention (Communication)
1o_10b_vs_1r	-	56%	87%
1o_2r_vs_4r	-	50%	95%
bane_vs_hM	-	0%	0%

Google Football

We also tested VDN (vdn_gfootball.yaml) on some maps from Google Football. Specifically, we use simple115 features to train the model (the original Google Football paper uses complex CNN features). We did not test QMIX because this environment does not provide global state information.

Scenarios	Difficulty	VDN ($\lambda=1.0$)
academy_counterattack_hard	-	0.71 (Test Score)
academy_counterattack_easy	-	0.87 (Test Score)

Usage

PyMARL is WhiRL's framework for deep multi-agent reinforcement learning and includes implementations of the following algorithms:

Value-based Methods:

Actor Critic Methods:

Installation instructions

Install Python packages

# require Anaconda 3 or Miniconda 3
bash install_dependecies.sh

Set up StarCraft II (2.4.10) and SMAC:

bash install_sc2.sh

This will download SC2.4.10 into the 3rdparty folder and copy over the maps necessary to run.

Set up Google Football:

bash install_gfootball.sh

Command Line Tool

Run an experiment

# For SMAC
python3 src/main.py --config=qmix --env-config=sc2 with env_args.map_name=corridor

# For Difficulty-Enhanced Predator-Prey
python3 src/main.py --config=qmix_predator_prey --env-config=stag_hunt with env_args.map_name=stag_hunt

# For Communication tasks
python3 src/main.py --config=qmix_att --env-config=sc2 with env_args.map_name=1o_10b_vs_1r

# For Google Football (Insufficient testing)
# map_name: academy_counterattack_easy, academy_counterattack_hard, five_vs_five...
python3 src/main.py --config=vdn_gfootball --env-config=gfootball with env_args.map_name=academy_counterattack_hard env_args.num_agents=4

The config files act as defaults for an algorithm or environment.

They are all located in src/config. --config refers to the config files in src/config/algs, and --env-config refers to the config files in src/config/envs.
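
The relationship between default config files and command-line overrides such as env_args.map_name=corridor can be pictured as a nested dictionary update. The sketch below is only an illustration: the helper name apply_override and the default values are hypothetical, and the actual loading logic in src/main.py may differ (for example in how values are typed).

```python
import copy


def apply_override(config, assignment):
    """Apply one 'dotted.key=value' override to a nested config dict (in place)."""
    dotted_key, value = assignment.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        # Walk down the nesting, creating intermediate dicts if needed.
        node = node.setdefault(name, {})
    node[leaf] = value  # kept as a string in this sketch
    return config


# Hypothetical defaults standing in for the algorithm and environment config files.
defaults = {"name": "qmix", "env_args": {"map_name": "3m", "difficulty": "7"}}

config = copy.deepcopy(defaults)
apply_override(config, "env_args.map_name=corridor")
print(config["env_args"])
```

Only the overridden key changes; every other default is left as loaded.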

Run n parallel experiments

# bash run.sh config_name env_config_name map_name_list (arg_list threads_num gpu_list experiments_num)
bash run.sh qmix sc2 6h_vs_8z epsilon_anneal_time=500000,td_lambda=0.3 2 0 5

Items in each xxx_list argument are separated by commas.

All results will be stored in the Results folder and named with map_name.

Kill all training processes

# all python and game processes of current user will quit.
bash clean.sh

Citation

@article{hu2021rethinking,
      title={Rethinking the Implementation Tricks and Monotonicity Constraint in Cooperative Multi-Agent Reinforcement Learning}, 
      author={Jian Hu and Siyang Jiang and Seth Austin Harding and Haibin Wu and Shih-wei Liao},
      year={2021},
      eprint={2102.03479},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
