/deep-marl-toolkit

MARLToolkit: The Multi-Agent Reinforcement Learning Toolkit. Includes implementations of MAPPO, MADDPG, QMIX, VDN, COMA, IPPO, QTRAN, MAT...

Primary language: Python. License: Apache-2.0.

MARLToolkit: The Multi-Agent Reinforcement Learning Toolkit

MARLToolkit is a multi-agent reinforcement learning (MARL) toolkit based on PyTorch. It gives the MARL research community a unified platform for developing and evaluating new ideas in a variety of multi-agent environments. MARLToolkit has four core features:

  • It collects most of the existing MARL algorithms widely acknowledged by the community and unifies them under one framework.
  • It lets different multi-agent environments interact with the agents through the same interface.
  • It is designed for high efficiency in both training and sampling.
  • It provides trained results, including learning curves and pretrained models for each task-algorithm combination, with fine-tuned hyperparameters to support credibility.

Overview

We collected most of the existing multi-agent environments and multi-agent reinforcement learning algorithms and unified them under one framework based on PyTorch to boost MARL research.

The implemented MARL baselines cover independent learning (IQL, A2C, DDPG, TRPO, PPO), centralized critic learning (COMA, MADDPG, MAPPO, HATRPO), and value decomposition (QMIX, VDN, FACMAC, VDA2C).

Popular environments like SMAC, MaMujoco, and Google Research Football are provided with a unified interface.

The algorithm code and environment code are fully separated. Changing the environment requires no modification on the algorithm side, and vice versa.
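The separation described above can be illustrated with a minimal sketch. Every name here (`MultiAgentEnv`, `CoinMatchEnv`, `RandomPolicy`, `rollout`) is hypothetical and chosen for illustration; this is not MARLToolkit's actual API, only one way a shared `reset`/`step` contract keeps algorithm code independent of environment code.

```python
from abc import ABC, abstractmethod
import random


class MultiAgentEnv(ABC):
    """Hypothetical common interface; not MARLToolkit's actual base class."""

    @abstractmethod
    def reset(self):
        """Return a dict mapping agent id -> initial observation."""

    @abstractmethod
    def step(self, actions):
        """Take a dict agent id -> action; return (obs, rewards, done)."""


class CoinMatchEnv(MultiAgentEnv):
    """Toy cooperative task: both agents are rewarded when their actions match."""

    agents = ("agent_0", "agent_1")

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return {agent: 0 for agent in self.agents}

    def step(self, actions):
        self.t += 1
        # Shared reward of 1.0 when every agent picked the same action.
        reward = 1.0 if len(set(actions.values())) == 1 else 0.0
        obs = {agent: self.t for agent in self.agents}
        rewards = {agent: reward for agent in self.agents}
        return obs, rewards, self.t >= self.episode_length


class RandomPolicy:
    """Stand-in for an algorithm: it sees observations, never the env class."""

    def __init__(self, n_actions, seed=0):
        self.n_actions = n_actions
        self.rng = random.Random(seed)

    def act(self, obs):
        return {agent: self.rng.randrange(self.n_actions) for agent in obs}


def rollout(env, policy):
    """Run one episode through the shared interface; return the summed reward."""
    obs = env.reset()
    done, total = False, 0.0
    while not done:
        obs, rewards, done = env.step(policy.act(obs))
        total += sum(rewards.values())
    return total
```

Under this sketch, replacing `CoinMatchEnv` with any other class that implements `reset` and `step` leaves `RandomPolicy` and `rollout` unchanged, which is the property the paragraph above describes.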

| Benchmark | Learning Mode | Available Env | Algorithm Type | Algorithm Number | Continuous Control | Asynchronous Interaction | Distributed Training | Framework |
|---|---|---|---|---|---|---|---|---|
| PyMARL | CP | 1 | VD | 5 | | | | * |
| PyMARL2 | CP | 1 | VD | 12 | | | | PyMARL |
| off-policy | CP | 4 | IL+VD+CC | 4 | | | | off-policy |
| on-policy | CP | 4 | IL+VD+CC | 1 | | | | on-policy |
| MARL-Algorithms | CP | 1 | VD+Comm | 9 | | | | * |
| EPyMARL | CP | 4 | IL+VD+CC | 10 | | | | PyMARL |
| Marlbenchmark | CP+CL | 4 | VD+CC | 5 | ✔️ | | | pytorch-a2c-ppo-acktr-gail |
| MAlib | SP | 8 | SP | 9 | ✔️ | | | * |
| MARLlib | CP+CL+CM+MI | 10 | IL+VD+CC | 18 | ✔️ | ✔️ | ✔️ | Ray/RLlib |

CP, CL, CM, and MI stand for the cooperative, collaborative, competitive, and mixed task learning modes. IL, VD, and CC stand for the independent learning, value decomposition, and centralized critic categories. SP stands for self-play. Comm stands for communication-based learning. An asterisk denotes that the benchmark uses its own framework.

Environment

Supported Multi-agent Environments / Tasks

Most of the popular environments in MARL research have been incorporated into this benchmark:

| Env Name | Learning Mode | Observability | Action Space | Observations |
|---|---|---|---|---|
| LBF | Mixed | Both | Discrete | Discrete |
| RWARE | Collaborative | Partial | Discrete | Discrete |
| MPE | Mixed | Both | Both | Continuous |
| SMAC | Cooperative | Partial | Discrete | Continuous |
| MetaDrive | Collaborative | Partial | Continuous | Continuous |
| MAgent | Mixed | Partial | Discrete | Discrete |
| Pommerman | Mixed | Both | Discrete | Discrete |
| MaMujoco | Cooperative | Partial | Continuous | Continuous |
| GRF | Collaborative | Full | Discrete | Continuous |
| Hanabi | Cooperative | Partial | Discrete | Discrete |

Each environment has a readme file that serves as the instructions for that task, covering environment settings, installation, and some important notes.
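When picking a task, it can help to query the environment table programmatically. The sketch below copies the rows of the table above into a plain dictionary; `EnvSpec`, `ENVIRONMENTS`, and `find_envs` are hypothetical names used for illustration, not part of the toolkit. It assumes that a table value of "Both" satisfies any requested value for that column.

```python
from typing import NamedTuple


class EnvSpec(NamedTuple):
    learning_mode: str
    observability: str
    action_space: str
    observations: str


# Rows copied from the environment table in this README.
ENVIRONMENTS = {
    "LBF": EnvSpec("Mixed", "Both", "Discrete", "Discrete"),
    "RWARE": EnvSpec("Collaborative", "Partial", "Discrete", "Discrete"),
    "MPE": EnvSpec("Mixed", "Both", "Both", "Continuous"),
    "SMAC": EnvSpec("Cooperative", "Partial", "Discrete", "Continuous"),
    "MetaDrive": EnvSpec("Collaborative", "Partial", "Continuous", "Continuous"),
    "MAgent": EnvSpec("Mixed", "Partial", "Discrete", "Discrete"),
    "Pommerman": EnvSpec("Mixed", "Both", "Discrete", "Discrete"),
    "MaMujoco": EnvSpec("Cooperative", "Partial", "Continuous", "Continuous"),
    "GRF": EnvSpec("Collaborative", "Full", "Discrete", "Continuous"),
    "Hanabi": EnvSpec("Cooperative", "Partial", "Discrete", "Discrete"),
}


def find_envs(**criteria):
    """Return names of environments whose fields match every criterion.

    Assumption: a table entry of "Both" matches any requested value.
    """
    matches = []
    for name, spec in ENVIRONMENTS.items():
        if all(getattr(spec, field) in (wanted, "Both")
               for field, wanted in criteria.items()):
            matches.append(name)
    return matches


print(find_envs(action_space="Continuous"))
# ['MPE', 'MetaDrive', 'MaMujoco']
print(find_envs(learning_mode="Cooperative", action_space="Discrete"))
# ['SMAC', 'Hanabi']
```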

Algorithm

We provide three types of MARL algorithms as baselines:

Independent Learning: IQL, DDPG, PG, A2C, TRPO, PPO

Centralized Critic: COMA, MADDPG, MAA2C, MAPPO, MATRPO, HATRPO, HAPPO

Value Decomposition: VDN, QMIX, FACMAC, VDAC, VDPPO
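To make the value decomposition category concrete, here is a minimal sketch of the additive form used by VDN, where the joint value is the sum of per-agent values. The function names are hypothetical and this is not the toolkit's implementation; QMIX replaces the plain sum with a learned monotonic mixing function, which is not shown. The sketch also checks the property that makes decentralized execution possible: with an additive joint value, each agent acting greedily on its own values yields the greedy joint action.

```python
from itertools import product


def vdn_q_tot(agent_q_values, joint_action):
    """VDN-style joint value: the sum of each agent's value for its own action."""
    return sum(q[a] for q, a in zip(agent_q_values, joint_action))


def decentralized_greedy(agent_q_values):
    """Each agent independently picks the action with its highest own value."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in agent_q_values)


def centralized_greedy(agent_q_values):
    """Brute-force search over all joint actions for the highest joint value."""
    joint_actions = product(*(range(len(q)) for q in agent_q_values))
    return max(joint_actions, key=lambda ja: vdn_q_tot(agent_q_values, ja))


# Illustrative per-agent action values (made up for this example):
# agent 0 has three actions, agent 1 has two.
q_values = [[0.25, 0.75, 0.5], [1.5, -0.25]]

print(decentralized_greedy(q_values))                           # (1, 0)
print(centralized_greedy(q_values))                             # (1, 0)
print(vdn_q_tot(q_values, decentralized_greedy(q_values)))      # 2.25
```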

Here is a chart describing the characteristics of each algorithm:

| Algorithm | Support Task Mode | Need Global State | Action | Learning Mode | Type |
|---|---|---|---|---|---|
| IQL | Mixed | No | Discrete | Independent Learning | Off Policy |
| PG | Mixed | No | Both | Independent Learning | On Policy |
| A2C | Mixed | No | Both | Independent Learning | On Policy |
| DDPG | Mixed | No | Continuous | Independent Learning | Off Policy |
| TRPO | Mixed | No | Both | Independent Learning | On Policy |
| PPO | Mixed | No | Both | Independent Learning | On Policy |
| COMA | Mixed | Yes | Both | Centralized Critic | On Policy |
| MADDPG | Mixed | Yes | Continuous | Centralized Critic | Off Policy |
| MAA2C | Mixed | Yes | Both | Centralized Critic | On Policy |
| MATRPO | Mixed | Yes | Both | Centralized Critic | On Policy |
| MAPPO | Mixed | Yes | Both | Centralized Critic | On Policy |
| HATRPO | Cooperative | Yes | Both | Centralized Critic | On Policy |
| HAPPO | Cooperative | Yes | Both | Centralized Critic | On Policy |
| VDN | Cooperative | No | Discrete | Value Decomposition | Off Policy |
| QMIX | Cooperative | Yes | Discrete | Value Decomposition | Off Policy |
| FACMAC | Cooperative | Yes | Continuous | Value Decomposition | Off Policy |
| VDAC | Cooperative | Yes | Both | Value Decomposition | On Policy |
| VDPPO* | Cooperative | Yes | Both | Value Decomposition | On Policy |

IQL is the multi-agent version of Q-learning. MAA2C and MATRPO are the centralized versions of A2C and TRPO. VDPPO is the value decomposition version of PPO.
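The algorithm characteristics above can be used to shortlist baselines for a given task. The following sketch encodes the table as data and filters it; `AlgoSpec`, `ALGORITHMS`, and `compatible_algorithms` are hypothetical names, not toolkit functions. Two assumptions are made that the table itself does not state: a "Mixed" entry under Support Task Mode is read as "usable in any task mode", and an algorithm that needs the global state is excluded when the environment does not provide one. The asterisk on VDPPO is dropped from the dictionary key.

```python
from typing import NamedTuple


class AlgoSpec(NamedTuple):
    task_mode: str            # "Mixed" or "Cooperative"
    needs_global_state: bool
    action: str               # "Discrete", "Continuous", or "Both"
    learning_mode: str
    policy_type: str


IL, CC, VD = "Independent Learning", "Centralized Critic", "Value Decomposition"

# Rows copied from the algorithm table in this README.
ALGORITHMS = {
    "IQL": AlgoSpec("Mixed", False, "Discrete", IL, "Off Policy"),
    "PG": AlgoSpec("Mixed", False, "Both", IL, "On Policy"),
    "A2C": AlgoSpec("Mixed", False, "Both", IL, "On Policy"),
    "DDPG": AlgoSpec("Mixed", False, "Continuous", IL, "Off Policy"),
    "TRPO": AlgoSpec("Mixed", False, "Both", IL, "On Policy"),
    "PPO": AlgoSpec("Mixed", False, "Both", IL, "On Policy"),
    "COMA": AlgoSpec("Mixed", True, "Both", CC, "On Policy"),
    "MADDPG": AlgoSpec("Mixed", True, "Continuous", CC, "Off Policy"),
    "MAA2C": AlgoSpec("Mixed", True, "Both", CC, "On Policy"),
    "MATRPO": AlgoSpec("Mixed", True, "Both", CC, "On Policy"),
    "MAPPO": AlgoSpec("Mixed", True, "Both", CC, "On Policy"),
    "HATRPO": AlgoSpec("Cooperative", True, "Both", CC, "On Policy"),
    "HAPPO": AlgoSpec("Cooperative", True, "Both", CC, "On Policy"),
    "VDN": AlgoSpec("Cooperative", False, "Discrete", VD, "Off Policy"),
    "QMIX": AlgoSpec("Cooperative", True, "Discrete", VD, "Off Policy"),
    "FACMAC": AlgoSpec("Cooperative", True, "Continuous", VD, "Off Policy"),
    "VDAC": AlgoSpec("Cooperative", True, "Both", VD, "On Policy"),
    "VDPPO": AlgoSpec("Cooperative", True, "Both", VD, "On Policy"),
}


def compatible_algorithms(task_mode, action_space, has_global_state):
    """Shortlist algorithms for a task under the assumptions stated above."""
    names = []
    for name, spec in ALGORITHMS.items():
        # Assumption: "Mixed" means the algorithm supports every task mode.
        mode_ok = spec.task_mode in ("Mixed", task_mode)
        action_ok = spec.action in ("Both", action_space)
        state_ok = has_global_state or not spec.needs_global_state
        if mode_ok and action_ok and state_ok:
            names.append(name)
    return names


print(compatible_algorithms("Cooperative", "Continuous", has_global_state=False))
# ['PG', 'A2C', 'DDPG', 'TRPO', 'PPO']
```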