This repository is the official implementation of the ICLR 2024 paper *Byzantine Robust Cooperative Multi-Agent Reinforcement Learning as a Bayesian Game*. It contains the implementation of the EIR-MAPPO defense method, along with several Multi-Agent Reinforcement Learning (MARL) environments used in our evaluation, such as Toy, LBF, and SMAC.
Our codebase supports various algorithms and environments, with default parameters specified in `./eir_mappo/configs`. The training algorithm parameters for EIR-MAPPO are located in `./eir_mappo/configs/alg/mappo_advt_belief.yaml`, and the parameters for the attacking algorithm are in `./eir_mappo/configs/alg/mappo_traitor_belief.yaml`.

Environment-specific parameters are stored in `./eir_mappo/configs/env`, with the YAML configuration files listed as follows:
| MARL Environment | YAML Configuration File |
| --- | --- |
| Toy | `./eir_mappo/configs/env/toy.yaml` |
| LBF | `./eir_mappo/configs/env/lbforaging.yaml` |
| SMAC (Training) | `./eir_mappo/configs/env/smac.yaml` |
| SMAC (Attack) | `./eir_mappo/configs/env/smac_traitor.yaml` |
To train the agents, execute the following command as an example:

```sh
python -u main.py --alg mappo_advt_belief --env smac --exp_name train --map_name 4m_vs_3m --seed 1
```
- `--alg`: Sets the algorithm. Using `--alg mappo_advt_belief` selects the EIR-MAPPO training algorithm, with default parameters stored in `./eir_mappo/configs/alg/mappo_advt_belief.yaml`.
- `--env`: Sets the MARL environment. Specifying `--env smac` selects the SMAC environment for training, with its default parameters located in `./eir_mappo/configs/env/smac.yaml`.
- `--map_name`: Specifies the map to be used for training. If `--map_name` is not explicitly set, the default map name from the environment's configuration file is used. The `map_name` parameter is ignored when the selected environment is `Toy`.
- `--exp_name`: Names the experiment. If `--exp_name` is not provided, it defaults to `test`.
- `--seed`: Specifies the seed for initializing the experiment, with its default set in the algorithm's configuration file.
The models and training data are saved in the following directories:

```
# For MARL environments other than Toy
models: ./eir_mappo/results/{env}/{map_name}/mappo_advt_belief/{exp_name}/{seed}/run{iter}/models
data:   ./eir_mappo/results/{env}/{map_name}/mappo_advt_belief/{exp_name}/{seed}/run{iter}/logs

# For the Toy MARL environment
models: ./eir_mappo/results/{env}/mappo_advt_belief/{exp_name}/{seed}/run{iter}/models
data:   ./eir_mappo/results/{env}/mappo_advt_belief/{exp_name}/{seed}/run{iter}/logs
```
The `{iter}` placeholder in the path starts at 1 and is incremented by one with each new run, ensuring that experimental data for the same configuration do not overlap.
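The increment logic can be sketched as follows. This hypothetical `next_run_dir` helper assumes run directories are named `run1`, `run2`, and so on; the repository's actual implementation may differ:

```python
import os
import re
import tempfile

def next_run_dir(base):
    # Scan existing run* folders and choose the next integer, starting at 1.
    runs = [int(m.group(1)) for name in os.listdir(base)
            if (m := re.fullmatch(r"run(\d+)", name))]
    return os.path.join(base, f"run{max(runs, default=0) + 1}")

base = tempfile.mkdtemp()
os.mkdir(os.path.join(base, "run1"))
os.mkdir(os.path.join(base, "run2"))
print(next_run_dir(base))  # path ending in "run3"
```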
To attack the models and train the adversarial agents, execute the following command as an example:

```sh
python -u main.py --alg mappo_traitor_belief --env smac --exp_name attack_eir_mappo --map_name 4m_vs_3m --seed 1 --agent_adversary 0 --model_dir ./eir_mappo/results/smac/4m_vs_3m/mappo_advt_belief/eir_mappo/1/run1/models
```
- `--alg`, `--env`, `--exp_name`, `--map_name`, `--seed`: As previously described.
- `--model_dir`: Specifies the directory containing the model to be attacked.
- `--agent_adversary`: Indicates the index of the adversarial agent within the training environment. The adversarial agent's index can also be configured in the algorithm's configuration file (`./eir_mappo/configs/alg/mappo_traitor_belief.yaml`).
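To illustrate what the index means (a hypothetical `split_agents` helper, not code from the repository): in a 4-agent task such as `4m_vs_3m`, `--agent_adversary 0` makes agent 0 run the learned attack policy while the remaining agents keep the policy under evaluation:

```python
def split_agents(n_agents, adversary_idx):
    # Partition agent indices into the single adversary and the
    # cooperative agents that keep the trained defense policy.
    cooperative = [i for i in range(n_agents) if i != adversary_idx]
    return adversary_idx, cooperative

adv, coop = split_agents(4, 0)
print(adv, coop)  # prints: 0 [1, 2, 3]
```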
The models and attack data are saved in the following directories:
```
# For MARL environments other than Toy
models: ./eir_mappo/results/{env}/{map_name}/mappo_traitor_belief/{exp_name}/{seed}/run{iter}/models
data:   ./eir_mappo/results/{env}/{map_name}/mappo_traitor_belief/{exp_name}/{seed}/run{iter}/logs

# For the Toy MARL environment
models: ./eir_mappo/results/{env}/mappo_traitor_belief/{exp_name}/{seed}/run{iter}/models
data:   ./eir_mappo/results/{env}/mappo_traitor_belief/{exp_name}/{seed}/run{iter}/logs
```
- `{env}`, `{map_name}`, `{exp_name}`, `{seed}`, `{iter}`: These placeholders are as previously described.
We evaluate our performance under the strongest non-oblivious attack, where an adversary can manipulate any agent in cooperative tasks and execute an arbitrary learned worst-case policy. We also record the agents' behaviors under attack in videos. These videos showcase our methods alongside the baseline methods in the `12x12-4p-3f-c` configuration of the LBF environment and the `4m_vs_3m` scenario of the SMAC environment, as listed in the table below.
| Training Algorithm | Video Directory |
| --- | --- |
| MADDPG | LBF video, SMAC video |
| M3DDPG | LBF video, SMAC video |
| MAPPO | LBF video, SMAC video |
| RMAAC | LBF video, SMAC video |
| EAR-MAPPO | LBF video, SMAC video |
| EIR-MAPPO | LBF video, SMAC video |
| True Type | LBF video, SMAC video |