agi-brain/xuance

Simple_Reference solving cleanly, but Simple_Speaker_Listener not being solved by MADDPG

MatthewCWeston opened this issue · 7 comments

Hello, I like this library quite a bit - it's been very useful so far. I've been aiming to produce a set of benchmark solutions to the MPE environments with it, but I've run into an issue where asymmetrical environments fail to train properly. When testing Simple_Speaker_Listener, the listener moves to the center of the landmarks rather than to the landmark indicated by the speaker.

Compounding the strangeness is the fact that simple_reference is solved quite cleanly. I'm currently running on version 1.7, with the following training code:

import xuanpolicy as xp
import torch
# Reference, SSL, World_Comm
from pettingzoo.mpe import simple_reference_v3, simple_speaker_listener_v4, simple_world_comm_v3, simple_spread_v3, simple_push_v3
import imageio
from IPython import display
import types
import os
import numpy as np
from collections import defaultdict
device = 'cuda' if torch.cuda.is_available() else 'cpu'

env_type = simple_speaker_listener_v4
env = env_type.parallel_env(max_cycles=25, continuous_actions=True, render_mode="rgb_array")

asymmetric = True # We train one agent if symmetrical, multiple otherwise.
agent_name = 'maddpg' if not asymmetric else ['maddpg'] * len(env.observation_spaces)

env_class_name = env_type.__name__.split('.')[-1]
env_name = f"{'_'.join(env_class_name.split('_')[:-1])}"
cpath = f'/usr/local/lib/python3.10/dist-packages/xuanpolicy/configs/maddpg/mpe/{env_name}.yaml'
runner = xp.get_runner(agent_name=agent_name, env_name=f"mpe/{env_name}", is_test=False)

The configuration I'm using is as follows - it's identical to the config file used to train simple_reference successfully, save for the name of the environment.

agent: "MADDPG"  # the learning algorithms_marl
env_name: "mpe"
env_id: "simple_speaker_listener_v4"
continuous_action: True
policy: "MADDPG_policy"
representation: "Basic_Identical"
vectorize: "Dummy_MAS"
runner: "MARL"

representation_hidden_size: [32, ]  # the units for each hidden layer
actor_hidden_size: [256, ]
critic_hidden_size: [256, ]
activation: 'ReLU'
activation_action: 'sigmoid'

lr_a: 0.01  # learning rate for actor
lr_c: 0.001  # learning rate for critic
tau: 0.001  # soft update for target networks
sigma: 0.1  # random noise for continuous actions
clip_grad: 0.5

buffer_size: 200000
batch_size: 256
gamma: 0.95  # discount factor

training_steps: 30000
training_frequency: 1

n_tests: 5
test_period: 100
consider_terminal_states: False  # whether to consider terminal states when calculating target Q-values

use_obsnorm: False
use_rewnorm: False
obsnorm_range: 5
rewnorm_range: 5

logdir: "./logs/maddpg/"
modeldir: "./models/maddpg/"

Is there something that I've configured incorrectly? Similar settings seemed to work in the original MADDPG paper.

Thank you for your kind words and for providing additional information about the simple_speaker_listener scenario of MPE. The simple_speaker_listener scenario in multi-agent particle environments (MPE) involves heterogeneous cooperative agents, which differs from the other scenarios. The current version of XuanPolicy may not fully support this type of environment and requires further improvements. We acknowledge this limitation and intend to address it in future updates; continuous development is essential for XuanPolicy to adapt and improve over time.

If there's anything else I can assist you with, please let me know!

Thanks for the response. I'd be willing to take a shot at getting asymmetrical environments to work in Xuanpolicy, or at least helping develop things in that direction, if desired.

To start with, I've been trying to get my MARL training script to work in the newest version, since that seems like the best place to develop from. I started by trying to run Simple_Spread, since it already has a .yaml file for that environment. Using code similar to the above (adapted for version 0.1.8.2), I ran into an error in the memory-buffer storage code, which appears to result from inconsistencies between the buffer shapes and the shapes of the data being stored. Am I doing something wrong when initializing my environment? The memory buffers look like this:

<name>: <buffer_shape>; <to_store_shape>
actions:    (16, 5000, 5);    (16, 3, 5)   # populates 3 of the 5,000 rows
obs_next:   (16, 5000, 18);   (16, 3, 18)  # populates 3 of the 5,000 rows
rewards:    (16, 5000, 3, 1); (16, 3, 1)   # populates 1 of the 5,000 rows
terminals:  (16, 5000, 3);    (16, 3)      # populates 1 of the 5,000 rows
agent_mask: (16, 5000, 3);    (16, 3)      # populates 1 of the 5,000 rows
state:      (16, 5000, 54);   (16, 54)     # populates 1 of the 5,000 rows
state_next: (16, 5000, 54);   (16, 54)     # populates 1 of the 5,000 rows

At present, the code for MARL_OffPolicyBuffer.store attempts to move a shared pointer along dimension 1 of each of the arrays, stepping forward by 1 each time something is stored, which works for everything except actions and obs_next. Is this the result of my initializing something incorrectly?
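To illustrate what I mean, here is a toy sketch of the store semantics (not XuanPolicy's actual code), assuming every array shares one pointer that advances along dimension 1 by one slot per store() call:

import numpy as np

class ToyOffPolicyBuffer:
    def __init__(self, n_envs=16, buffer_size=5000, item_shapes=None):
        self.ptr, self.buffer_size = 0, buffer_size
        # Per-step shapes for each entry; dimension 1 of every array is the
        # step axis that the shared pointer walks along.
        item_shapes = item_shapes or {"rewards": (3, 1), "terminals": (3,)}
        self.data = {k: np.zeros((n_envs, buffer_size) + s) for k, s in item_shapes.items()}

    def store(self, step_data):
        # Each step_data[k] must have shape (n_envs,) + item_shapes[k];
        # anything carrying an extra axis (like actions/obs_next above) breaks this.
        for key, value in step_data.items():
            self.data[key][:, self.ptr] = value
        self.ptr = (self.ptr + 1) % self.buffer_size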

For completeness, the code I used for this test is below:

env_type = simple_spread_v3

env_class_name = env_type.__name__.split('.')[-1]
runner = xp.get_runner(method='maddpg', env='mpe', env_id=f"{env_class_name}", is_test=False)

runner.run()

Hi, we've noticed that there was an issue with the MARL_OffPolicyBuffer.store() function due to incorrect buffer data shape initialization in the MARL_OffPolicyBuffer.clear() function. Thank you for bringing it to our attention! We've updated the clear() function in the latest version of XuanPolicy.

We apologize for not uploading the newest version promptly. Please feel free to let us know about any other potential issues.

I ran the system again on the newest version. Simple_spread now solves with MADDPG on both versions; see the attached plots. I also found an error in xuanpolicy/environment/pettingzoo/__init__.py: "simple_speak_listener_v4" should be "simple_speaker_listener_v4". You might also want to look at the inconsistency between "model_dir", "modeldir", "log_dir", and "logdir" in common_tools.py, lines 124 to 163 - it's not major, but might be unintended.

(attached plots: 17_scores, 18_scores)

Making those changes, I was able to run simple_speaker_listener with the same hyperparameters as simple_spread. Unfortunately, the model ended up in a local optimum where the speaker always outputs all zeros and the listener always moves to the center of the three landmarks. Do you have any advice? (A quick rollout sketch of how I'm inspecting this is included after the attached plot.)

(attached plot: ssl_scores)
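The sketch just rolls the environment with the PettingZoo parallel API (assuming a recent PettingZoo where reset() returns (obs, infos)); policy_fn is a hypothetical stand-in for however the trained speaker/listener are queried, stubbed here with random actions so the snippet runs on its own:

from pettingzoo.mpe import simple_speaker_listener_v4
import numpy as np

env = simple_speaker_listener_v4.parallel_env(max_cycles=25, continuous_actions=True)
obs, _ = env.reset(seed=0)  # recent PettingZoo returns (obs, infos)

def policy_fn(observations):
    # Hypothetical placeholder: swap in the trained policies here.
    return {agent: env.action_space(agent).sample() for agent in env.agents}

while env.agents:
    actions = policy_fn(obs)
    # In the failure mode above, the speaker's output stays near all zeros.
    print("speaker_0 action:", np.round(actions["speaker_0"], 3))
    obs, rewards, terminations, truncations, infos = env.step(actions)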

For completeness, I ran simple_reference, again with the same hyperparameters as in the default config for simple_spread. It was able to solve the environment fairly cleanly and reliably. Given that simple_reference is more complex than simple_speaker_listener, it does continue to confuse me that the former works and the latter doesn't.

(attached plot: sr_scores)

Runner initialization code for symmetrical and asymmetrical environments is below, in case it is relevant. Everything else, unless otherwise stated, is identical:

# For simple_spread and simple_reference
runner = xp.get_runner(method='maddpg', env='mpe', env_id=f"{env_class_name}", is_test=False)
# For simple_speaker_listener (asymmetric agents)
#runner = xp.get_runner(method=['maddpg']*2, env='mpe', env_id=f"{env_class_name}", is_test=False)

I think I figured out what was going wrong - I looked through some benchmarking papers and found that MADDPG and MATD3 struggle with the SSL environment. Since you also have implementations of MAPPO, QMIX, and MASAC, which are known to solve it reliably, I decided to try them out. Unfortunately, I wasn't able to get any of the three to cleanly solve my target environments. I found some information that might be useful, provided below:


MAPPO's MPE Simple_Spread example config doesn't work on the current build because it's missing a few fields. Specifically, use_global_state, gain, weight_decay, use_linear_lr_decay, use_huber_loss, huber_delta, use_value_norm, and end_factor_lr_decay. With those added, it throws a size-mismatch error when run - it appears to pass a single agent's observation into the multi-agent critic, on line 138 of categorical_marl (reached from line 73 of mappo_learner). Debugging the issue, I found that the following lines of mappo_agents:

if self.use_global_state:
    state = torch.Tensor(state).unsqueeze(1).to(self.device)
    critic_in = state.expand(-1, self.n_agents, -1)
else:
    critic_in = torch.Tensor(obs_n).view([batch_size, 1, -1]).to(self.device)
    critic_in = critic_in.expand(-1, self.n_agents, -1)

pass the correct information to the value function. Adjusting line 73 accordingly (a sketch of the change is included after the attached config), I am able to run simple_reference. MAPPO predicts discrete actions rather than continuous ones, so I ran my tests on the discrete environment rather than the continuous one. Training didn't seem to succeed, though - performance was relatively low and improved only slowly over the course of training. I have attached the config file I used for MAPPO. It's possible that this is where I went wrong, so if you could have a quick look for anything that's clearly incorrect, it might help substantially. Once I've got a working build, I'll share everything that got me there so that it can be merged into the master repo if desired.

mappo_config.txt
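For clarity, this is roughly the adjustment I made around line 73 of mappo_learner - a sketch assuming the batched observations arrive as (batch_size, n_agents, obs_dim), mirroring the mappo_agents branch quoted above rather than the repo's exact code:

import torch

def build_critic_input(obs_n, n_agents, device):
    # obs_n: (batch_size, n_agents, obs_dim) joint observation from the sampled batch.
    obs_n = torch.as_tensor(obs_n, dtype=torch.float32, device=device)
    batch_size = obs_n.shape[0]
    # Flatten the joint observation and repeat it once per agent, as in
    # mappo_agents, so the centralized critic sees every agent's observation.
    critic_in = obs_n.reshape(batch_size, 1, -1)   # (B, 1, n_agents * obs_dim)
    return critic_in.expand(-1, n_agents, -1)      # (B, n_agents, n_agents * obs_dim)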


MASAC's default config file for simple_spread appears to be outdated and is likewise missing a number of required fields. The policy name "Gaussian_MASAC_Policy" is written with a lowercase 'p', resulting in an error. Line 35 of masac_agents.py didn't match MARL_OffPolicyBuffer's initialization arguments, so I replaced it with the below:


memory = MARL_OffPolicyBuffer(len(config.agent_keys),
                              state_shape,
                              config.obs_shape,
                              config.act_shape,
                              config.rew_shape,
                              config.done_shape,
                              envs.num_envs,
                              config.buffer_size,
                              config.batch_size)

From there, MASAC_Agents.act expects an 'episode' argument, but doesn't use it. I replaced line 49 with:

def act(self, obs_n, test_mode, state=None, noise=False):

MASAC's act function only returns an action, so I had to add the following at line 170 of runner_pettingzoo:

elif self.marl_names[h] in ["MASAC"]:
    a = mas_group.act(obs_n[h], test_mode)
    actions_n_onehot.append(a_onehot)

I also had to change the train function of masac_agents to:

def train(self, i_episode):
    sample = self.memory.sample()
    info_train = self.learner.update(sample)
    return info_train

There were a few other version incompatibilities that I didn't manage to identify. I can see if I can patch things more completely and get some test results - let me know if that's desired.


QMIX runs out of the box on simple_spread, but the average reward actually seems to decrease over time. Below is the average reward over one million training steps.

(attached plot: qmix_simple_spread)


Update:

Since the primary issue seemed to be version incompatibility, I rolled back to 1.7 and tried some experiments.


MASAC on simple_spread ran reasonably well out of the box. I ran it on simple_reference as well, with the same hyperparameters; on both, it performed decently, though not perfectly - potentially due to a relatively short training duration. I then ran it on simple_speaker_listener, where it fell into the standard failure mode of averaging the three landmark locations.


QMIX likewise ran right out of the box. I tried it on simple_reference first, and it finished training quite quickly. Unfortunately, it didn't seem to work - the agents usually converged on the center of the landmarks.


MAPPO is split into two variants in 1.7: mappoclip and mappokl. Both throw assertion errors, since the environment builder doesn't set the flag for discrete actions. I might take a shot at fixing that tomorrow or Friday.

Update 2:

I took a shot at fixing it. I patched a compatibility issue that was causing the assertion error and made a few other changes. mappo_clip solves simple_spread nicely in 20,000 training steps. I gave it 40,000 steps with the same hyperparameters to solve simple_reference, but it simply averaged the landmarks and didn't learn to communicate. I tried mappo_kl under the same conditions, and it also failed to solve the environment, though I could try a longer run.


As for simple_speaker_listener, simple_world, and simple_reference, what algorithm and version would you recommend?

Thank you once again for your valuable advice. We will improve the repo according to your suggestions and comments very soon.

As for your question about simple_speaker_listener, simple_world, and simple_reference, maybe the following repos can help you.

  1. https://github.com/zoeyuchao/mappo.git
  2. https://github.com/marlbenchmark/off-policy.git

Thanks - I took a look at the former, but the latter is new to me.

Since this repo solves simple_reference well, I think I might be able to figure out what's going wrong with simple_speaker_listener. If I do, I'll report back with what I learn.