[Discussion] TorchRL MARL API
matteobettini opened this issue · 6 comments
Hello everyone, this discussion starts a proposal to extend the TorchRL MARL API.
Hope to get your feedback.
Potential TorchRL MARL API
This API proposes a general structure that multi-agent environments can use in TorchRL to pass their data to the library. It will not be enforced. Its core tenet is that data processed by the same neural network structure should be stacked (grouped) together to leverage tensor batching, while data processed by different neural networks should be kept under different keys.
Data format
Agents have observations, done, reward and actions. These values can be processed by the same component or by different components. If some values are processed by the same component across agents, they should be stacked (grouped) together under the same key. Grouping happens within a nested TensorDict with an additional dimension representing the group size.
Users can optionally maintain in the env a map from each group to its member agents.
Let's see a few examples.
Case 1: all agents’ data is processed together
In this example, all agents' data will be processed by the same neural network, so it is convenient to stack it, creating a tensordict with an "n_agents" dimension:
TensorDict(
"agents": (
"obs_a": Tensor,
"obs_b": Tensor,
"action": Tensor,
"done": Tensor,
"reward": Tensor,
batch_size=[*B,n_agents]),
"state": Tensor,
batch_size=[*B])
In this example "agents" is the group.
It means that each tensor in “agents” will have a leading shape [*B,n_agents] and can be passed to the same neural network.
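For reference, here is a minimal, runnable sketch of this layout using the tensordict library; all sizes (batch size, observation and action dims) are illustrative assumptions, not part of the proposal.

import torch
from tensordict import TensorDict

# Illustrative sizes (assumptions): batch B=(32,), 3 agents,
# obs_a of dim 8, obs_b of dim 4, actions of dim 2.
B, n_agents = (32,), 3
td = TensorDict(
    {
        "agents": TensorDict(
            {
                "obs_a": torch.randn(*B, n_agents, 8),
                "obs_b": torch.randn(*B, n_agents, 4),
                "action": torch.randn(*B, n_agents, 2),
                "done": torch.zeros(*B, n_agents, 1, dtype=torch.bool),
                "reward": torch.zeros(*B, n_agents, 1),
            },
            batch_size=[*B, n_agents],
        ),
        "state": torch.randn(*B, 16),
    },
    batch_size=B,
)
# td["agents"].batch_size == torch.Size([32, 3]); td.batch_size == torch.Size([32])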
Optionally, we can maintain a map from group to agents. Supposing we have 3 agents named "agent_0", "agent_1", "agent_2", we can see that they are all part of the "agents" group:
env.group_map["agents"] = ["agent_0", "agent_1", "agent_2"]
In the above example, all the keys under the "agents" group have an agent dimension. If, on the other hand, some keys are shared (like "state"), they should be put in the root TensorDict outside of the group to highlight that they are missing the agent dimension. For example, if done and reward were shared by all agents we would have:
TensorDict(
"agents": (
"obs_a": Tensor,
"obs_b": Tensor,
"action": Tensor,
batch_size=[*B,n_agents]),
"state": Tensor,
"done": Tensor,
"reward": Tensor,
batch_size=[*B])
Example neural network for this case
A policy for this use case can look something like
TensorDictSequential(
TensorDictModule(in_keys=[("agents","obs_a"),("agents","obs_b")], out_keys=[("agents","action")])
)
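As a concrete (hedged) sketch, the module above could wrap a small MLP applied over the trailing feature dimension, so the leading [*B, n_agents] dims pass through unchanged; the dimensions below are illustrative assumptions.

import torch
from torch import nn
from tensordict.nn import TensorDictModule, TensorDictSequential

# Illustrative dims (assumptions): obs_a=8, obs_b=4, action=2.
class ConcatMLP(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.Tanh(), nn.Linear(64, out_dim))
    def forward(self, obs_a, obs_b):
        # concatenate observations on the feature dim; batch dims [*B, n_agents] pass through
        return self.net(torch.cat([obs_a, obs_b], dim=-1))

policy = TensorDictSequential(
    TensorDictModule(
        ConcatMLP(8 + 4, 2),
        in_keys=[("agents", "obs_a"), ("agents", "obs_b")],
        out_keys=[("agents", "action")],
    )
)
# Calling policy(td) on the Case 1 tensordict above writes ("agents", "action")
# with shape [*B, n_agents, 2].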
A value network for this use case can look something like
TensorDictSequential(
TensorDictModule(in_keys=[("agents","obs_a"),("agents","obs_b"),"state"], out_keys=["value"]),
)
Note that even if the agents share the same processing, different parameters can be used for each agent via the use of vmap.
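A minimal sketch of the per-agent-parameters idea using torch.func (not TorchRL's built-in multi-agent modules; names and sizes are illustrative assumptions):

import torch
from torch import nn
from torch.func import stack_module_state, functional_call, vmap

# One MLP architecture, separate parameters per agent (illustrative sizes).
n_agents, obs_dim, act_dim = 3, 8, 2
nets = [nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, act_dim)) for _ in range(n_agents)]
params, buffers = stack_module_state(nets)  # stacks each parameter along a new agent dim

def forward_one(p, b, x):
    # run the shared architecture with one agent's parameters on that agent's obs
    return functional_call(nets[0], (p, b), (x,))

obs = torch.randn(n_agents, obs_dim)  # the stacked agent dimension of the "agents" group
actions = vmap(forward_one)(params, buffers, obs)  # [n_agents, act_dim], one parameter set per agent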
This API is currently supported in TorchRL and can be used with VMAS. You can see how in this tutorial.
Case 2: some groups of agents share data processing
Sometimes only some of the agents share data processing. This can be because agents are physically different (heterogeneous) or have different behaviors (neural networks) associated with them (as in MLAgents). Once again we use tensordicts to group agents that share data processing:
TensorDict(
"group_1": (
"obs_a": Tensor,
"action": Tensor,
"done": Tensor,
"reward": Tensor,
batch_size=[*B, n_group_1]),
"group_2": (
"obs_a": Tensor,
"action": Tensor,
"done": Tensor,
"reward": Tensor,
batch_size=[*B, n_group_2]),
"state": Tensor,
batch_size=[*B])
Agents can still share "reward" or "done"; in that case, as above, these keys can be placed outside the groups in the root tensordict.
We can again optionally keep the group membership in the group map:
env.group_map["group_1"] = ["agent_0", "agent_1"]
env.group_map["group_2"] = ["agent_2"]
Example neural network for this case
An example policy
TensorDictSequential(
TensorDictModule(in_keys=[("group_1","obs_a")], out_keys=[("group_1","action")]),
TensorDictModule(in_keys=[("group_2","obs_a")], out_keys=[("group_1","action")]),
)
An example policy sharing a hidden state
TensorDictSequential(
TensorDictModule(in_keys=[("group_1","obs_a")], out_keys=[("group_1","hidden")]),
TensorDictModule(in_keys=[("group_2","obs_a")], out_keys=[("group_2","hidden")]),
TensorDictModule(lambda h1, h2: torch.cat([h1, h2], dim=-2), in_keys=[("group_1","hidden"),("group_2","hidden")], out_keys=["hidden"]),
TensorDictModule(in_keys=["hidden"], out_keys=["hidden_processed"]),
TensorDictModule(lambda y: (y[..., :n_group_1, :], y[..., n_group_1:, :]), in_keys=["hidden_processed"], out_keys=[("group_1","action"),("group_2","action")]),
)
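A runnable (hedged) version of the shared-hidden-state policy above; all sizes are illustrative assumptions, and wrapping lambdas in TensorDictModule is just one way to express the concatenate/split steps.

import torch
from torch import nn
from tensordict.nn import TensorDictModule, TensorDictSequential

# Illustrative sizes (assumptions): obs dim 8, hidden dim 16, action dim 2.
n_group_1, n_group_2 = 2, 1

policy = TensorDictSequential(
    TensorDictModule(nn.Linear(8, 16), in_keys=[("group_1", "obs_a")], out_keys=[("group_1", "hidden")]),
    TensorDictModule(nn.Linear(8, 16), in_keys=[("group_2", "obs_a")], out_keys=[("group_2", "hidden")]),
    # concatenate the per-group hidden states along the agent dimension
    TensorDictModule(
        lambda h1, h2: torch.cat([h1, h2], dim=-2),
        in_keys=[("group_1", "hidden"), ("group_2", "hidden")],
        out_keys=["hidden"],
    ),
    TensorDictModule(nn.Linear(16, 2), in_keys=["hidden"], out_keys=["hidden_processed"]),
    # split the joint output back into per-group actions
    TensorDictModule(
        lambda y: (y[..., :n_group_1, :], y[..., n_group_1:, :]),
        in_keys=["hidden_processed"],
        out_keys=[("group_1", "action"), ("group_2", "action")],
    ),
)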
This API is suited for environments with APIs using behavior or groups, such as MLAgents.
Case 3: no agents share processing (groups correspond to individual agents)
All agents can also be independent and each have their own group
TensorDict(
"agent_0": (
"obs_a": Tensor,
"action": Tensor,
"reward": Tensor,
"done": Tensor,
batch_size=[*B]),
"agent_1": (
"obs_a": Tensor,
"action": Tensor,
"reward": Tensor,
"done": Tensor,
batch_size=[*B]),
"agent_2": (
"obs_a": Tensor,
"action": Tensor,
"reward": Tensor,
"done": Tensor,
batch_size=[*B]),
"state": Tensor,
batch_size=[*B])
Again, we can check that each agent belongs to its own group:
env.group_map["agent_0"] = ["agent_0"]
env.group_map["agent_1"] = ["agent_1"]
env.group_map["agent_2"] = ["agent_2"]
Example neural network for this case
Exactly like in case 2
This API is suited for environments treating agents as completely independent, such as PettingZoo parallel envs or RLlib.
Important notes (suggested)
- A group is a nested tensordict with an action key
- The reward and done keys can be present EITHER in the root td OR in each and every group td, but not in both
- The sum of the group sizes is the number of agents
- Each agent has to belong to one and only one group (see the sketch after this list)
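A small sketch of how these invariants could be checked, assuming a hypothetical env exposing group_map (a dict mapping group name to member agent names) and n_agents:

def check_group_invariants(env):
    # flatten the group map into a single list of agent names
    agents = [a for members in env.group_map.values() for a in members]
    # each agent belongs to one and only one group
    assert len(agents) == len(set(agents)), "an agent appears in more than one group"
    # the sum of the group sizes is the number of agents
    assert len(agents) == env.n_agents, "group sizes do not sum to the number of agents"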
Changes required in the library
- Allow multiple (nested) action, reward, done keys in #1462
- Multiple keys will also have to be accounted for in advantages, losses and modules.
@hyerra @smorad @Acciorocketships @pseudo-rnd-thoughts @RiqiangGao @btx0424 @mattiasmar @vmoens @janblumenkamp
I'm probably missing something, but will this API work with turn-based games, parallel-agent games and a dynamic number of agents (i.e., the number of agents that take actions each turn changes)? It is possible to have games that do all of these: a turn-based game with a dynamic number of agents acting each turn.
Great point!
This API can be used as is for parallel games.
For turn-based games or a variable number of agents (I am grouping them under the same roof since, to me, an agent dropping out is the same as an agent not having its turn), I can envision two ways, both available to the user:
- If some agents within the same group are varying (e.g., Case 1 above), the group size should be maintained and a mask can be provided to exclude the respective agents. We are working on action masks in #1404 and #1421.
- If agents within groups are not varying, but some groups drop in and out (e.g., as in Case 3 above and in the PettingZoo parallel flavour), the respective groups can be removed from/added to the tensordict.
For example, we start with a group of 2 goalies, plus agent_2 and agent_3 each in their own group:
TensorDict(
"goalies": (
"obs_a": Tensor
"action": Tensor,
"reward": Tensor,
"done": Tensor,
batch_size=[*B, 2]),
"agent_2": (
"obs_a": Tensor,
"action": Tensor,
"reward": Tensor,
"done": Tensor,
batch_size=[*B]),
"agent_3": (
"obs_a": Tensor,
"action": Tensor,
"reward": Tensor,
"done": Tensor,
batch_size=[*B]),
"state": Tensor,
batch_size=[*B])
Now agent_2 drops out (or doesn't act)
TensorDict(
"goalies": (
"obs_a": Tensor
"action": Tensor,
"reward": Tensor,
"done": Tensor,
batch_size=[*B, 2]),
"agent_3": (
"obs_a": Tensor,
"action": Tensor,
"reward": Tensor,
"done": Tensor,
batch_size=[*B]),
"state": Tensor,
batch_size=[*B])
Now the goalies drop out and agent_2 returns
TensorDict(
"agent_2": (
"obs_a": Tensor
"action": Tensor,
"reward": Tensor,
"done": Tensor,
batch_size=[*B]),
"agent_3": (
"obs_a": Tensor,
"action": Tensor,
"reward": Tensor,
"done": Tensor,
batch_size=[*B]),
"state": Tensor,
batch_size=[*B])
EDIT:
This second solution will still require some tricks, as stacking variable tensordicts over time in a rollout will not be contiguous; i.e., we cannot densify the tensors stacked over time since different timesteps will have different entries.
So maybe the masking solution is the only one feasible in all cases, since we want to obtain dense tensors for training.
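As an illustration of the masking route, a group could keep a fixed size and carry a boolean mask marking which agents acted at a given step. The key name "mask" and all sizes below are assumptions for the sketch, not an established convention.

import torch
from tensordict import TensorDict

B, n_agents = (32,), 3
td = TensorDict(
    {
        "agents": TensorDict(
            {
                "obs_a": torch.randn(*B, n_agents, 8),
                "action": torch.zeros(*B, n_agents, 2),
                "reward": torch.zeros(*B, n_agents, 1),
                "done": torch.zeros(*B, n_agents, 1, dtype=torch.bool),
                # True where the agent is active / took its turn this step
                "mask": torch.ones(*B, n_agents, dtype=torch.bool),
            },
            batch_size=[*B, n_agents],
        ),
    },
    batch_size=B,
)
# Downstream, losses can zero out inactive agents, e.g.
# masked_loss = (per_agent_loss * td["agents", "mask"]).sum() / td["agents", "mask"].sum()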
Also, I am currently using a game where we have a dynamic number of agents. For a bit of technical detail, I'm currently using the Unity MLAgents framework. In that framework, agents may request decisions (they need an action from the policy) at arbitrary timesteps. So maybe at timestep t=2, agents a and b request a decision, and at timestep t=3, agents c and d request decisions.
In my game setup, once an agent dies, it no longer requests decisions. In TorchRL, I handle this dynamic decision requesting by having a valid_mask key. The valid_mask corresponds to whether or not the agent requested a decision, so we know whether or not to set actions for that agent. You can see more about it in #1201. However, the requirement here is that all agents need to initially request a decision so that we can make the specs for them. Since TorchRL specs are fixed, we don't support adding new agents after the initial timestep; however, old agents can be removed by setting their valid_mask to False, i.e., by not requesting decisions for that agent anymore.
I wonder whether, with this new MARL API, we could make it so that agents can be added/removed entirely, without the need for a valid mask, while the behaviors stay fixed. That would be a lot more flexible.
@hyerra that is what I was getting at with the second option in my comment above.
However, I think it is not feasible to remove or add agents/groups over time as stacking the data in the time dimension will be difficult.
I am not sure we can get rid of masks for turn-based/variable-agent games, even with this new API.
Closing this as inactive, happy to reopen if we need to talk about MARL API further!