alex-petrenko/sample-factory

Question about implementing PPO in multi-agent environments

Closed this issue · 4 comments

Hello, I am a beginner in multi-agent reinforcement learning. I came across this work and am very interested in it, but I have some confusion about the implementation of PPO in multi-agent environments.

First, assume a homogeneous multi-agent environment. The agents do not communicate; they can only obtain their own local observations, and they need to use certain resources to complete their tasks. However, resources are limited and must be used in a sensible order to ensure that every agent can complete its task. If the agents lack a sense of cooperation, situations such as deadlocks may arise. In this scenario, all agents share the same policy, and we hope that after training each agent demonstrates cooperative behavior. The agents receive a small penalty at every step, but a large reward when they complete their task.

Based on the above environment, my question is as follows:

  • (1) I would like to know whether the optimization goal in Sample Factory's multi-agent environments is to maximize the reward of a single agent, or to maximize the total reward of the multi-agent system.
  • (2) In a multi-robot environment, is the purpose of num_agents just to increase the amount of sample data (i.e. to increase the total batch size per rollout), or does each network update need to process the data of a group of num_agents agents together?
  • (3) After one of the agents completes its task, we will set is_active=True. If the first interpretation in (2) is correct, is it enough to simply discard that agent's samples? Can I then understand this as still being a single-agent problem, except that the environment this agent experiences during training changes along with its policy?
  • (4) Multi-policy training: if the agents are homogeneous and share the same policy, can multi-policy training be used, or does only one policy need to be set?

That is my confusion. As a beginner, I may not have expressed it accurately or used the right terms, so please forgive me. I am very much looking forward to getting some help to resolve it.

(1) I would like to know whether the optimization goal in Sample Factory's multi-agent environments is to maximize the reward of a single agent, or to maximize the total reward of the multi-agent system.

The typical config (e.g. the VizDoom multi-agent examples) does single-agent optimization, also known as self-play.

You can try implementing multi-agent joint optimization by feeding all observations into the same policy and, likewise, outputting multiple actions per step (e.g. using a multi-head action space such as multi-discrete). This can be quite tricky to get working, depending on the complexity of the env and the obs/action spaces.
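
For instance, here is a minimal sketch of that idea (the wrapper class and all names in it are hypothetical, not part of Sample Factory; it assumes an underlying env that takes and returns per-agent lists): the per-agent observations are stacked into one joint observation, a MultiDiscrete action space gives the single policy one action head per agent, and the joint reward is taken as the sum of per-agent rewards.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class JointPolicyWrapper(gym.Env):
    """Hypothetical wrapper: expose an N-agent env as a single-agent env with a
    stacked observation and a multi-discrete (one head per agent) action space."""

    def __init__(self, multi_agent_env, num_agents, actions_per_agent):
        self.env = multi_agent_env
        self.num_agents = num_agents

        per_agent_obs = multi_agent_env.observation_space  # assumed to be a Box
        self.observation_space = spaces.Box(
            low=np.repeat(per_agent_obs.low[None], num_agents, axis=0),
            high=np.repeat(per_agent_obs.high[None], num_agents, axis=0),
        )
        # one discrete head per agent -> the policy outputs a joint action
        self.action_space = spaces.MultiDiscrete([actions_per_agent] * num_agents)

    def reset(self, **kwargs):
        obs, _infos = self.env.reset(**kwargs)  # per-agent lists (assumed convention)
        return np.stack(obs), {}

    def step(self, joint_action):
        # split the joint action back into one action per agent
        obs, rewards, terminated, truncated, _infos = self.env.step(list(joint_action))
        # joint ("team") reward: sum of the per-agent rewards
        return np.stack(obs), float(np.sum(rewards)), all(terminated), all(truncated), {}
```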

(2) In a multi-robot environment, is the purpose of num_agents just to increase the amount of sample data (i.e. to increase the total batch size per rollout), or does each network update need to process the data of a group of num_agents agents together?

I assume you're referring to vectorized envs like IsaacGym. We treat vectorized envs (where agents don't interact with each other) and true multi-agent envs the same way: num_agents determines how many decisions need to be generated to advance an env by one step.
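
Roughly, the contract looks like the toy env below (this is only an illustration of the convention, not the exact Sample Factory interface; see the multi-agent env docs for the real contract): the env declares num_agents, and every step it consumes a list of num_agents actions and returns per-agent lists, so the sampler has to produce num_agents decisions to advance it by one step.

```python
import gymnasium as gym
from gymnasium import spaces


class TwoAgentToyEnv(gym.Env):
    """Illustrative toy env using a 'per-agent lists of length num_agents' convention."""

    num_agents = 2  # the sampler must supply this many actions per env step

    def __init__(self):
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(4,))
        self.action_space = spaces.Discrete(3)

    def reset(self, **kwargs):
        obs = [self.observation_space.sample() for _ in range(self.num_agents)]
        return obs, [{} for _ in range(self.num_agents)]

    def step(self, actions):
        assert len(actions) == self.num_agents  # one decision per agent per step
        obs = [self.observation_space.sample() for _ in range(self.num_agents)]
        rewards = [0.0] * self.num_agents
        terminated = [False] * self.num_agents
        truncated = [False] * self.num_agents
        infos = [{} for _ in range(self.num_agents)]
        return obs, rewards, terminated, truncated, infos
```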

(3) After one of the agents completes its task, we will set is_active=True. If the first interpretation in (2) is correct, is it enough to simply discard that agent's samples? Can I then understand this as still being a single-agent problem, except that the environment this agent experiences during training changes along with its policy?

I assume you meant is_active=False?
Setting this flag to false will effectively discard samples for this agent until the end of the episode or until the flag is set back to true.
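
A rough sketch of how the env side might look (I'm assuming here that is_active is passed through the per-agent info dict, and the task_completed signal is purely hypothetical; check the Sample Factory docs/code for the actual mechanism):

```python
class InactiveAgentTracker:
    """Hypothetical wrapper around a multi-agent env: once an agent has finished
    its own task, flag its subsequent transitions as inactive so they can be
    discarded from training until the episode ends."""

    def __init__(self, env):
        self.env = env
        self.num_agents = env.num_agents
        self._task_done = [False] * self.num_agents

    def reset(self, **kwargs):
        self._task_done = [False] * self.num_agents
        return self.env.reset(**kwargs)

    def step(self, actions):
        obs, rewards, terminated, truncated, infos = self.env.step(actions)
        for i in range(self.num_agents):
            if infos[i].get("task_completed", False):  # hypothetical per-env signal
                self._task_done[i] = True
            # assumption: is_active is communicated via the per-agent info dict
            infos[i]["is_active"] = not self._task_done[i]
        return obs, rewards, terminated, truncated, infos
```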

(4) Multi-policy training: if the agents are homogeneous and share the same policy, can multi-policy training be used, or does only one policy need to be set?

These are really just separate settings in the setup. You can train multiple policies with a single-agent env, or a single policy with a multi-agent env (a single policy controls all agents), or multiple policies in a multi-agent env (there is a mapping between policies and agents). See this page for details: https://www.samplefactory.dev/07-advanced-topics/multi-policy-training/?h=mapping#multi-policy-training
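
To make the distinction concrete, here is a tiny illustrative sketch of a policy-to-agent mapping (the function is hypothetical and only shows the idea; the real mechanism is described on the docs page linked above):

```python
def agent_to_policy(agent_idx: int, num_policies: int) -> int:
    """Round-robin mapping from agents to policies, for illustration only."""
    return agent_idx % num_policies


# num_policies=1: a single shared policy controls all 4 homogeneous agents
assert [agent_to_policy(i, num_policies=1) for i in range(4)] == [0, 0, 0, 0]

# num_policies=2: two policies alternate over the agents (e.g. self-play-style setups)
assert [agent_to_policy(i, num_policies=2) for i in range(4)] == [0, 1, 0, 1]
```

With num_policies=1 the mapping trivially assigns every agent to the same shared policy, which corresponds to the "single policy controls all agents" case mentioned above.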

Thank you very much for your answer, but I still have some confusion.

The typical config (e.g. the VizDoom multi-agent examples) does single-agent optimization, also known as self-play.

In the VizDoom multi-agent environment you mentioned, are there multiple agents simultaneously attacking the same target? And does only the agent that actually kills the target receive the reward?

I assume you're referring to vectorized envs like IsaacGym. We treat vectorized envs (where agents don't interact with each other) and true multi-agent envs the same way: num_agents determines how many decisions need to be generated to advance an env by one step.

Do the "true multi-agent envs" you mentioned refer to real-world environments, or something else?
In a single-agent environment, batch_size = num_envs * num_steps, while in a multi-agent environment, batch_size = num_envs * num_steps * num_agents. Is this only valid for vectorized envs? Is my understanding correct?
I would also like to know how num_agents determines how many decisions need to be generated to advance the environment; which part of the code should I look at to learn about this?

Thank you again for your reply!

In the VizDoom multi-agent environment you mentioned, are there multiple agents simultaneously attacking the same target? And does only the agent that actually kills the target receive the reward?

This is usually up to the environment to resolve.
All actions taken by the agents are considered to be processed on the same frame, but internally the game may, for example, handle Agent 1's action first.
For turn-based games you would perhaps need to do something a bit more complicated.

Do the "true multi-agent envs" you mentioned refer to real-world environments, or something else?
In a single-agent environment, batch_size = num_envs * num_steps, while in a multi-agent environment, batch_size = num_envs * num_steps * num_agents. Is this only valid for vectorized envs? Is my understanding correct?
I would also like to know how num_agents determines how many decisions need to be generated to advance the environment; which part of the code should I look at to learn about this?

By "true multi-agent" I meant the environments where agents actually interact with each other, i.e. agents are present in the same virtual world. In VizDoom duel scenario, agents play against one another.
In IsaacGym num_agents can be 2048 or more, but these agents don't interact with one another, hence it's not a "true" multi-agent env, but we handle it as such for simplicity

Thank you very much for your prompt reply!

This is usually up to the environment to resolve.
All actions taken by the agents are considered to be processed on the same frame, but internally the game may, for example, handle Agent 1's action first.
For turn-based games you would perhaps need to do something a bit more complicated.

I think I understand what you mean. I should learn more about self-play next.

By "true multi-agent" I meant the environments where agents actually interact with each other, i.e. agents are present in the same virtual world. In VizDoom duel scenario, agents play against one another.
In IsaacGym num_agents can be 2048 or more, but these agents don't interact with one another, hence it's not a "true" multi-agent env, but we handle it as such for simplicity

Also, I would like to confirm whether my understanding of the batch size formula is accurate, i.e. batch_size = num_envs * num_steps * num_agents?