Understanding Training and Self-Play Agents
nil123532 opened this issue · 4 comments
Hello,
Firstly, thank you for providing such a comprehensive GitHub repository on multi-agent RL. I'm new to the field of reinforcement learning and have some questions regarding the project:
In the human_aware_rl/ppo directory, it appears that a PPO agent is trained alongside a pre-trained Behavioral Cloning (BC) agent. Could you provide some guidance on how to modify this setup to train two PPO agents together, similar to the approach taken in PantheonRL?
The human_aware_rl/imitation directory suggests that a BC agent is trained using previously collected human data. Could you confirm this?
I'm particularly interested in understanding which of these setups qualifies as self-play. My assumption is that the first case might be considered self-play, but given that one agent is a BC agent, I'm not sure it meets the traditional definition of self-play, e.g. the approach used in PantheonRL, where you can train a PPO ego agent and a PPO alt agent with stable-baselines3.
Thank you for your time; I look forward to your response.
Best regards
Hi! Thanks for reaching out!
This file runs all the experiments in true self-play, and this one with PPO-BC. I believe the Python file automatically figures out which type of training run you're trying to launch based on the arguments you pass in; you can verify this yourself by following the execution starting from this file.
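Roughly, the pattern looks like this toy sketch (the config key here is made up, not our actual argument name):

```python
# Purely illustrative sketch; "bc_model_dir" is a made-up key, not the actual
# config name. The point is that one training entry point branches on the
# arguments you pass in to decide which kind of run to do.
def resolve_run_type(params):
    if params.get("bc_model_dir"):
        return "ppo_bc"     # PPO trained alongside a frozen, pre-trained BC partner
    return "self_play"      # PPO trained with a copy of its own policy

print(resolve_run_type({"layout": "cramped_room"}))                       # self_play
print(resolve_run_type({"layout": "cramped_room", "bc_model_dir": "x"}))  # ppo_bc
```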
Training with BC is called BC-play, PPO-BC, or human-aware RL (as in our paper). I'd recommend reading the paper for more intuition on the various setups! You're right that none of these count as self-play.
Yes, by default the BC agents in the imitation directory use the human gameplay data that we collected. Again, I encourage you to double-check the code to see exactly how things are done.
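If it helps, at its core BC is just supervised learning on (state, action) pairs from that data. A self-contained toy version (not our actual implementation or data format) looks like this:

```python
# Toy behavioral-cloning sketch, illustrative only; the real BC code has its own
# data pipeline and state featurization. BC is supervised learning that predicts
# the human's action from the encoded game state.
import torch
import torch.nn as nn

states = torch.randn(512, 96)           # stand-in for encoded human game states
actions = torch.randint(0, 6, (512,))   # stand-in for the 6 discrete Overcooked actions

policy = nn.Sequential(nn.Linear(96, 64), nn.ReLU(), nn.Linear(64, 6))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(50):
    optimizer.zero_grad()
    loss = loss_fn(policy(states), actions)   # imitate the human actions
    loss.backward()
    optimizer.step()
```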
Let me know if you have any other questions!
Thank you for the quick and detailed reply; it clarified many aspects for me. I've gone through the scripts and seen how the Sacred library manages the experiment settings based on the arguments passed in, which is quite impressive.
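For instance, I now see that a minimal Sacred setup like this toy example (not your actual experiment file) lets command-line arguments override the config defaults:

```python
# Toy Sacred experiment to check my understanding (not the repo's actual code).
# Running `python toy_experiment.py with bc_model_dir=/some/path` overrides the
# default below and switches the run type; the config names here are made up.
from sacred import Experiment

ex = Experiment("toy_experiment")

@ex.config
def default_config():
    layout = "cramped_room"   # example layout name
    bc_model_dir = None       # None means self-play in this toy example

@ex.automain
def run(layout, bc_model_dir):
    run_type = "ppo_bc" if bc_model_dir else "self_play"
    print(f"layout={layout}, run_type={run_type}")
```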
I have a couple more questions I'd like to explore:
In the context of PPO self-play, are both agents being trained, or just one? In other words, does each agent have its own distinct policy, or is there one unified policy that both agents follow? My understanding is that it's a single shared policy; is that right?
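To make sure I'm asking this clearly, here is a toy sketch (not your actual code) of what I mean by a unified policy:

```python
# Toy illustration of a "unified" self-play policy (not the actual repo code):
# the same network parameters choose actions for both players.
import torch
import torch.nn as nn

shared_policy = nn.Sequential(nn.Linear(96, 64), nn.ReLU(), nn.Linear(64, 6))

def act(observation):
    # both players call the same network, so there is only one set of weights
    with torch.no_grad():
        return int(shared_policy(observation).argmax())

obs_p0 = torch.randn(96)   # stand-in for player 0's observation
obs_p1 = torch.randn(96)   # stand-in for player 1's observation
joint_action = (act(obs_p0), act(obs_p1))
```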
If I've understood correctly, the agents are initialized in the constructor, and the joint_action variable in the step method is used to step through the environment, receiving rewards in return. Could you confirm if my understanding is accurate?
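Concretely, this is the picture I have in mind (just a toy environment, not the actual OvercookedEnv):

```python
# Toy two-player environment to illustrate my understanding (not OvercookedEnv):
# both agents' actions are packed into a joint action and passed to step(),
# which advances the environment and returns a shared reward.
class ToyTwoPlayerEnv:
    def __init__(self, horizon=10):
        self.horizon = horizon
        self.t = 0

    def step(self, joint_action):
        a0, a1 = joint_action
        self.t += 1
        reward = 1 if a0 == a1 else 0            # dummy shared reward
        done = self.t >= self.horizon
        return {"t": self.t}, reward, done, {}   # state, reward, done, info

env = ToyTwoPlayerEnv()
state, reward, done, info = env.step((2, 2))
```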
I'm thoroughly impressed by your work and eager to understand it more deeply.
Thank you again for taking the time to assist me.
Ahah!
Thank you so much!