This project uses reinforcement learning to solve the persistent monitoring problem in a multi-agent setup. A decentralized system is trained with Proximal Policy Optimization (PPO) on various scenarios. No explicit cooperation is introduced: each agent receives only a local view (a 25x25 sq. unit area around the agent) and a compressed minimap (the 50x50 sq. unit environment compressed to 25x25 sq. units) as input, and then chooses one of 5 discrete actions (stay, move up, left, down, or right).
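The two inputs described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the zero padding at the map border, the 2x2 average pooling used to compress the map, and the ordering of the action table are all assumptions.

```python
import numpy as np

# Hypothetical action table: index -> (d_row, d_col). The ordering is an assumption.
ACTIONS = {0: (0, 0), 1: (-1, 0), 2: (0, -1), 3: (1, 0), 4: (0, 1)}

def local_view(grid, pos, size=25):
    """Crop a size x size window centered on the agent (zero padded at borders)."""
    half = size // 2
    padded = np.pad(grid, half, mode="constant", constant_values=0.0)
    r, c = pos
    # After padding, the agent sits at (r + half, c + half), so the
    # window starting at (r, c) is centered on it.
    return padded[r:r + size, c:c + size]

def minimap(grid, factor=2):
    """Compress the full map by averaging non-overlapping factor x factor blocks."""
    h, w = grid.shape
    return grid.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def observation(grid, pos):
    """Stack the local view and the minimap into a (2, 25, 25) array."""
    return np.stack([local_view(grid, pos), minimap(grid)])
```

With a 50x50 map, `observation(grid, (r, c))` returns two 25x25 channels that a policy network could consume. Other compression schemes (max pooling, strided sampling) would fit the description equally well.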
The environment is made of discrete elements, each of which accumulates a penalty value at a pre-defined decay rate until an agent observes the element within its visibility region. The sum of the penalty values over all elements of the map is used to train the agent. To achieve a high reward (low penalty), the agent must discover a behavior that keeps observing every discrete element in the map, hence the name Persistent Monitoring Problem.
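A rough sketch of this penalty mechanism is given below. It rests on several assumptions that the description does not confirm: the penalty grows additively by `decay_rate` per step, it resets to zero when observed, the visibility region is a square of hypothetical half-width `radius`, walls do not block visibility, and the reward is the negative sum of all penalties.

```python
import numpy as np

def step_penalty(penalty, agent_positions, decay_rate=1.0, radius=2):
    """Advance the penalty map by one time step and return (new_map, reward)."""
    # Every element accumulates penalty (assumed additive growth).
    penalty = penalty + decay_rate
    rows, cols = penalty.shape
    for r, c in agent_positions:
        # Assumed square visibility region, clipped to the map bounds.
        r0, r1 = max(r - radius, 0), min(r + radius + 1, rows)
        c0, c1 = max(c - radius, 0), min(c + radius + 1, cols)
        # Observed elements are reset (assumed reset value of zero).
        penalty[r0:r1, c0:c1] = 0.0
    # Lower total penalty means higher reward.
    reward = -float(penalty.sum())
    return penalty, reward
```

Under these assumptions, an agent that stays in one place lets the penalty of every unobserved element keep growing, so only a policy that revisits the whole map can keep the reward high.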
Logs of discussions and work on the project
-
A single agent was trained on an environment with 2 compartments (rooms). The final behavior can be seen below.
-
Two agents were trained on an environment with 2 compartments (rooms). The final behavior can be seen below.
-
Two agents were trained on an environment with 2 compartments (rooms). The final behavior can be seen below.