/Soccer-PPO

Udacity Deep Reinforcement Learning Nanodegree Program

DRL - PPO - Soccer Twos

Observations:

  • To run the project, execute the main.py file.
  • If you are not on Windows, you will need to download the "Soccer" environment build for your operating system. Email me if you need more details about the environment .exe file.
  • The included checkpoint.pth has already reached the target average score.

Requirements:

  • tensorflow: 1.7.1
  • Pillow: 4.2.1
  • matplotlib
  • numpy: 1.11.0
  • pytest: 3.2.2
  • docopt
  • pyyaml
  • protobuf: 3.5.2
  • grpcio: 1.11.0
  • torch: 0.4.1
  • pandas
  • scipy
  • ipykernel
  • jupyter: 5.6.0

The problem:

  • The task is a soccer game with 2 teams, each with 2 players: one striker and one goalie.
  • No success criterion is defined by default, so I decided to train against a random team until my agents achieve 95 wins in 100 games.
  • The goalies have 4 actions.
  • The strikers have 6 actions.
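
The stopping criterion above (95 wins in 100 games) could be tracked with something like the sketch below. It assumes the 100 games are a sliding window over the most recent results, which this README does not state. The `WinRateTracker` class and its method names are hypothetical and are not taken from main.py.

```python
from collections import deque


class WinRateTracker:
    """Hypothetical helper: tracks wins over the most recent `window` games."""

    def __init__(self, window=100, target_wins=95):
        # Assumption: a sliding window of the latest games, not a fixed evaluation block.
        self.results = deque(maxlen=window)
        self.target_wins = target_wins

    def record(self, won):
        """Store the outcome of one game (True for a win, False for a draw or loss)."""
        self.results.append(bool(won))

    def solved(self):
        """True once the window is full and contains at least `target_wins` wins."""
        window_full = len(self.results) == self.results.maxlen
        return window_full and sum(self.results) >= self.target_wins
```

Draws count as non-wins here, which makes the criterion stricter than a simple "did not lose" check.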

The solution:

  • The biggest problem in this scenario is controlling the exploration vs. exploitation trade-off. I tried approaches such as Double DQN with exponential exploration-rate decay, as well as DDPG with prioritized experience replay to diversify the experiences used for learning, but I couldn't find a hyperparameter configuration that made the agents converge.
  • So I switched to PPO, since this kind of method is easier to configure and handles exploration well on its own by sampling actions from the policy's probability distribution. After many different implementations, I reached the current solution.
  • There are still some items to improve, such as convergence time and multi-team training, but I'm satisfied with the current results. On my last test, I achieved the goal (95 wins in 100 games) in a little more than 5000 episodes, which I consider a great result given all the attempts I made before.
  • It was really good for my learning, as I had not used PPO at this level before trying this environment, and the knowledge acquired here will be very relevant for my next projects.
  • As for the implementation, it uses an actor-critic neural model trained with a proximal policy optimization objective, following the trust region approach. Learning happens after each episode (episode boundaries are controlled by the environment) and uses mini-batches drawn from the episode's experiences. Before that, the rewards are converted to returns with the N-step method, which combines temporal-difference discounting with Monte Carlo returns (in this case the N-step range is the whole episode).
  • Next, I'll try other variations, changing when the learning happens and using multiple teams for experience gathering. I hope to achieve superhuman results in 5000 episodes or fewer (the agents are good, but not superhuman, after 5000 episodes).
  • One last consideration: beating a random team looks easy at first, but if you consider that random agents win 1/3 of the games and that 1/3 of random games end in a draw, the AI overcame a big challenge in reaching a 95% win rate. It's incredible how a random agent can score in just a few steps.
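
The reward calculation and PPO update described above can be illustrated with a minimal pure-Python sketch. It is not the code from this repository. It assumes that the N-step return over a whole episode reduces to the discounted sum of rewards (no bootstrap value after the final step), and that the "Epsilon" hyperparameter is the PPO clipping range. The function names are hypothetical; the defaults reuse the Gamma and Epsilon values listed in this README.

```python
def discounted_returns(rewards, gamma=0.995):
    """Full-episode discounted returns: G_t = r_t + gamma * G_{t+1}.

    Assumption: the value after the last step is zero (no bootstrapping).
    """
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def ppo_clipped_term(new_prob, old_prob, advantage, epsilon=0.1):
    """Clipped surrogate objective for a single sampled action.

    `advantage` would typically be the return minus the critic's value estimate.
    Assumption: `epsilon` is the clipping range around a probability ratio of 1.
    """
    ratio = new_prob / old_prob
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    # Taking the minimum removes the incentive to move the policy far from the old one.
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_policy_loss(new_probs, old_probs, advantages, epsilon=0.1):
    """Loss over a mini-batch: the negative mean of the clipped objective."""
    terms = [
        ppo_clipped_term(new_p, old_p, adv, epsilon)
        for new_p, old_p, adv in zip(new_probs, old_probs, advantages)
    ]
    return -sum(terms) / len(terms)
```

A full update would usually add a critic (value) loss and subtract an entropy bonus scaled by the entropy weight; those terms are left out here to keep the sketch short.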

The hyperparameters:

  • The hyperparameters are configured in main.py.

  • If you want, you can change the model configuration in the model.py file.

  • The current hyperparameter configuration is:

    • Learning Rate Goalie: 8e-5
    • Learning Rate Striker: 1e-4
    • Gamma: 0.995
    • Batch Size: 32
    • Epsilon: 0.1
    • Entropy Weight: 0.001
  • For the neural models:

    • Actor

      • Hidden: (input, 256) - ReLU
      • Hidden: (256, 128) - ReLU
      • Output: (128, action_size) - Softmax
    • Critic

      • Hidden: (input, 256) - ReLU
      • Hidden: (256, 128) - ReLU
      • Output: (128, 1) - Linear
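
The two networks listed above could be written in PyTorch roughly as follows. This is a sketch based only on the layer sizes in this README, not the contents of model.py. The class names and the `state_size` parameter are assumptions, and the observation size used in the usage note is a placeholder.

```python
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Policy network: maps a state to a probability distribution over actions."""

    def __init__(self, state_size, action_size):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, action_size), nn.Softmax(dim=-1),
        )

    def forward(self, state):
        return self.net(state)


class Critic(nn.Module):
    """Value network: maps a state to a single scalar value estimate."""

    def __init__(self, state_size):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 1),  # linear output, no activation
        )

    def forward(self, state):
        return self.net(state)
```

With the action counts given earlier, a striker policy would be `Actor(state_size, 6)` and a goalie policy `Actor(state_size, 4)`, where `state_size` is whatever observation length the environment reports.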