Udacity Deep Reinforcement Learning Nanodegree Program
- To run the project, execute the main.py file.
- If you are not using a Windows environment, you will need to download the "Soccer" environment build for your operating system. Mail me if you need more details about the environment .exe file.
- The checkpoint.pth file contains a model that has already reached the target score.
- tensorflow: 1.7.1
- Pillow: 4.2.1
- matplotlib
- numpy: 1.11.0
- pytest: 3.2.2
- docopt
- pyyaml
- protobuf: 3.5.2
- grpcio: 1.11.0
- torch: 0.4.1
- pandas
- scipy
- ipykernel
- jupyter: 5.6.0
- The task is a soccer game with 2 teams of 2 players each: 1 striker and 1 goalie (keeper).
- The environment defines no target score by default, so I decided to train against a random team until my agents achieved 95 wins in 100 games.
- The goalies have 4 actions.
- The strikers have 6 actions.
- The biggest problem in this scenario is controlling the balance between exploration and exploitation. I tried Double DQN with an exponentially decaying exploration rate, as well as DDPG with prioritized experience replay to diversify the experiences used for learning, but I couldn't find a hyperparameter configuration that made the agents converge.
- So I switched to PPO, since this kind of method is easier to configure and handles exploration well on its own by sampling actions from a stochastic policy. After many different implementations, I reached the current solution.
- There are still items to improve, such as convergence time and multi-team training, but I'm satisfied with the current results. In my last test, I achieved the goal (95 wins in 100 games) in a little more than 5000 episodes, which I consider a great result compared with all my earlier attempts.
- It was really good for my learning: I had not used PPO at this level before trying this environment, and the knowledge acquired here will be very relevant for my next projects.
- The implementation uses an actor-critic neural model trained with proximal policy optimization (PPO), which limits the size of each policy update in the spirit of trust-region methods. Learning happens at the end of each episode (episode boundaries are controlled by the environment). Rewards are first converted to discounted N-step returns, which blend temporal-difference discounting with Monte Carlo returns; here the N-step range is the whole episode, so the returns are effectively full Monte Carlo returns. The update then draws mini-batches from the episode's experiences.
- Next, I'll try other variations that change when the learning happens and use multiple teams for experience gathering. I hope to achieve superhuman results in 5000 episodes or fewer (after 5000 episodes the agents are good, but not superhuman).
- One last consideration: beating a random team looks easy at first, but considering that random agents win 1/3 of the games and the draw rate of random games is also 1/3, reaching a 95% win rate means the AI has overcome a big challenge. It's incredible how a random agent can score within just a few steps.
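The whole-episode return calculation mentioned above can be sketched roughly as follows. This is a minimal illustration and not the code from main.py: the function name `discounted_returns` and the sample rewards are assumptions, and only the discount factor (Gamma: 0.995) comes from this README.

```python
def discounted_returns(rewards, gamma=0.995):
    """Return the discounted return G_t for every step of one episode.

    Hypothetical helper: with the N-step range covering the whole episode,
    G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... until the episode ends.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # Made-up episode: no reward until a goal is scored on the last step.
    print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))
```

Computing the returns backwards keeps the cost linear in the episode length, which matters when episodes run for many steps.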
- The hyperparameters are configured in the main.py file.
- If you want, you can change the model configuration in the model.py file.
- The current hyperparameter configuration is:
- Learning Rate Goalie: 8e-5
- Learning Rate Striker: 1e-4
- Gamma: 0.995
- Batch Size: 32
- Epsilon: 0.1
- Entropy Weight: 0.001
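To show where two of these values could enter the update, here is a rough sketch of the standard PPO clipped-surrogate loss with an entropy bonus. It assumes that "Epsilon" is the PPO clip range and that "Entropy Weight" scales the entropy bonus; the function name `ppo_loss` and its arguments are hypothetical, and the project's actual loss in main.py may differ.

```python
import math


def ppo_loss(new_log_probs, old_log_probs, advantages, entropies,
             epsilon=0.1, entropy_weight=0.001):
    """Mean PPO loss (to be minimized) over one mini-batch.

    Hypothetical sketch: each argument is a list with one entry per sample.
    """
    total = 0.0
    for new_lp, old_lp, adv, ent in zip(new_log_probs, old_log_probs,
                                        advantages, entropies):
        # Probability ratio between the updated policy and the old policy.
        ratio = math.exp(new_lp - old_lp)
        # Keep the ratio inside [1 - epsilon, 1 + epsilon].
        clipped = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
        # Pessimistic (minimum) surrogate, as in the standard PPO objective.
        surrogate = min(ratio * adv, clipped * adv)
        # Negate because optimizers minimize; entropy encourages exploration.
        total += -(surrogate + entropy_weight * ent)
    return total / len(new_log_probs)
```

With a clip range of 0.1, a sample with a positive advantage stops contributing extra gain once the new policy makes its action more than 10% more likely than the old policy did, which keeps each update small.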
- For the neural models:
- Actor
- Hidden: (input, 256) - ReLU
- Hidden: (256, 128) - ReLU
- Output: (128, action_size) - Softmax
- Critic
- Hidden: (input, 256) - ReLU
- Hidden: (256, 128) - ReLU
- Output: (128, 1) - Linear
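The layer sizes listed above could be expressed in PyTorch roughly as below. This is a sketch based only on the listed layers, not a copy of model.py: the class and attribute names are assumptions, and the `state_size` value in the usage example is a placeholder, since the real observation size comes from the environment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    """Policy network: state -> probability for each discrete action."""

    def __init__(self, state_size, action_size):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 256)
        self.fc2 = nn.Linear(256, 128)
        self.out = nn.Linear(128, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        # Softmax over the last dimension yields action probabilities.
        return F.softmax(self.out(x), dim=-1)


class Critic(nn.Module):
    """Value network: state -> single scalar value estimate."""

    def __init__(self, state_size):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 256)
        self.fc2 = nn.Linear(256, 128)
        self.out = nn.Linear(128, 1)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        # Linear output: no activation on the value estimate.
        return self.out(x)


if __name__ == "__main__":
    state_size = 8  # placeholder only; use the environment's observation size
    striker_actor = Actor(state_size, action_size=6)  # strikers have 6 actions
    goalie_actor = Actor(state_size, action_size=4)   # goalies have 4 actions
    critic = Critic(state_size)
    batch = torch.zeros(3, state_size)
    print(striker_actor(batch).shape, goalie_actor(batch).shape, critic(batch).shape)
```

Sampling an action from the actor's output, for example with `torch.distributions.Categorical`, is what gives PPO its built-in exploration.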
-