This project is part of Udacity's Deep Reinforcement Learning Nanodegree and is called Project 3: Collaboration and Competition. The model was trained on a MacBook Air 2017 with 8GB RAM and an Intel Core i5 processor.
The project features a single environment with two agents. Each agent is tasked with playing tennis (or table tennis, as it appears from the environment). The rules are simple: if an agent misses the ball, or hits it out of bounds, its opponent scores a point. Whenever an agent concedes a point in one of these ways it receives a negative reward (or punishment) of -0.01, and whenever an agent scores a point against its opponent it receives a positive reward of +0.1. This environment is part of Unity ML-Agents.
The state space has 8 dimensions corresponding to the position and velocity of the ball and racket. Each agent has 2 continuous actions, corresponding to moving toward (or away from) the net and jumping.
The agents' task is episodic, and the environment is considered solved when the agents achieve an average score of at least +0.5 over 100 consecutive episodes.
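The solve criterion above can be sketched as a small check over the score history (a minimal sketch: the function and variable names are illustrative, not taken from this repository, and the convention of scoring each episode as the maximum over the two agents is an assumption from the standard Tennis project setup):

```python
import numpy as np

def is_solved(episode_scores, window=100, target=0.5):
    """True once the average score over the last `window`
    episodes reaches `target` (+0.5 for this project)."""
    if len(episode_scores) < window:
        return False
    return np.mean(episode_scores[-window:]) >= target

# Per episode, the score tracked is (assumed) the max over both agents.
agent_scores = [0.09, 0.10]
episode_score = max(agent_scores)
```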
For this task I used Multi-Agent Deep Deterministic Policy Gradients (MADDPG), a multi-agent actor-critic method.
The Actor model takes the current 8-dimensional state as input and passes it through two (2) multi-layer perceptron layers with ReLU activation, followed by an output layer with two (2) nodes activated with tanh, which gives the action to take in the current state.
The Critic model takes as input the current 8-dimensional state and the 2-dimensional action. The state is passed through the first multi-layer perceptron layer with ReLU activation; only after the first layer's activation is computed are the actions introduced, concatenated with that output and fed into the second layer, also ReLU-activated. The final layer has a single node with linear activation, which gives the Q-value for the corresponding (state, action) pair.
The thing I loved about this is that even though both agents share the same neural network architecture and the same replay buffer, one agent is still able to win; this comes from the stochastic nature of neural network training, as opposed to a deterministic process ❤️.
- Clone the repository:
user@programmer:~$ git clone https://github.com/frankhart2018/multi-agent-tennis
- Install the requirements:
user@programmer:~$ pip install -r requirements.txt
- Download your OS specific unity environment:
- Linux: click here
- MacOS: click here
- Windows (32 bit): click here
- Windows (64 bit): click here
- Update the tennis app location in the indicated place according to your OS.
- Unzip the downloaded environment file
- If you prefer using jupyter notebook then launch the jupyter notebook instance:
user@programmer:~$ jupyter-notebook
➡️ For re-training the agent use Tennis Multi Agent.ipynb
➡️ For testing the trained agent use Tennis Multi Agent Tester.ipynb
In case you would rather run a python script, use:
➡️ For re-training the agent type:
user@programmer:~$ python train.py
➡️ For testing the trained agent use:
user@programmer:~$ python test.py
- Unity ML Agents
- PyTorch
- NumPy
- Matplotlib
- Multi-Layer Perceptron.
- Multi-Agent Deep Deterministic Policy Gradients. To learn more about this algorithm, you can read the original paper by OpenAI: Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments
The Actor Network has three dense (or fully connected) layers. The first two layers have 256 and 128 nodes respectively, each activated with the ReLU activation function. The final (output) layer has 2 nodes and is activated with tanh. This network takes the 8-dimensional current state as input and outputs the 2-dimensional action the agent is supposed to take in that state.
The Critic Network has three dense (or fully connected) layers. The first two layers have 300 and 128 nodes respectively, each activated with the ReLU activation function. The final (output) layer has a single node and is activated with linear activation (no activation at all). This network takes as input the 8-dimensional current state and the 2-dimensional action, and outputs a single real number: the Q-value for that state and the action taken in it.
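The two architectures described above can be sketched in PyTorch as follows (layer sizes follow the text: 256→128 for the actor, 300→128 for the critic; class and attribute names are illustrative, not the repository's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Maps an 8-dimensional state to a 2-dimensional action in [-1, 1]."""
    def __init__(self, state_size=8, action_size=2):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 256)
        self.fc2 = nn.Linear(256, 128)
        self.out = nn.Linear(128, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.out(x))  # tanh keeps actions in [-1, 1]

class Critic(nn.Module):
    """Maps a (state, action) pair to a scalar Q-value; the action is
    concatenated in only after the first hidden layer, as described."""
    def __init__(self, state_size=8, action_size=2):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 300)
        self.fc2 = nn.Linear(300 + action_size, 128)
        self.out = nn.Linear(128, 1)  # linear output: the Q-value

    def forward(self, state, action):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(torch.cat([x, action], dim=1)))
        return self.out(x)
```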
Both neural networks use the Adam optimizer and Mean Squared Error (MSE) as the loss function.
The following image provides a pictorial representation of the Actor Network model:
The following image provides a pictorial representation of the Critic Network model:
The following image provides the plot for score v/s episode number:
Hyperparameter | Value | Description |
---|---|---|
Buffer size | 100000 | Maximum size of the replay buffer |
Batch size | 256 | Batch size for sampling from replay buffer |
Gamma (γ) | 0.99 | Discount factor for calculating return |
Tau (τ) | 0.01 | Hyperparameter for soft update of target parameters |
Learning Rate Actor | 0.001 | Learning rate for the actor neural network |
Learning Rate Critic | 0.001 | Learning rate for the critic neural network |
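As a sketch of how Tau and the learning rates in the table are used (illustrative, not the repository's code): each learning step, the target network parameters are nudged toward the local ones by the soft-update rule θ_target ← τ·θ_local + (1 − τ)·θ_target, while the local networks are trained with Adam at the listed learning rates.

```python
import torch

def soft_update(local_model, target_model, tau=0.01):
    """Blend local parameters into target: θ' ← τ·θ + (1 − τ)·θ'."""
    for t_param, l_param in zip(target_model.parameters(),
                                local_model.parameters()):
        t_param.data.copy_(tau * l_param.data + (1.0 - tau) * t_param.data)

# Toy local/target pair standing in for the actor (or critic) networks
local = torch.nn.Linear(2, 2)
target = torch.nn.Linear(2, 2)
optimizer = torch.optim.Adam(local.parameters(), lr=1e-3)  # LR from table
soft_update(local, target, tau=0.01)  # Tau from table
```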