This project is part of Udacity's Deep Reinforcement Learning Nanodegree and is called Project 3: Collaboration and Competition. The model was trained on a MacBook Air 2017 with 8GB RAM and an Intel Core i5 processor.
The project features a single environment with two agents. Each agent is tasked with playing tennis (or table tennis, as it appears from the environment). The rules are simple: if an agent misses the ball, or hits it out of bounds, its opponent scores a point. Whenever an agent concedes a point in one of these ways it receives a negative reward (or punishment) of -0.01, and whenever an agent scores a point against its opponent it receives a positive reward of +0.1. This environment is part of Unity ML-Agents.
The state space has 8 dimensions corresponding to the position and velocity of the ball and racket. Each agent has 2 continuous actions, corresponding to moving toward (or away from) the net and jumping.
The agents' task is episodic, and the environment is considered solved when the agents achieve an average score of at least +0.5 over 100 consecutive episodes.
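The solve criterion above can be sketched as a small check over the score history (a minimal sketch: the function and variable names are illustrative, not taken from this repository, and the convention of scoring each episode as the maximum over the two agents is an assumption from the standard Tennis project setup):

```python
import numpy as np

def is_solved(episode_scores, window=100, target=0.5):
    """True once the average score over the last `window`
    episodes reaches `target` (+0.5 for this project)."""
    if len(episode_scores) < window:
        return False
    return np.mean(episode_scores[-window:]) >= target

# Per episode, the score tracked is (assumed) the max over both agents.
agent_scores = [0.09, 0.10]
episode_score = max(agent_scores)
```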
For this task I used Multi-Agent Deep Deterministic Policy Gradients (MADDPG), a multi-agent actor-critic method.
The Actor model takes the current 8-dimensional state as input and passes it through two (2) multi-layer perceptron layers with ReLU activation, followed by an output layer with two (2) nodes activated with tanh, which gives the action to take in the current state.
The Critic model takes as input the current 8-dimensional state and the 2-dimensional action. The state is passed through the first multi-layer perceptron layer with ReLU activation; only after the first layer's activation is computed are the actions introduced, concatenated with that output and fed into the second layer, also ReLU-activated. The final layer has a single node with linear activation, which gives the Q-value for the corresponding (state, action) pair.
The thing I loved about this is that even though both agents share the same neural network architecture and the same replay buffer, one agent is still able to win; this comes from the stochastic nature of neural network training, as opposed to a deterministic process ❤️.
- Clone the repository:
user@programmer:~$ git clone https://github.com/frankhart2018/multi-agent-tennis
- Install the requirements:
user@programmer:~$ pip install -r requirements.txt
- Download your OS specific unity environment:
- Linux: click here
- MacOS: click here
- Windows (32 bit): click here
- Windows (64 bit): click here
- Update the tennis app location in the indicated place according to your OS.
- Unzip the downloaded environment file
- If you prefer using jupyter notebook then launch the jupyter notebook instance:
user@programmer:~$ jupyter-notebook
➡️ For re-training the agent use Tennis Multi Agent.ipynb
➡️ For testing the trained agent use Tennis Multi Agent Tester.ipynb
In case you would rather run a python script, use:
➡️ For re-training the agent type:
user@programmer:~$ python train.py
➡️ For testing the trained agent use:
user@programmer:~$ python test.py
- Unity ML Agents
- PyTorch
- NumPy
- Matplotlib
- Multi-Layer Perceptron.
- Multi-Agent Deep Deterministic Policy Gradients. To learn more about this algorithm, you can read the original paper by OpenAI: Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments
The Actor Network has three dense (or fully connected) layers. The first two layers have 256 and 128 nodes respectively, each activated with the ReLU activation function. The final (output) layer has 2 nodes and is activated with tanh. This network takes the 8-dimensional current state as input and outputs the 2-dimensional action the agent is supposed to take in that state.
The Critic Network has three dense (or fully connected) layers. The first two layers have 300 and 128 nodes respectively, each activated with the ReLU activation function. The final (output) layer has a single node and is activated with linear activation (no activation at all). This network takes as input the 8-dimensional current state and the 2-dimensional action, and outputs a single real number: the Q-value for that state and the action taken in it.
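The two architectures described above can be sketched in PyTorch as follows (layer sizes follow the text: 256→128 for the actor, 300→128 for the critic; class and attribute names are illustrative, not the repository's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Maps an 8-dimensional state to a 2-dimensional action in [-1, 1]."""
    def __init__(self, state_size=8, action_size=2):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 256)
        self.fc2 = nn.Linear(256, 128)
        self.out = nn.Linear(128, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.out(x))  # tanh keeps actions in [-1, 1]

class Critic(nn.Module):
    """Maps a (state, action) pair to a scalar Q-value; the action is
    concatenated in only after the first hidden layer, as described."""
    def __init__(self, state_size=8, action_size=2):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 300)
        self.fc2 = nn.Linear(300 + action_size, 128)
        self.out = nn.Linear(128, 1)  # linear output: the Q-value

    def forward(self, state, action):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(torch.cat([x, action], dim=1)))
        return self.out(x)
```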
Both neural networks use the Adam optimizer and Mean Squared Error (MSE) as the loss function.
The following image provides a pictorial representation of the Actor Network model:
The following image provides a pictorial representation of the Critic Network model:
The following image provides the plot for score v/s episode number:
Hyperparameter | Value | Description |
---|---|---|
Buffer size | 100000 | Maximum size of the replay buffer |
Batch size | 256 | Batch size for sampling from replay buffer |
Gamma (γ) | 0.99 | Discount factor for calculating return |
Tau (τ) | 0.01 | Hyperparameter for soft update of target parameters |
Learning Rate Actor | 0.001 | Learning rate for the actor neural network |
Learning Rate Critic | 0.001 | Learning rate for the critic neural network |
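As a sketch of how Tau and the learning rates in the table are used (illustrative, not the repository's code): each learning step, the target network parameters are nudged toward the local ones by the soft-update rule θ_target ← τ·θ_local + (1 − τ)·θ_target, while the local networks are trained with Adam at the listed learning rates.

```python
import torch

def soft_update(local_model, target_model, tau=0.01):
    """Blend local parameters into target: θ' ← τ·θ + (1 − τ)·θ'."""
    for t_param, l_param in zip(target_model.parameters(),
                                local_model.parameters()):
        t_param.data.copy_(tau * l_param.data + (1.0 - tau) * t_param.data)

# Toy local/target pair standing in for the actor (or critic) networks
local = torch.nn.Linear(2, 2)
target = torch.nn.Linear(2, 2)
optimizer = torch.optim.Adam(local.parameters(), lr=1e-3)  # LR from table
soft_update(local, target, tau=0.01)  # Tau from table
```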