This project is part of Udacity's Deep Reinforcement Learning Nanodegree: Project 2, Continuous Control. The model was trained on a 2017 MacBook Air with 8 GB RAM and an Intel Core i5 processor.
The project has two versions of the environment: one with a single agent and one with twenty (20) agents. Each agent is tasked with following a green ball. If the agent successfully keeps up with its ball, the ball lights up (becomes opaque and turns light green); otherwise it remains translucent with a dark green color. The environment is part of Unity ML-Agents. An agent receives a reward of +0.1 for each step on which it successfully follows its ball. There are no negative rewards; if the agent fails to follow its ball on a given step, it simply receives a reward of 0.
The state space has 33 dimensions corresponding to the position, velocity, and angular velocity of the agent. The action is a 4-dimensional continuous vector corresponding to the torques applied to the agent's two joints, with each entry in the range [-1, 1].
The agent's task is episodic and is considered solved when the agent achieves an average score of at least +30 over 100 consecutive episodes.
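For reference, a minimal interaction loop with the environment looks roughly like the sketch below. It assumes the `unityagents` package and a downloaded `Reacher.app` build (adjust the file name for your OS); the random actions are only a stand-in for the trained agent.

```python
# Minimal interaction sketch (assumes unityagents and a local Reacher build).
import numpy as np
from unityagents import UnityEnvironment

env = UnityEnvironment(file_name="Reacher.app")    # adjust path for your OS
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

env_info = env.reset(train_mode=False)[brain_name]
num_agents = len(env_info.agents)                  # 1 or 20 depending on the build
action_size = brain.vector_action_space_size       # 4
scores = np.zeros(num_agents)

while True:
    # Random actions in [-1, 1]; the trained DDPG actor replaces this.
    actions = np.clip(np.random.randn(num_agents, action_size), -1, 1)
    env_info = env.step(actions)[brain_name]
    scores += env_info.rewards                     # +0.1 per agent per on-target step
    if np.any(env_info.local_done):
        break

print("Average score this episode:", scores.mean())
env.close()
```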
For this task I used Deep Deterministic Policy Gradient (DDPG), which is an Actor-Critic method.
The Actor model takes the current 33-dimensional state as input and passes it through two (2) fully connected layers with ReLU activation, followed by an output layer with four (4) nodes activated with tanh, which gives the action to take in the current state.
The Critic model takes as input the current 33-dimensional state and the 4-dimensional action. The state is passed through the first fully connected layer with ReLU activation; the action is then concatenated with the first layer's output and fed into the second fully connected layer, also with ReLU activation. The final layer has a single node with linear activation, which gives the Q-value for the corresponding (state, action) pair.
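A minimal PyTorch sketch of the two networks described above (layer sizes match the 400/300 units detailed in the Model section below; class and attribute names are illustrative, not necessarily those used in the repository):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Maps a 33-dim state to a 4-dim action in [-1, 1]."""
    def __init__(self, state_size=33, action_size=4, fc1=400, fc2=300):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1)
        self.fc2 = nn.Linear(fc1, fc2)
        self.out = nn.Linear(fc2, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.out(x))

class Critic(nn.Module):
    """Maps a (state, action) pair to a single Q-value; the action is
    concatenated after the first hidden layer."""
    def __init__(self, state_size=33, action_size=4, fc1=400, fc2=300):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1)
        self.fc2 = nn.Linear(fc1 + action_size, fc2)  # 400 + 4 = 404 inputs
        self.out = nn.Linear(fc2, 1)

    def forward(self, state, action):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(torch.cat([x, action], dim=1)))
        return self.out(x)
```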
The thing I truly loved is that even though the agent was trained on the environment with twenty (20) agents, it was able to generalize to a single (1) agent. ❤️ This is the power of generalization of a neural network 💪 😎.
- Clone the repository:
user@programmer:~$ git clone https://github.com/frankhart2018/reacher-agent
- Install the requirements:
user@programmer:~$ pip install -r requirements.txt
- Download your OS-specific Unity environment (single agent):
- Linux: click here
- MacOS: click here
- Windows (32 bit): click here
- Windows (64 bit): click here
- Download your OS-specific Unity environment (twenty agents):
- Linux: click here
- MacOS: click here
- Windows (32 bit): click here
- Windows (64 bit): click here
- Update the Reacher app location according to your OS in the mentioned place (see the path example after these instructions).
- Unzip the downloaded environment file
- If you prefer using Jupyter Notebook, launch a Jupyter Notebook instance:
user@programmer:~$ jupyter-notebook
➡️ For re-training the agent use Reacher Agent.ipynb
➡️ For testing twenty agents use Reacher Tester.ipynb
➡️ For testing a single agent use Reacher Tester One Agent.ipynb
- If you would like to run a Python script instead, use:
➡️ For re-training the agent type:
user@programmer:~$ python train.py
➡️ For testing twenty agents use:
user@programmer:~$ python test.py
➡️ For testing a single agent use:
user@programmer:~$ python test-one.py
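The app location referred to above is the `file_name` passed to `UnityEnvironment`. A hedged example (the exact folder and file names depend on the archive you downloaded and where you unzipped it):

```python
from unityagents import UnityEnvironment

# macOS:
env = UnityEnvironment(file_name="Reacher.app")
# Linux (example):          file_name="Reacher_Linux/Reacher.x86_64"
# Windows 64-bit (example): file_name="Reacher_Windows_x86_64/Reacher.exe"
```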
- Unity ML Agents
- PyTorch
- NumPy
- Matplotlib
- Multi-Layer Perceptron.
- Deep Deterministic Policy Gradient (DDPG). To learn more about this algorithm, you can read the original paper by DeepMind: Continuous Control with Deep Reinforcement Learning
The Actor Network has three dense (or fully connected) layers. The first two layers have 400 and 300 nodes respectively and are activated with the ReLU activation function. The final (output) layer has 4 nodes and is activated with tanh. This network takes the 33-dimensional current state as input and outputs a 4-dimensional vector, the action the agent is supposed to take in the current state.
The Critic Network has three dense (or fully connected) layers. The first two hidden layers have 400 and 300 nodes respectively and are activated with the ReLU activation function; the 4-dimensional action is concatenated with the first layer's 400-dimensional output, so the second layer receives 404 inputs. The final (output) layer has a single node with linear activation (no activation at all). This network takes as input the 33-dimensional current state and the 4-dimensional action and outputs a single real number, the Q-value for that (state, action) pair.
Both neural networks were optimized with Adam. The critic is trained with Mean Squared Error (MSE) loss against its bootstrapped target, while the actor is updated to maximize the critic's Q-value estimate.
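A hedged sketch of a single DDPG learning step with these choices (function and variable names are illustrative, not taken from the repository; `gamma` and `tau` correspond to the hyperparameters listed below, and all batch tensors are assumed to have shape `(batch, dim)`):

```python
import torch
import torch.nn.functional as F

def learn(states, actions, rewards, next_states, dones,
          actor, actor_target, critic, critic_target,
          actor_opt, critic_opt, gamma=0.99, tau=1e-3):
    # Critic update: minimise MSE between Q(s, a) and the bootstrapped target.
    with torch.no_grad():
        next_actions = actor_target(next_states)
        q_targets = rewards + gamma * critic_target(next_states, next_actions) * (1 - dones)
    critic_loss = F.mse_loss(critic(states, actions), q_targets)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: maximise Q(s, mu(s)) by minimising its negative.
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft-update the target networks towards the local networks.
    for target, local in ((actor_target, actor), (critic_target, critic)):
        for t_param, l_param in zip(target.parameters(), local.parameters()):
            t_param.data.copy_(tau * l_param.data + (1.0 - tau) * t_param.data)
```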
The following image provides a pictorial representation of the Actor Network model:
The following image provides a pictorial representation of the Critic Network model:
The following image provides the plot of score vs. episode number:
Hyperparameter | Value | Description |
---|---|---|
Buffer size | 100000 | Maximum size of the replay buffer |
Batch size | 128 | Batch size for sampling from replay buffer |
Gamma (γ) | 0.99 | Discount factor for calculating return |
Tau (τ) | 0.001 | Hyperparameter for soft update of target parameters |
Learning Rate Actor | 0.0003 | Learning rate for the actor neural network |
Learning Rate Critic | 0.001 | Learning rate for the critic neural network |
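For illustration, the buffer size and batch size above are typically used in a replay buffer along these lines (a sketch, not necessarily the exact implementation in the repository):

```python
import random
from collections import deque, namedtuple
import numpy as np

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    def __init__(self, buffer_size=100_000, batch_size=128, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest experiences are dropped first
        self.batch_size = batch_size
        random.seed(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Uniformly sample a mini-batch and stack each field into an array.
        batch = random.sample(self.memory, k=self.batch_size)
        return map(np.array, zip(*batch))

    def __len__(self):
        return len(self.memory)
```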