This project is part of Udacity's Deep Reinforcement Learning Nanodegree: Project 2, Continuous Control. The model was trained on a 2017 MacBook Air with 8 GB RAM and an Intel Core i5 processor.
The project has two versions of the environment: one with a single agent and one with twenty (20) agents. Each agent is tasked with following a green ball. If the agent successfully keeps up with its ball, the ball lights up (becomes opaque and turns light green); otherwise it remains translucent with a dark green color. The environment is part of Unity ML-Agents. An agent receives a reward of +0.1 for each step on which it successfully follows its ball. There are no negative rewards; if the agent fails to follow its ball on a given step, it simply receives a reward of 0.
The state space has 33 dimensions corresponding to the position, velocity, and angular velocity of the agent. The action is a 4-dimensional continuous vector corresponding to the torques applied to the agent's two joints, with each entry in the range [-1, 1].
The agent's task is episodic and is considered solved when the agent achieves an average score of at least +30 over 100 consecutive episodes.
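For reference, a minimal interaction loop with the environment looks roughly like the sketch below. It assumes the `unityagents` package and a downloaded `Reacher.app` build (adjust the file name for your OS); the random actions are only a stand-in for the trained agent.

```python
# Minimal interaction sketch (assumes unityagents and a local Reacher build).
import numpy as np
from unityagents import UnityEnvironment

env = UnityEnvironment(file_name="Reacher.app")    # adjust path for your OS
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

env_info = env.reset(train_mode=False)[brain_name]
num_agents = len(env_info.agents)                  # 1 or 20 depending on the build
action_size = brain.vector_action_space_size       # 4
scores = np.zeros(num_agents)

while True:
    # Random actions in [-1, 1]; the trained DDPG actor replaces this.
    actions = np.clip(np.random.randn(num_agents, action_size), -1, 1)
    env_info = env.step(actions)[brain_name]
    scores += env_info.rewards                     # +0.1 per agent per on-target step
    if np.any(env_info.local_done):
        break

print("Average score this episode:", scores.mean())
env.close()
```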
For this task I used Deep Deterministic Policy Gradient (DDPG), which is an Actor-Critic method.
The Actor model takes the current 33-dimensional state as input and passes it through two (2) fully connected layers with ReLU activation, followed by an output layer with four (4) nodes activated with tanh, which gives the action to take in the current state.
The Critic model takes as input the current 33-dimensional state and the 4-dimensional action. The state is passed through the first fully connected layer with ReLU activation; the action is then concatenated with the first layer's output and fed into the second fully connected layer, also with ReLU activation. The final layer has a single node with linear activation, which gives the Q-value for the corresponding (state, action) pair.
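A minimal PyTorch sketch of the two networks described above (layer sizes match the 400/300 units detailed in the Model section below; class and attribute names are illustrative, not necessarily those used in the repository):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Maps a 33-dim state to a 4-dim action in [-1, 1]."""
    def __init__(self, state_size=33, action_size=4, fc1=400, fc2=300):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1)
        self.fc2 = nn.Linear(fc1, fc2)
        self.out = nn.Linear(fc2, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.out(x))

class Critic(nn.Module):
    """Maps a (state, action) pair to a single Q-value; the action is
    concatenated after the first hidden layer."""
    def __init__(self, state_size=33, action_size=4, fc1=400, fc2=300):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1)
        self.fc2 = nn.Linear(fc1 + action_size, fc2)  # 400 + 4 = 404 inputs
        self.out = nn.Linear(fc2, 1)

    def forward(self, state, action):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(torch.cat([x, action], dim=1)))
        return self.out(x)
```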
The thing I truly loved is that even though the agent was trained on the environment with twenty (20) agents, it was able to generalize to a single (1) agent. ❤️ This is the power of generalization of a neural network 💪 😎.
- Clone the repository:
user@programmer:~$ git clone https://github.com/frankhart2018/reacher-agent
- Install the requirements:
user@programmer:~$ pip install -r requirements.txt
- Download your OS-specific Unity environment (single agent):
- Linux: click here
- MacOS: click here
- Windows (32 bit): click here
- Windows (64 bit): click here
- Download your OS-specific Unity environment (twenty agents):
- Linux: click here
- MacOS: click here
- Windows (32 bit): click here
- Windows (64 bit): click here
- Update the Reacher app location according to your OS in the mentioned place (see the path example after these instructions).
- Unzip the downloaded environment file
- If you prefer using Jupyter Notebook, launch a Jupyter Notebook instance:
user@programmer:~$ jupyter-notebook
➡️ For re-training the agent use Reacher Agent.ipynb
➡️ For testing twenty agents use Reacher Tester.ipynb
➡️ For testing a single agent use Reacher Tester One Agent.ipynb
- If you would like to run a Python script instead, use:
➡️ For re-training the agent type:
user@programmer:~$ python train.py
➡️ For testing twenty agents use:
user@programmer:~$ python test.py
➡️ For testing a single agent use:
user@programmer:~$ python test-one.py
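The app location referred to above is the `file_name` passed to `UnityEnvironment`. A hedged example (the exact folder and file names depend on the archive you downloaded and where you unzipped it):

```python
from unityagents import UnityEnvironment

# macOS:
env = UnityEnvironment(file_name="Reacher.app")
# Linux (example):          file_name="Reacher_Linux/Reacher.x86_64"
# Windows 64-bit (example): file_name="Reacher_Windows_x86_64/Reacher.exe"
```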
- Unity ML Agents
- PyTorch
- NumPy
- Matplotlib
- Multi-Layer Perceptron.
- Deep Deterministic Policy Gradient (DDPG). To learn more about this algorithm, you can read the original paper by DeepMind: Continuous Control with Deep Reinforcement Learning
The Actor Network has three dense (or fully connected) layers. The first two layers have 400 and 300 nodes respectively and are activated with the ReLU activation function. The final (output) layer has 4 nodes and is activated with tanh. This network takes the 33-dimensional current state as input and outputs a 4-dimensional vector, the action the agent is supposed to take in the current state.
The Critic Network has three dense (or fully connected) layers. The first two hidden layers have 400 and 300 nodes respectively and are activated with the ReLU activation function; the 4-dimensional action is concatenated with the first layer's 400-dimensional output, so the second layer receives 404 inputs. The final (output) layer has a single node with linear activation (no activation at all). This network takes as input the 33-dimensional current state and the 4-dimensional action and outputs a single real number, the Q-value for that (state, action) pair.
Both neural networks were optimized with Adam. The critic is trained with Mean Squared Error (MSE) loss against its bootstrapped target, while the actor is updated to maximize the critic's Q-value estimate.
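A hedged sketch of a single DDPG learning step with these choices (function and variable names are illustrative, not taken from the repository; `gamma` and `tau` correspond to the hyperparameters listed below, and all batch tensors are assumed to have shape `(batch, dim)`):

```python
import torch
import torch.nn.functional as F

def learn(states, actions, rewards, next_states, dones,
          actor, actor_target, critic, critic_target,
          actor_opt, critic_opt, gamma=0.99, tau=1e-3):
    # Critic update: minimise MSE between Q(s, a) and the bootstrapped target.
    with torch.no_grad():
        next_actions = actor_target(next_states)
        q_targets = rewards + gamma * critic_target(next_states, next_actions) * (1 - dones)
    critic_loss = F.mse_loss(critic(states, actions), q_targets)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: maximise Q(s, mu(s)) by minimising its negative.
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft-update the target networks towards the local networks.
    for target, local in ((actor_target, actor), (critic_target, critic)):
        for t_param, l_param in zip(target.parameters(), local.parameters()):
            t_param.data.copy_(tau * l_param.data + (1.0 - tau) * t_param.data)
```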
The following image provides a pictorial representation of the Actor Network model:
The following image provides a pictorial representation of the Critic Network model:
The following image provides the plot of score vs. episode number:
Hyperparameter | Value | Description |
---|---|---|
Buffer size | 100000 | Maximum size of the replay buffer |
Batch size | 128 | Batch size for sampling from replay buffer |
Gamma (γ) | 0.99 | Discount factor for calculating return |
Tau (τ) | 0.001 | Hyperparameter for soft update of target parameters |
Learning Rate Actor | 0.0003 | Learning rate for the actor neural network |
Learning Rate Critic | 0.001 | Learning rate for the critic neural network |
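For illustration, the buffer size and batch size above are typically used in a replay buffer along these lines (a sketch, not necessarily the exact implementation in the repository):

```python
import random
from collections import deque, namedtuple
import numpy as np

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    def __init__(self, buffer_size=100_000, batch_size=128, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest experiences are dropped first
        self.batch_size = batch_size
        random.seed(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Uniformly sample a mini-batch and stack each field into an array.
        batch = random.sample(self.memory, k=self.batch_size)
        return map(np.array, zip(*batch))

    def __len__(self):
        return len(self.memory)
```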