This project demonstrates how two reinforcement learning agents can learn to play tennis cooperatively. A fully trained example is shown below.
- Windows (64 bit)
- Python 3.6
- Unity ML-Agents Toolkit
- PyTorch
- Matplotlib
- Jupyter
The recommended way to install the dependencies is via Anaconda. To create a new Python 3.6 environment, run

```shell
conda create --name myenv python=3.6
```

Activate the environment with

```shell
conda activate myenv
```
Click here for instructions on how to install the Unity ML-Agents Toolkit.
Visit pytorch.org for instructions on installing PyTorch.
Install Matplotlib with

```shell
conda install -c conda-forge matplotlib
```
Jupyter should come installed with Anaconda. If not, click here for instructions on how to install Jupyter.
The project can be run with the provided Jupyter notebooks. Tennis_Observation.ipynb lets you observe fully trained agents in the environment. Tennis_Training.ipynb can be used to train new agents or continue training pre-trained ones. Several pre-trained agents are stored in the savedata folder.
The environment is a tennis court with two rackets and a net. Each racket has its own observation and action space. The observation space consists of 24 continuous variables (8 variables stacked over three consecutive time steps), e.g. the position relative to the net. The action space is continuous and two-dimensional: one action moves the racket towards or away from the net, and one controls jumping. An agent receives a reward of +0.1 when it hits the ball over the net and a reward of -0.01 when it lets the ball drop to the floor. Whenever an agent drops the ball, the episode terminates and the environment resets. The episode score is the maximum accumulated undiscounted reward over both agents. The environment is considered solved when the average score over 100 consecutive episodes is at least +0.5, which corresponds to both agents successfully hitting the ball over the net at least 9 times in a row.
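The scoring and solve criterion described above can be sketched in a few lines. The function names here are hypothetical helpers, not part of the project's code:

```python
def episode_score(rewards_agent_0, rewards_agent_1):
    """Episode score: the maximum of the two agents' undiscounted reward sums."""
    return max(sum(rewards_agent_0), sum(rewards_agent_1))


def is_solved(episode_scores, window=100, target=0.5):
    """The environment counts as solved once the average episode score
    over the last `window` consecutive episodes reaches `target`."""
    if len(episode_scores) < window:
        return False
    recent = episode_scores[-window:]
    return sum(recent) / window >= target
```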
For this project, Deep Deterministic Policy Gradient (DDPG) is used in a multi-agent setting (MADDPG). A detailed description of DDPG and MADDPG can be found in the linked papers.
The model uses an actor-critic approach, realized by fully connected feed-forward neural networks with two hidden layers of 128 units each and ReLU activations. Each agent has an actor network that takes the agent's own observation and outputs a two-dimensional action vector. The observations and actions of both agents are then used as input to a shared critic network. During optimization, gradients are applied to local copies of the networks, which are then blended into the target networks via soft updates. Both agents also share a replay buffer to store experiences and sample from them for training.
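A minimal PyTorch sketch of the network architecture described above (two hidden layers of 128 units, ReLU activations, a Tanh output for the actor, and a shared critic over both agents' observations and actions). The class and constant names are illustrative, not taken from the project's source:

```python
import torch
import torch.nn as nn

OBS_SIZE, ACT_SIZE, HIDDEN = 24, 2, 128  # per-agent sizes from the environment


class Actor(nn.Module):
    """Maps one agent's 24-dim observation to a 2-dim action in [-1, 1]."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_SIZE, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, ACT_SIZE), nn.Tanh(),  # bound actions to [-1, 1]
        )

    def forward(self, obs):
        return self.net(obs)


class Critic(nn.Module):
    """Shared critic: scores the joint observations and actions of both agents."""

    def __init__(self, n_agents=2):
        super().__init__()
        in_size = n_agents * (OBS_SIZE + ACT_SIZE)  # 2 * (24 + 2) = 52 inputs
        self.net = nn.Sequential(
            nn.Linear(in_size, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, 1),  # scalar value estimate
        )

    def forward(self, obs_all, act_all):
        return self.net(torch.cat([obs_all, act_all], dim=-1))
```

Because the critic sees both agents' observations and actions, it can account for the other agent's behavior during training, while each actor only needs its own observation at play time.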
During training, the agents observe states, predict actions, and then observe the resulting rewards and follow-up states. The critic learns to predict the value of each state-action pair, and the actors are trained with gradient descent to maximize this value estimate. After each learning step, the target networks are updated via soft update. The following parameters were used for training.
parameter | value | description |
---|---|---|
BUFFER_SIZE | 100000 | replay buffer size |
BATCH_SIZE | 128 | minibatch size |
GAMMA | 0.99 | discount factor |
TAU | 0.002 | for soft update of target parameters |
LR_ACTOR | 3e-4 | learning rate of the actor |
LR_CRITIC | 1e-4 | learning rate of the critic |
WEIGHT_DECAY | 0 | L2 weight decay (set to 0 so the small rewards are not drowned out) |
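The soft update mentioned above interpolates each target network towards its local copy by a factor of TAU per learning step. A sketch, using the TAU value from the table (the function name is hypothetical):

```python
import torch
import torch.nn as nn

TAU = 0.002  # soft-update rate from the hyperparameter table


def soft_update(local_net: nn.Module, target_net: nn.Module, tau: float = TAU):
    """Blend local parameters into the target network:
    theta_target <- tau * theta_local + (1 - tau) * theta_target."""
    with torch.no_grad():
        for target_param, local_param in zip(target_net.parameters(),
                                             local_net.parameters()):
            target_param.mul_(1.0 - tau).add_(tau * local_param)
```

With a small TAU, the target networks trail the local networks slowly, which stabilizes the critic's bootstrapped value targets.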
Training can be performed on CPU or GPU. The default is CPU; the setting is stored in the `device` variable of `agents.py`.
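A typical way such a `device` setting looks in PyTorch code (this is a generic sketch, not necessarily the exact line in `agents.py`):

```python
import torch

# Select the compute device; CPU is the project's default.
# This variant falls back to CPU automatically when no GPU is available.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
```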
With these settings, the agents should learn to solve the environment in approximately 2200 episodes.