This project demonstrates how two reinforcement learning agents can learn to play tennis cooperatively. A fully trained example is shown below.
- Windows (64 bit)
- Python 3.6
- Unity ML-Agents Toolkit
- PyTorch
- Matplotlib
- Jupyter
The recommended way to install the dependencies is via Anaconda. To create a new Python 3.6 environment, run

```shell
conda create --name myenv python=3.6
```

Activate the environment with

```shell
conda activate myenv
```
Click here for instructions on how to install the Unity ML-Agents Toolkit.
Visit pytorch.org for instructions on installing PyTorch.
Install Matplotlib with

```shell
conda install -c conda-forge matplotlib
```
Jupyter should come installed with Anaconda. If not, click here for instructions on how to install Jupyter.
The project can be run with the provided Jupyter notebooks. Tennis_Observation.ipynb lets you observe fully trained agents in the environment. Tennis_Training.ipynb can be used to train new agents or continue training pre-trained ones. Several pre-trained agents are stored in the savedata folder.
The environment is a tennis court with two rackets and a net. Each racket has its own observation and action space. The observation space consists of 24 continuous variables (8 variables stacked over three consecutive time steps), e.g. the position relative to the net. The action space is continuous and two-dimensional: one action moves the racket towards or away from the net, and one controls jumping. An agent receives a reward of +0.1 when it hits the ball over the net and a reward of -0.01 when it lets the ball drop to the floor. Whenever an agent drops the ball, the episode terminates and the environment resets. The episode score is the maximum accumulated undiscounted reward over both agents. The environment is considered solved when the average score over 100 consecutive episodes is at least +0.5, which corresponds to both agents successfully hitting the ball over the net at least 9 times in a row.
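The scoring and solve criterion described above can be sketched in a few lines. The function names here are hypothetical helpers, not part of the project's code:

```python
def episode_score(rewards_agent_0, rewards_agent_1):
    """Episode score: the maximum of the two agents' undiscounted reward sums."""
    return max(sum(rewards_agent_0), sum(rewards_agent_1))


def is_solved(episode_scores, window=100, target=0.5):
    """The environment counts as solved once the average episode score
    over the last `window` consecutive episodes reaches `target`."""
    if len(episode_scores) < window:
        return False
    recent = episode_scores[-window:]
    return sum(recent) / window >= target
```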
For this project, Deep Deterministic Policy Gradient (DDPG) is used in a multi-agent setting (MADDPG). A detailed description of DDPG and MADDPG can be found in the linked papers.
The model uses an actor-critic approach, realized by fully connected feed-forward neural networks with two hidden layers of 128 units each and ReLU activations. Each agent has an actor network that takes the agent's own observation and outputs a two-dimensional action vector. The observations and actions of both agents are then used as input to a shared critic network. During optimization, gradients are applied to local copies of the networks, which are then blended into the target networks via soft updates. Both agents also share a replay buffer to store experiences and sample from them for training.
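A minimal PyTorch sketch of the network architecture described above (two hidden layers of 128 units, ReLU activations, a Tanh output for the actor, and a shared critic over both agents' observations and actions). The class and constant names are illustrative, not taken from the project's source:

```python
import torch
import torch.nn as nn

OBS_SIZE, ACT_SIZE, HIDDEN = 24, 2, 128  # per-agent sizes from the environment


class Actor(nn.Module):
    """Maps one agent's 24-dim observation to a 2-dim action in [-1, 1]."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_SIZE, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, ACT_SIZE), nn.Tanh(),  # bound actions to [-1, 1]
        )

    def forward(self, obs):
        return self.net(obs)


class Critic(nn.Module):
    """Shared critic: scores the joint observations and actions of both agents."""

    def __init__(self, n_agents=2):
        super().__init__()
        in_size = n_agents * (OBS_SIZE + ACT_SIZE)  # 2 * (24 + 2) = 52 inputs
        self.net = nn.Sequential(
            nn.Linear(in_size, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, 1),  # scalar value estimate
        )

    def forward(self, obs_all, act_all):
        return self.net(torch.cat([obs_all, act_all], dim=-1))
```

Because the critic sees both agents' observations and actions, it can account for the other agent's behavior during training, while each actor only needs its own observation at play time.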
During training, the agents observe states, predict actions, and then observe the resulting rewards and follow-up states. The critic learns to predict the value of each state-action pair, and the actors are trained with gradient descent to maximize this value estimate. After each learning step, the target networks are updated via soft update. The following parameters were used for training.
parameter | value | description |
---|---|---|
BUFFER_SIZE | 100000 | replay buffer size |
BATCH_SIZE | 128 | minibatch size |
GAMMA | 0.99 | discount factor |
TAU | 0.002 | for soft update of target parameters |
LR_ACTOR | 3e-4 | learning rate of the actor |
LR_CRITIC | 1e-4 | learning rate of the critic |
WEIGHT_DECAY | 0 | L2 weight decay (set to 0 so the small rewards are not drowned out) |
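The soft update mentioned above interpolates each target network towards its local copy by a factor of TAU per learning step. A sketch, using the TAU value from the table (the function name is hypothetical):

```python
import torch
import torch.nn as nn

TAU = 0.002  # soft-update rate from the hyperparameter table


def soft_update(local_net: nn.Module, target_net: nn.Module, tau: float = TAU):
    """Blend local parameters into the target network:
    theta_target <- tau * theta_local + (1 - tau) * theta_target."""
    with torch.no_grad():
        for target_param, local_param in zip(target_net.parameters(),
                                             local_net.parameters()):
            target_param.mul_(1.0 - tau).add_(tau * local_param)
```

With a small TAU, the target networks trail the local networks slowly, which stabilizes the critic's bootstrapped value targets.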
Training can be performed on CPU or GPU. The default is CPU; the setting is stored in the `device` variable of `agents.py`.
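A typical way such a `device` setting looks in PyTorch code (this is a generic sketch, not necessarily the exact line in `agents.py`):

```python
import torch

# Select the compute device; CPU is the project's default.
# This variant falls back to CPU automatically when no GPU is available.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
```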
With these settings, the agents should learn to solve the environment in approximately 2200 episodes.