PPO (Proximal Policy Optimization)
For this project, two agents controlling rackets are trained to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1. If an agent lets the ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01. Thus, the goal of each agent is to keep the ball in play.
The task is episodic, and the environment is considered solved when the agents get an average score of +1 (over 100 consecutive episodes, after taking the maximum over both agents). Specifically,
- After each episode, the rewards that each agent received are added up (without discounting) to obtain a score for each agent. This yields 2 (potentially different) scores. The maximum of these 2 scores is then taken.
- This yields a single score for each episode (see the sketch below).
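The following is a minimal sketch of how that per-episode score could be computed and tracked, assuming the undiscounted per-step rewards of one episode are collected into a NumPy array with one column per agent; the variable names are illustrative, not taken from the project code.

```python
from collections import deque
import numpy as np

scores_window = deque(maxlen=100)    # scores of the last 100 episodes

def episode_score(rewards):
    """rewards: array of shape (timesteps, 2), undiscounted per-step rewards."""
    per_agent = rewards.sum(axis=0)   # total reward collected by each agent
    return per_agent.max()            # episode score = max over both agents

# after each episode:
#     scores_window.append(episode_score(rewards))
# the environment counts as solved once np.mean(scores_window) >= 1.0
```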
- Observation space type: continuous
- Observation space size (per agent): 8, corresponding to:
- position and velocity of ball and racket
- Action space type: continuous
- Action space size (per agent): 2 (continuous), corresponding to:
- movement toward net or away from net, and jumping
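Since the agents are trained with PPO, the core of each update is the clipped surrogate objective. Below is a minimal PyTorch sketch of that loss for this continuous-action setting; the clip range of 0.2 and all tensor names are illustrative assumptions, not the project's actual implementation.

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective (returned negated, for gradient descent).

    new_log_probs : log pi_theta(a|s) under the current policy
    old_log_probs : log pi_theta_old(a|s), stored when the actions were sampled
    advantages    : advantage estimates, one per (state, action) pair
    """
    ratio = torch.exp(new_log_probs - old_log_probs)                  # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

Clipping the probability ratio keeps each update close to the policy that generated the data, which is what lets PPO reuse a batch of trajectories for several gradient steps without destabilizing training.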
- conda create --name colabcompet python=3.6
- conda activate colabcompet
- conda install jupyter
- pip install gym (make sure that pip belongs to your environment; check with pip --version)
- conda install pytorch=0.4.0 -c pytorch
- cd C:\Users\YOUR_USERNAME\Documents\GitHub\Collaboration-Competition\python
- pip install .
- install XQuartz from here: https://www.xquartz.org (remember to restart your Mac)
- Download the pre-compiled Unity Environment into the "data" folder:
- Linux: click here
- Mac OSX: click here
- Windows (32-bit): click here
- Windows (64-bit): click here
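Once the dependencies are installed and the environment binary has been unzipped into the "data" folder, a quick sanity check like the sketch below can confirm the setup before opening the notebook. The file name Tennis.exe (Tennis.app or Tennis.x86_64 on macOS/Linux) and the exact path are assumptions about where you unzipped the environment; adjust them for your system.

```python
# Minimal sanity check, assuming the environment was unzipped into ./data
import torch
from unityagents import UnityEnvironment

print("PyTorch version:", torch.__version__)

env = UnityEnvironment(file_name="data/Tennis.exe")   # e.g. "data/Tennis.app" on macOS
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

env_info = env.reset(train_mode=True)[brain_name]
print("Number of agents:", len(env_info.agents))      # expected: 2
print("State size per agent:", env_info.vector_observations.shape[1])
print("Action size per agent:", brain.vector_action_space_size)

env.close()
```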
- Follow the instructions in colabcompet.ipynb to get started!