This is an implementation of Proximal Policy Optimization (PPO) that I wrote as part of my work getting up to speed in deep RL under a grant from the Machine Intelligence Research Institute. I have intentionally left many of my debugging notes in, in the hope that they might be helpful to anyone else attempting a similar project. Also see my postmortem for a breakdown of the (known) errors made during the implementation.
If all prerequisites are installed, you can run the default configuration on Pong with the command below.
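A minimal invocation, assuming you run it from the directory containing `ppo.py`:

```sh
# Train PPO on Pong with the default hyperparameters.
python ppo.py
```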
Here's a graph of performance on Pong with the default configuration: