
Pong AI trained using policy gradient-based reinforcement learning


TensorFlow reinforcement learning Pong agent

A Pong AI trained using policy gradients, implemented using TensorFlow and OpenAI Gym, based on Andrej Karpathy's Deep Reinforcement Learning: Pong from Pixels.

After 7,000 episodes of training, the result looks like:


First, install OpenAI Gym and TensorFlow.

Run without any arguments to train the AI from scratch. Checkpoints are saved periodically (see --checkpoint_every_n_episodes). Run with --load_checkpoint --render to see how an AI trained on ~8,000 episodes plays.

Installing Gym

OpenAI Gym provides an easy-to-use suite of reinforcement learning tasks. To install Gym, you will need a Python environment set up. Python 3.5 or later is recommended. Follow the steps below to install Gym:

  • First, ensure that you have Python installed on your machine. You can download Python from python.org.
  • Once Python is installed, open your terminal or command prompt.
  • Use pip to install gym by running the following command:
pip install gym

Installing TensorFlow

TensorFlow is an open-source machine learning framework developed by Google. Here's how to install TensorFlow on your machine:

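  • Ensure that you have Python installed on your machine. Each TensorFlow release supports a specific range of Python versions (3.6 to 3.9 at the time of writing), so check the TensorFlow install guide for the release you use.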
  • Open your terminal or command prompt.
  • To install TensorFlow, run the following command:
pip install tensorflow

Understanding The Code


  • Imports and Arguments: Necessary libraries are imported and command line arguments are set up to tweak hyperparameters.
  • Constants: Constants for the actions UP and DOWN are defined along with a dictionary to map actions to policy network outputs.
  • prepro Function: Preprocesses the 210x160x3 uint8 game frame into a flattened 80x80 (6,400-element) 1D float vector to reduce complexity.
  • discount_rewards Function: Computes discounted rewards over a reward sequence to prioritize more immediate rewards.
  • Environment Setup: Initializes the Gym environment for Pong and sets up the policy network.
  • Training Loop: A continuous loop that represents the training process, where each iteration corresponds to one game episode.
  • Rendering: If args.render is true, the game gets rendered to the screen.
  • Policy Execution: The policy network predicts the probability of moving UP, a random action is sampled, the action is performed in the environment, and the state, action, reward tuple is recorded.
  • Training Procedure: Every args.batch_size_episodes episodes, the policy network is trained with the collected state, action, reward tuples.
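
The two helper functions described above can be sketched roughly as follows. This is a minimal NumPy sketch modelled on the approach in Andrej Karpathy's post, not necessarily the exact code in this repository: the crop rows (35 to 195), the background colour values (144 and 109), the default gamma of 0.99, and resetting the running sum at every non-zero reward are all assumptions.

```python
import numpy as np


def prepro(frame):
    """Turn a 210x160x3 uint8 Pong frame into a flat float vector of length 6400."""
    frame = frame[35:195]               # crop to the playing field (assumed rows)
    frame = frame[::2, ::2, 0].copy()   # halve the resolution, keep one channel -> 80x80
    frame[frame == 144] = 0             # erase background (assumed colour value)
    frame[frame == 109] = 0             # erase background (assumed colour value)
    frame[frame != 0] = 1               # paddles and ball become 1
    return frame.astype(np.float64).ravel()


def discount_rewards(rewards, gamma=0.99):
    """Compute the discounted return at every timestep of a reward sequence."""
    rewards = np.asarray(rewards, dtype=np.float64)
    discounted = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if rewards[t] != 0:
            running = 0.0               # assumed: a non-zero reward ends a round
        running = running * gamma + rewards[t]
        discounted[t] = running
    return discounted
```

Copying the downsampled frame before masking keeps the function from modifying the frame that the environment returned.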


  • Network Class: The main policy network class which handles TensorFlow session, network architecture, saving/loading checkpoints, and training.
  • Initialization: Sets up a simple two-layer neural network with ReLU activation, and a sigmoid output layer.
  • forward_pass Method: Makes a forward pass through the network to get the probability of moving UP.
  • Train Method: Trains the network using a log loss function that encourages actions that lead to winning and discourages actions that lead to losing. The Adam optimizer is used to minimize the loss.
  • Checkpointing: Functions load_checkpoint and save_checkpoint are available to save and load training progress.
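
To make the forward pass and the loss concrete, here is a rough sketch in plain NumPy rather than TensorFlow, so it is not the repository's Network class. The hidden layer size of 200, the weight initialisation, the action ids 2 and 3 for UP and DOWN, and the helper names policy_loss and sample_action are assumptions made for illustration.

```python
import numpy as np

UP, DOWN = 2, 3                          # assumed Gym action ids for Pong
INPUT_SIZE, HIDDEN_SIZE = 80 * 80, 200   # hidden size is an assumption

rng = np.random.default_rng(0)
params = {
    "W1": rng.standard_normal((INPUT_SIZE, HIDDEN_SIZE)) / np.sqrt(INPUT_SIZE),
    "W2": rng.standard_normal((HIDDEN_SIZE, 1)) / np.sqrt(HIDDEN_SIZE),
}


def forward_pass(observations, params):
    """Return P(UP) for each row of `observations` (shape [n, INPUT_SIZE])."""
    hidden = np.maximum(0.0, observations @ params["W1"])   # ReLU hidden layer
    logits = (hidden @ params["W2"]).ravel()
    return 1.0 / (1.0 + np.exp(-logits))                    # sigmoid output


def sample_action(up_probability, rng):
    """Sample UP with probability `up_probability`, otherwise DOWN."""
    return UP if rng.uniform() < up_probability else DOWN


def policy_loss(up_probability, took_up, discounted_rewards):
    """Log loss of the sampled actions, weighted by the discounted rewards.

    `took_up` is 1.0 where the sampled action was UP and 0.0 where it was DOWN.
    Positive rewards push the policy toward the sampled action, negative
    rewards push it away.
    """
    eps = 1e-8  # avoids log(0)
    log_likelihood = (took_up * np.log(up_probability + eps)
                      + (1.0 - took_up) * np.log(1.0 - up_probability + eps))
    return -np.mean(discounted_rewards * log_likelihood)
```

Minimising this loss with a gradient-based optimizer is what makes rewarded actions more likely and penalised actions less likely.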


  • 'Round': one match, in which one player gains a point
  • 'Episode': a set of rounds that make up one game (usually around 20 or so - I'm not sure what logic the game uses to decide this)
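
The distinction can be seen in the reward sequence of a single episode. The sketch below assumes that the Pong environment returns +1 on the step where the agent scores, -1 where the opponent scores, and 0 otherwise; the helper names are hypothetical and not part of this repository.

```python
import numpy as np


def split_into_rounds(episode_rewards):
    """Split one episode's rewards into rounds; a non-zero reward ends a round."""
    rewards = np.asarray(episode_rewards)
    round_ends = np.flatnonzero(rewards) + 1
    return [chunk for chunk in np.split(rewards, round_ends) if chunk.size > 0]


def round_score(episode_rewards):
    """Return (rounds won, rounds lost) by the agent in one episode."""
    rewards = np.asarray(episode_rewards)
    return int(np.sum(rewards > 0)), int(np.sum(rewards < 0))
```

For example, split_into_rounds([0, 0, 1, 0, -1]) yields two rounds: one the agent won and one it lost.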

Training Time

  • It takes about 500 episodes to see whether the agent is improving or not
  • It takes about 7,000 episodes to get to a stage where the agent is winning half and losing half of the rounds

Changes from Andrej's Code

  • Andrej calculates gradients for each episode, accumulates them over a batch of 10 episodes, and then applies them all in one go. I think this is based on a recommendation in Asynchronous Methods for Deep Reinforcement Learning. It looked like this was going to be a pain to do in TensorFlow, though (see e.g. http://stackoverflow.com/q/37710974), so here we just use a batch size of one episode.
  • Andrej uses RMSProp, but here we use Adam. (RMSProp wouldn't work - the AI would never improve - and I was never able to figure out why.)
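
For reference, the accumulate-then-apply scheme described in the first point might look roughly like this in plain NumPy. It is a sketch of the general idea rather than Andrej's code or anything in this repository: the function names, the hyperparameter defaults, and the convention that the gradients are gradients of a loss (hence the subtraction) are assumptions.

```python
import numpy as np


def make_optimizer_state(params):
    """Create zeroed gradient buffers and RMSProp caches matching `params`."""
    return {
        "grad_buffer": {k: np.zeros_like(v) for k, v in params.items()},
        "rmsprop_cache": {k: np.zeros_like(v) for k, v in params.items()},
    }


def accumulate(state, episode_grads):
    """Add one episode's gradients to the running batch total."""
    for name, grad in episode_grads.items():
        state["grad_buffer"][name] += grad


def apply_rmsprop(params, state, learning_rate=1e-3, decay_rate=0.99, eps=1e-5):
    """Apply the accumulated gradients in one RMSProp step, then clear them."""
    for name in params:
        grad = state["grad_buffer"][name]
        cache = state["rmsprop_cache"][name]
        cache[:] = decay_rate * cache + (1.0 - decay_rate) * grad ** 2
        params[name] -= learning_rate * grad / (np.sqrt(cache) + eps)
        grad[:] = 0.0   # start the next batch from an empty buffer
```

A training loop would call accumulate after every episode and apply_rmsprop once every 10 episodes; with a batch size of one episode, as used here, the two calls simply happen back to back.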

Lessons Learned

When you have a hypothesis that you want to test, think deliberately about what the cheapest way to test it is.

For example, for a while things weren't working, and while debugging I noticed that Andrej's code initialises his RMSProp gradient history with zeros, while TensorFlow initialises with ones. I hypothesised that this was a key factor, and the test I came up with was to compile a custom version of TensorFlow with RMSProp initialised using zeros. It later occurred to me that a much cheaper test would have been to just change Andrej's code to initialise with ones instead.
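
To see why the initial value of the gradient history could plausibly matter, the sketch below compares the very first update under both initialisations. It uses a simplified RMSProp rule and illustrative hyperparameters (both assumptions); TensorFlow's actual implementation differs in details such as where epsilon is applied.

```python
import numpy as np


def first_rmsprop_step(gradient, initial_cache,
                       learning_rate=1e-3, decay_rate=0.99, eps=1e-5):
    """Size of the first parameter update for a given initial gradient history."""
    cache = decay_rate * initial_cache + (1.0 - decay_rate) * gradient ** 2
    return learning_rate * gradient / (np.sqrt(cache) + eps)


gradient = 0.01   # an illustrative small gradient
step_with_zeros = first_rmsprop_step(gradient, initial_cache=0.0)
step_with_ones = first_rmsprop_step(gradient, initial_cache=1.0)

print("history initialised with zeros:", step_with_zeros)
print("history initialised with ones: ", step_with_ones)
```

With a history of zeros, the first update is normalised by the gradient's own magnitude and so is comparatively large; with a history of ones, a small gradient is divided by a value close to one and the update stays small. This only shows that the two initialisations behave differently early on, not that the difference explains the training failure.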

Explicitly acknowledging to yourself when you have a hypothesis you want to test, rather than just randomly trying things out in a state of flow, may help with this.