/chess-deep-rl

Research project: create a chess engine using Deep Reinforcement Learning

Primary LanguageJupyter NotebookGNU General Public License v3.0GPL-3.0

Chess engine with Deep Reinforcement learning

I'm currently rewriting the whole thing in C++, you can check it out here.

You can read my bachelor thesis about this project here.

Installation and user manual

The installation manual and the user manual can both be found under ./documents/

To download my pretrained model, use this link: https://www.mediafire.com/file/75mzcj2aqcs6g6z/model.h5/file

Put the model.h5 file in the models/ folder.

Warning

Keep in mind this model has not been trained very well at all due to lack of compute resources. It's probably better to train your own model, but keep in mind you'd need a lot of compute power. I'm only posting it here for people to try out a model that has gone through a few training pipelines.

How do normal chess engines work?

Normal chess engines work with the minimax algorithm: the engine tries to find the best move by creating a tree of all possible moves to a certain depth, and cutting down paths that lead to bad positions (alpha-beta pruning). It evaluates a position based on which pieces are on the board.

Alpha-Beta pruning in Minimax

Image source: By Jez9999, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=3708424

How does my chess engine work?

This chess engine is based on AlphaZero by Deepmind. It uses a neural network to predict the next best move. The neural network learns by playing against itself for a high amount of games, and using their results to train the network. The newly trained neural network is evaluated against the old network by playing many games against each other, and the best network is kept. This process is repeated for a long time.

Playing one move

The neural network

  • Input layer: 19 8x8 boards of booleans

Input example

  • 20 hidden layers:
    • Convolutional hidden layer
    • 19 residual blocks with skip-connections
  • 2 outputs:
    1. The win probabilities of each move (73 boards of 8x8 floats)
    2. The value of the given board (scalar)

Output example

=> 30+ million parameters

A visual representation of the model can be found in ./models/model.png

Every move, run a high number amount of MCTS simulations. AlphaZero uses an custom version of MCTS.

Normal Monte Carlo Tree Search:

https://en.wikipedia.org/wiki/Monte_Carlo_tree_search

  1. Selection: Traverse the tree randomly until a leaf node is reached.
  2. Expansion: expand the leaf node by creating a child for every possible action
  3. Simulation: 'rollout' the game by randomly choosing moves until the end of the game.
  4. Backpropagation: backpropagate the result of the rollout to the root node.

In chess, normal MCTS would be incredibly inefficient, because the amount of actions every position can have is too high (step 1), and the length of the game can be very long when choosing random moves (step 3).

Monte Carlo Tree Search

Image source: By Rmoss92 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=88889583

AlphaZero's MCTS

AlphaZero uses a different kind of MCTS:

  • step 1 (Selection) is not random, but based on neural network predictions and upper confidence bound
  • step 3 (Simulation) is replaced by the value prediction received by the neural network (Evaluation)

MCTS steps for 1 simulation

Image source: https://sebastianbodenstein.net/post/alphazero/

To run one MCTS simulation:

  1. To traverse the tree, keep selecting the edges with maximum Q+U value
    • Q = mean value of the state over all simulations
    • U = upper confidence bound
    • Do this until a leaf node is reached (= a node which has not been visited/expanded yet)
  2. Expand the leaf node by adding a new edge for every possible action in the state
    • Input the leaf node into the neural network
    • The output:
      1. The probabilities
      2. The value of the state
    • Initialize the new edge's variables with these values:
      • N = 0
      • W = 0
      • Q = 0
      • P = p_a (prior probability for that action)
    • Add nodes (new states) for each action to the tree
  3. Backpropagation
    • From the leaf node, backpropagate to the root node
    • For every edge in the path, update the edge's variables
      • N = N + 1
      • W = W + v, v is the value of the leaf node predicted by the NN in step 2.
      • Q = W / N

After these simulations, the move can be chosen:

  • The move with greatest $N$ (deterministically)
  • According to a distribution (stochastically): $\pi \sim N$

Choose move from tree

Creating a training set

  • To train the network, you need a lot of data
  • You create this data through self-play: letting the AI play against a copy of itself for many games.
  • For every move, store:
    • The state
    • The search probabilities
    • The winner, (added once the game is over)

Training the network

  • Sample a mini-batch from a high amount of positions (see training set)
  • Train the network on the mini-batch

Creating a training set

Trophy icon by Freepik https://www.flaticon.com/authors/freepik

First training session Second training session
First training session Second training session

The first training session went pretty well, but the second didn't seem to train much at all. I believe I would need to generate a lot more data through selfplay to properly train the model.

Multi-processing improvements

It is necessary to create a huge training set of positions by making the current best AI play against itself. To do that, I had the problem that playing multiple games in parallel was not possible because every agent needs access to the network:

Self-play without multiprocessing

To fix this, I created a server-client architecture with Python sockets: the server has access to the neural network, and the client sends predictions to the server. The server then sends the predictions back to the correct client. This is much more scalable and can be dockerized.

Self-play with multiprocessing

With a good system as a server (Ryzen 7 5800H + RTX 3070 Mobile), multiple clients (including clients on the system itself) can be connected to the server.

The result: much faster self-play. The other clients' GPUs do not get used, meaning any system with a good processor can run multiple self-play games in parallel when connected to a server.

System No multiprocessing Multiprocessing (16 processes)
R7 5800H + RTX 3070 50 sims/sec 30 sims/sec each process
i7 7700HQ + GTX 1050 20 sims/sec 15 sims/sec each process

I dockerized this server-client system so it can be deployed on a cluster. You can find the configuration in code/docker-compose.yml, and the Dockerfiles in code/Dockerfile{client,server}. The docker images are also pushed to ghcr.io:

Evaluate the network

To know whether the new network is better than the previous one, let the new network play against the previous best for a high amount of games. Whoever wins the most games, is the new best network.

Use that network to self-play again. Repeat indefinitely.

I tried this with the newest network against a completely random neural network. These are the results after 10 games:

Evaluated these models: Model 1 = models/randommodel.h5, Model 2 = models/model.h5
The results:
Model 1: 0
Model 2: 5
Draws: 5

Sources

Wikipedia articles & Library documentation

AlphaZero & AlphaGo Zero specific articles & papers

Diagrams

Tutorials

Interesting videos

Hits