________ _______ _______ ________ ________ ___ ___ _______ ________ ________
|\ ___ \|\ ___ \ |\ ___ \ |\ __ \ |\ ____\|\ \|\ \|\ ___ \ |\ ____\ |\ ____\
\ \ \_|\ \ \ __/|\ \ __/|\ \ \|\ \ \ \ \___|\ \ \\\ \ \ __/|\ \ \___|_\ \ \___|_
\ \ \ \\ \ \ \_|/_\ \ \_|/_\ \ ____\ \ \ \ \ \ __ \ \ \_|/_\ \_____ \\ \_____ \
\ \ \_\\ \ \ \_|\ \ \ \_|\ \ \ \___| \ \ \____\ \ \ \ \ \ \_|\ \|____|\ \\|____|\ \
\ \_______\ \_______\ \_______\ \__\ \ \_______\ \__\ \__\ \_______\____\_\ \ ____\_\ \
\|_______|\|_______|\|_______|\|__| \|_______|\|__|\|__|\|_______|\_________\\_________\
\|_________\|_________|
December 2021 - February 2022
Authors: A. Bellamine, L-D. Azoulay, N. Berrebi
Reinforcement Learning project for the lecture of E. LE PENNEC in the Data Science Master's program (M2DS) of the Institut Polytechnique de Paris.
This project develops an implementation of the Alpha Zero algorithm for chess.
The implementation is designed to run on a single computer with an Intel i7 7700K CPU and an NVIDIA GTX 1080 Ti GPU.
The main purpose of this development is to train the algorithm against Stockfish and evaluate whether:
- The algorithm can actually improve its performance
- It can beat a beginner chess player after a short training duration (a few days)
This tool is composed of:
- A dedicated Python chess engine (chessBoard)
- A deep CNN which tries to predict the probability distribution of the next move and the associated reward
- A player class which plays chess on the current chess engine board according to its policy. There are currently three players:
  - kStockFish player, a player combining random play and Stockfish play. With this player, the next move is either picked randomly (with probability 1-k) or played by Stockfish (with probability k)
  - deepChessPlayer, a player which plays exclusively according to the neural network policy
  - An MCTS class, which picks moves by back-propagating MCTS simulation results
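As an illustration of the kStockFish behaviour described above, the move choice reduces to a single probabilistic branch. This is a minimal sketch with hypothetical names, not the project's actual API:

```python
import random

def pick_move(legal_moves, stockfish_move, k, rng=random):
    """k-stockFish selection: play the engine move with probability k,
    otherwise pick a legal move uniformly at random (probability 1 - k)."""
    if rng.random() < k:
        return stockfish_move
    return rng.choice(legal_moves)

# With k = 1 the player is pure Stockfish; with k = 0 it is purely random.
moves = ["e2e4", "d2d4", "g1f3"]
print(pick_move(moves, "e2e4", k=1.0))  # always the engine move
```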
To accelerate move calculation, a fallback chessBoard with reduced functionality was created in the chessBoardFast sub-library.
This chessboard is used for selfPlay and MCTS.
It uses the open-source Shallow Blue software, for which we redistribute the compiled Linux binary.
The deepChess tool comes with three Python scripts:
- selfPlay.py
- trainNN.py
- evaluation.py
The selfPlay.py and trainNN.py scripts can be run concurrently.
The selfPlay script produces chess game records generated by algorithm-versus-algorithm play. In the selfPlay algorithm, the recorded player is always player 0. The player that starts the game is picked at random: either player 0 or player 1.
The selfplay can be configured to change:
- Player 0's MCTS first-move policy (FMP)
- Player 0's MCTS next-moves policies (NMP)
- Player 1's move policy
The MCTS is implemented almost as described in A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play.
The additional exploration noise still needs to be added for a full implementation.
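For reference, the child-selection rule used by that paper's MCTS (the PUCT criterion) can be sketched as follows. This is a generic textbook sketch, not the project's exact code; the dictionary layout and `c_puct` value are assumptions:

```python
import math

def puct_select(children, c_puct=1.25):
    """Select the child maximising Q + U, where
    U = c_puct * P * sqrt(sum_b N_b) / (1 + N).
    Each child is a dict with prior P, visit count N and mean value Q."""
    total_visits = sum(ch["N"] for ch in children)

    def score(ch):
        u = c_puct * ch["P"] * math.sqrt(total_visits) / (1 + ch["N"])
        return ch["Q"] + u

    return max(children, key=score)

children = [
    {"move": "e2e4", "P": 0.6, "N": 10, "Q": 0.1},
    {"move": "d2d4", "P": 0.4, "N": 1,  "Q": 0.0},
]
print(puct_select(children)["move"])  # the rarely visited child gets explored
```

The exploration term U shrinks as a child's visit count grows, which is why the low-visit move wins here despite its lower prior and value.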
Multiple selfplay processes can be run concurrently by executing the script several times.
We chose a slightly different approach from D. Silver et al. on this point, due to our limited computational power:
- We used a smaller number of MCTS simulations: 20 instead of 800 (800 requires too much computation)
- The player 1 moves and the player 0 MCTS NMP were produced by a 0.8-stockFish player and a 0.2-stockFish player, respectively. Our aim is to stimulate the discovery of interesting moves and games at the beginning of the training
- In a second phase, we aim to train the algorithm on real selfPlay games: neural network player versus neural network player
The selfPlay parameters can be changed in selfPlay.py:
data_folder = # Folder in which the recorded games will be saved
model_folder = # Folder in which the models are saved; the most recent one is always used
log_folder = # Folder in which to save the selfplay logs, in TensorBoard format
device = # Device on which to load the neural network (cpu or cuda:#)
stockFish_path = # Path to the Stockfish executable
players = {
    0: {
        'k': # k parameter of player 0,
        'keep_history': False
    },
    1: {
        'k': # k parameter of player 1,
        'keep_history': False
    }
}
n_mtcs = # Number of simulations to run in MCTS
n_games = # Number of games to play
The trainNN script triggers neural network training on the recorded games. The loss is implemented as described in Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm (D. Silver et al.).
Each game record is used for training the neural network and is seen a fixed number of times by the network.
This number can be set in the max_epoch variable.
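For clarity, the loss from that paper combines a squared value error and a policy cross-entropy. The sketch below shows the formula only, with hypothetical names; in practice the L2 weight penalty of the paper is usually delegated to the optimizer's weight decay:

```python
import math

def alphazero_loss(z, v, pi, p, eps=1e-12):
    """AlphaZero loss (Silver et al.): (z - v)^2 - pi . log(p),
    where z is the game outcome, v the predicted value,
    pi the MCTS visit distribution and p the predicted policy."""
    value_loss = (z - v) ** 2
    policy_loss = -sum(t * math.log(q + eps) for t, q in zip(pi, p))
    return value_loss + policy_loss

# A perfect prediction gives a loss close to zero.
print(alphazero_loss(1.0, 1.0, [1.0, 0.0], [1.0, 0.0]))
```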
The evaluation script plays games between the algorithm and a reference player. We configured it as a 0.1-stockFish player against a full neural network MCTS player with a single search per move (more searches would have been too slow). It records the performance of each game in a CSV file.
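The per-game CSV record mentioned above could be produced along these lines. This is a minimal sketch; the column names and function are hypothetical, not the project's actual code:

```python
import csv
import os

def record_game(path, game_id, winner, n_moves):
    """Append one evaluation game result to a CSV file,
    writing the header only when the file is new or empty."""
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["game_id", "winner", "n_moves"])
        writer.writerow([game_id, winner, n_moves])
```

Appending one row per game keeps the file valid even if the evaluation run is interrupted midway.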
- Reinforcement Learning: An Introduction (Richard S. Sutton, Andrew G. Barto)
- Original Alpha Zero article: Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm (D. Silver et al.)
- Supplementary material of another description of Alpha Zero by D. Silver et al., for technical implementation details: A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play
- Insights about the CNN architecture: Move Evaluation in Go Using Deep Convolutional Neural Networks (Chris J. Maddison et al.)
- Insights for the move representation in the neural network output: How does Alpha Zero's move encoding work? (ai.stackexchange.com)
- The Stockfish project
- Shallow Blue, an open-source UCI chess engine under the MIT license by Rhys Rustad-Elliott; it can be downloaded from the author's GitHub page: shallowBlue