# AlphaZero-gym

An OpenAI gym environment for chess, with observations and actions represented in the AlphaZero style.

This is a modification of my gym-chess module. This implementation represents observations and actions using the feature-plane method described in the AlphaZero paper, *Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm*.
## Installation
This requires the python-chess chess library. Depending on which version you have installed, the method `chess.Board.is_repetition` may or may not be included in your copy of the package. If it is not, the source code of this method is included in the sole Python file in this repo.
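For example, you can check at runtime whether your installed version provides it:

```python
import chess

board = chess.Board()
# Older releases of python-chess may lack Board.is_repetition
if hasattr(chess.Board, "is_repetition"):
    print(board.is_repetition(3))  # True if the current position has occurred 3 times
else:
    print("is_repetition is unavailable; use the fallback bundled in this repo")
```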
List of required packages:

- `python-chess`: `pip install chess`
- `cairosvg`: `pip install cairosvg`
- `PIL`: `pip install Pillow`
- `gym`: `pip install 'gym[all]'`
## Usage
The environment works like any other `gym.Env` environment. Some basic functions:
- `env.reset()`
- `env.step(P)`, where `P` is the policy, in the form of a probability distribution over actions represented as a matrix of shape (8, 8, 73), according to the AlphaZero method (a sketch of this encoding appears after this list). From the paper:

  | Feature | Planes |
  | --- | --- |
  | Queen moves | 56 |
  | Knight moves | 8 |
  | Underpromotions | 9 |
  | Total | 73 |

  > A move in chess may be described in two parts: selecting the piece to move, and then selecting among the legal moves for that piece. We represent the policy π(a|s) by a 8 × 8 × 73 stack of planes encoding a probability distribution over 4,672 possible moves. Each of the 8×8 positions identifies the square from which to “pick up” a piece. The first 56 planes encode possible ‘queen moves’ for any piece: a number of squares [1..7] in which the piece will be moved, along one of eight relative compass directions {N, NE, E, SE, S, SW, W, NW}. The next 8 planes encode possible knight moves for that piece. The final 9 planes encode possible underpromotions for pawn moves or captures in two possible diagonals, to knight, bishop or rook respectively. Other pawn moves or captures from the seventh rank are promoted to a queen.

- `env.observe()`, or as an attribute, `env.state`
- `env.legal_move_mask()`: masks invalid actions/illegal moves in the provided action policy (a usage sketch follows this list)
- `env.render(mode = '____')`: valid render modes are `'human'`, for visualizing the board in a pygame window, and `'rgb_array'`, for returning the frame as an RGB array
- `env.close()`
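For concreteness, here is a minimal sketch of how a single move could be mapped to an index in the (8, 8, 73) policy tensor. This is not code from this repo: the plane ordering below (directions, then distances, then knight offsets, then underpromotions) is one plausible convention, and the actual ordering used by the environment may differ.

```python
import chess

# Assumed plane layout (one plausible convention; this repo's may differ):
#   planes  0-55: "queen moves" -- 8 compass directions x 7 distances
#   planes 56-63: knight moves
#   planes 64-72: underpromotions -- 3 pieces (N, B, R) x 3 file directions

DIRECTIONS = [(1, 0), (1, 1), (0, 1), (-1, 1),
              (-1, 0), (-1, -1), (0, -1), (1, -1)]  # N, NE, E, SE, S, SW, W, NW as (d_rank, d_file)
KNIGHT_MOVES = [(2, 1), (1, 2), (-1, 2), (-2, 1),
                (-2, -1), (-1, -2), (1, -2), (2, -1)]
UNDERPROMOTIONS = [chess.KNIGHT, chess.BISHOP, chess.ROOK]

def move_to_index(move):
    """Return (from_rank, from_file, plane) for a move, from white's perspective."""
    fr, ff = chess.square_rank(move.from_square), chess.square_file(move.from_square)
    tr, tf = chess.square_rank(move.to_square), chess.square_file(move.to_square)
    d_rank, d_file = tr - fr, tf - ff
    if move.promotion and move.promotion != chess.QUEEN:
        # Underpromotion: 3 target pieces x 3 file directions (-1, 0, +1)
        plane = 64 + 3 * UNDERPROMOTIONS.index(move.promotion) + (d_file + 1)
    elif (d_rank, d_file) in KNIGHT_MOVES:
        plane = 56 + KNIGHT_MOVES.index((d_rank, d_file))
    else:
        # Queen-style move: direction index * 7 + (distance - 1)
        distance = max(abs(d_rank), abs(d_file))
        direction = DIRECTIONS.index((d_rank // distance, d_file // distance))
        plane = direction * 7 + (distance - 1)
    return fr, ff, plane

print(move_to_index(chess.Move.from_uci("e2e4")))  # (1, 4, 1): from e2, due north, 2 squares
```

Queen promotions fall through to the "queen move" branch, matching the paper's note that unflagged pawn moves to the last rank promote to a queen.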
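Similarly, here is a hedged sketch of how `env.legal_move_mask()` might be combined with a raw policy before stepping. It assumes the mask is a binary array of the same (8, 8, 73) shape, with ones at legal moves, and that `Chess` is the environment class from this repo's Python file; check the source for the actual return type.

```python
import numpy as np

env = Chess()            # the environment class defined in this repo
env.reset()
p = np.random.random((8, 8, 73))

# Assumption: legal_move_mask() returns a binary (8, 8, 73) array with 1s at legal moves
mask = env.legal_move_mask()
p = p * mask             # zero out illegal moves
p = p / p.sum()          # renormalize into a probability distribution
state, reward, terminal, info = env.step(p)
```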
## Example
The code sample below simulates a game of chess with AlphaZero-style state-action representations:
```python
import numpy as np

# Instantiate the environment (the Chess class is defined in this repo's Python file)
env = Chess()

# Reset the environment
state = env.reset()

# AlphaZero state representation shape for chess:
# 119 = 8 history steps x 14 piece/repetition planes + 7 constant planes
print(state.shape)
>> (8, 8, 119)

# Generate a random policy
p = np.random.random(size=(8, 8, 73))

# Normalize the policy into a probability distribution (sums to 1)
p = p / p.sum()

# Simulate a game
while not env.terminal:
    state, reward, terminal, info = env.step(p)
    env.render()  # defaults to mode='human'

# Printout to demonstrate the output of the env.step() function
print(state, reward, terminal, info)
>> [[[0. 0. 0. ... 1. 1. 9.]
    [0. 1. 0. ... 1. 1. 9.]
    [0. 0. 1. ... 1. 1. 9.]
    ...
    [0. 0. 0. ... 1. 1. 9.]
    [0. 0. 0. ... 1. 1. 9.]
    [0. 0. 0. ... 1. 1. 9.]]] 0 True {'last_move': chess.Move.from_uci('c4a4'), 'turn': False}

# Close the environment and the display window
env.close()
```
## Remaining inaccuracies
This is not yet a complete replica of the paper's method, but only one aspect still fails to conform exactly: the board must be flipped after each move so that every observation is presented from the perspective of the colour whose turn it is to move. This will be implemented shortly.
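In the meantime, a minimal sketch of what such a flip might look like, assuming the first axis of the (8, 8, 119) observation indexes ranks. A full perspective change would also swap the "own piece" and "opponent piece" planes, which depends on the exact plane ordering this environment uses, so it is omitted here:

```python
import numpy as np

def flip_ranks(state):
    """Mirror the board vertically so rank 1 and rank 8 trade places.

    Note: this only reorients the geometry; swapping the colour-specific
    piece planes (so "my pieces" always occupy the same channels) is
    still required and depends on this repo's plane layout.
    """
    return state[::-1, :, :]
```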