
AlphaZero-gym

An OpenAI gym environment for chess with observations and actions represented in AlphaZero-style

(Screenshot: the environment's rendered chess board)

This is a modification of my gym-chess module. This implementation represents observations and actions using the feature-plane representation described in the AlphaZero paper, Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm.

Installation

This environment requires the python-chess library. Depending on the version you have installed, the method chess.Board.is_repetition may or may not be included in your copy of the package; if not, the source code of this method is included in the sole Python file in this repo.
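A quick runtime check will tell you whether your installed version provides the method:

import chess

# Older releases of python-chess may not define Board.is_repetition;
# if it is missing, use the fallback implementation bundled in this repo
if hasattr(chess.Board, "is_repetition"):
    print("chess.Board.is_repetition is available")
else:
    print("chess.Board.is_repetition is missing; use the fallback from this repo")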

List of required packages:

  • gym
  • python-chess
  • numpy
  • pygame

Usage

The environment works like any other gym.Env environment. Some basic functions:

  • env.reset()

  • env.step(P)

    Where P is the policy, in the form of a probability distribution over actions, represented as a matrix of shape (8, 8, 73) according to the AlphaZero method (a minimal sketch of this indexing appears after this list):

    Feature             Planes
    Queen moves         56
    Knight moves        8
    Underpromotions     9
    Total               73

    Table S2: Action representation used by AlphaZero in Chess and Shogi respectively. The policy is represented by a stack of planes encoding a probability distribution over legal moves; planes correspond to the entries in the table.

    A move in chess may be described in two parts: selecting the piece to move, and then selecting among the legal moves for that piece. We represent the policy π(a|s) by a 8 × 8 × 73 stack of planes encoding a probability distribution over 4,672 possible moves. Each of the 8×8 positions identifies the square from which to “pick up” a piece. The first 56 planes encode possible ‘queen moves’ for any piece: a number of squares [1..7] in which the piece will be moved, along one of eight relative compass directions {N, NE, E, SE, S, SW, W, NW}. The next 8 planes encode possible knight moves for that piece. The final 9 planes encode possible underpromotions for pawn moves or captures in two possible diagonals, to knight, bishop or rook respectively. Other pawn moves or captures from the seventh rank are promoted to a queen.

  • env.observe(), or as an attribute, env.state

  • env.legal_move_mask() (function for masking invalid actions/illegal moves in the provided action policy)

  • env.render(mode = '____') (valid render modes are 'human', for visualizing the board in a pygame window, and 'rgb_array' for returning the frame as an RGB array)

  • env.close()
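For concreteness, here is a minimal sketch of how a single 'queen move' maps to an index into the (8, 8, 73) policy tensor. The plane ordering used below (direction-major, with directions ordered {N, NE, E, SE, S, SW, W, NW} and distances 1-7 within each direction) is an assumption for illustration only; the ordering this environment actually uses is defined in its source file.

# Assumed layout for the 56 'queen move' planes:
# plane = direction_index * 7 + (distance - 1)
DIRECTIONS = {(0, 1): 0, (1, 1): 1, (1, 0): 2, (1, -1): 3,
              (0, -1): 4, (-1, -1): 5, (-1, 0): 6, (-1, 1): 7}

def queen_move_plane(from_sq, to_sq):
    # Map a queen-style move between (file, rank) pairs (each 0-7)
    # to a (rank, file, plane) index into the (8, 8, 73) policy tensor
    (ff, fr), (tf, tr) = from_sq, to_sq
    df, dr = tf - ff, tr - fr
    distance = max(abs(df), abs(dr))          # squares travelled, 1..7
    step = (df // distance, dr // distance)   # unit direction vector
    return fr, ff, DIRECTIONS[step] * 7 + (distance - 1)

# e2e4: from (file 4, rank 1) two squares north to (file 4, rank 3)
print(queen_move_plane((4, 1), (4, 3)))
>> (1, 4, 1)

Under this assumed layout, the 8 knight-move planes would occupy indices 56-63 and the 9 underpromotion planes indices 64-72.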

Example

The code sample below simulates a game of chess with AlphaZero-style state-action representations:

# Imports (Chess is the environment class defined in this repo's Python file)
import numpy as np

# Instantiate the environment
env = Chess()

# Reset the environment
state = env.reset()

# AlphaZero state representation shape for Chess
print(state.shape)
>> (8, 8, 119)

# Generate a random policy
p = np.random.random(size = (8, 8, 73))
# Normalize the policy into a probability distribution
p = p / p.sum()
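# Optionally, mask out illegal moves with env.legal_move_mask() before
# stepping and renormalize (the mask's shape and semantics are assumed
# here to match the policy tensor):
# p = p * env.legal_move_mask()
# p = p / p.sum()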

# Simulate a game
while not env.terminal:
  state, reward, terminal, info = env.step(p)
  env.render() # defaults to mode = 'human'
  
# Printout to demonstrate the output of the env.step() function
print(state, reward, terminal, info)

>> [[[0. 0. 0. ... 1. 1. 9.]
     [0. 1. 0. ... 1. 1. 9.]
     [0. 0. 1. ... 1. 1. 9.]
  
    ...
  
     [0. 0. 0. ... 1. 1. 9.]
     [0. 0. 0. ... 1. 1. 9.]
     [0. 0. 0. ... 1. 1. 9.]]] 0 True {'last_move': chess.Move.from_uci('c4a4'), 'turn': False}
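# The 'rgb_array' render mode returns the current frame as an RGB array
# instead of drawing to the pygame window (array dimensions depend on
# the renderer)
frame = env.render(mode = 'rgb_array')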
     
# Close the environment and the display window
env.close()

Remaining inaccuracies

This is not yet a complete replica of the paper's method; only one aspect still does not conform exactly: the board should be flipped each move so that every observation is from the perspective of the color whose turn it is to move.

This will be implemented shortly.
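For reference, one possible building block for this, sketched under the assumption that the observation is derived from the underlying python-chess board (not necessarily how it will be implemented here): chess.Board.mirror() returns a copy of the position flipped vertically with colors swapped, i.e. viewed from the other side.

import chess

def canonical_board(board):
    # View the position from the side to move: White's view unchanged,
    # Black's view flipped vertically with colors swapped
    return board if board.turn == chess.WHITE else board.mirror()

board = chess.Board()
board.push_san("e4")              # now Black to move
print(canonical_board(board))     # position printed from Black's perspective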