AlphaZero for 5x5 Go using the MLX framework. The aim is to train a 5x5 Go agent on an M1 MacBook Pro that can consistently beat me. Although 5x5 Go is significantly scaled down from 19x19 Go, it still has a game tree complexity of
The big idea in AlphaZero is that superhuman performance in perfect-information games can be achieved without human knowledge. In practice, AlphaZero achieves this by learning stronger MCTS policies from game data generated through many iterations of self-play. The actual "learning" happens in a deep neural network $f$ that, given a board state $s$, outputs a probability distribution $P(s, a)$ over moves and a scalar value estimate $v$.
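As a rough sketch of what such a network could look like in MLX, here is a small two-headed policy/value net. The class name, layer sizes, and input encoding (three flattened board planes) are assumptions for illustration, not the architecture used in this repo.

```python
import mlx.core as mx
import mlx.nn as nn

BOARD_SIZE = 5
NUM_ACTIONS = BOARD_SIZE * BOARD_SIZE + 1  # 25 board points + pass

class PolicyValueNet(nn.Module):
    """Illustrative f(s) -> (P(s, .), v); not the network used in this repo."""

    def __init__(self, hidden: int = 128):
        super().__init__()
        # Input: flattened board planes (own stones, opponent stones, player to move).
        self.trunk = nn.Sequential(
            nn.Linear(3 * BOARD_SIZE * BOARD_SIZE, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden, NUM_ACTIONS)
        self.value_head = nn.Linear(hidden, 1)

    def __call__(self, x: mx.array):
        h = self.trunk(x)
        p = mx.softmax(self.policy_head(h), axis=-1)  # P(s, a)
        v = mx.tanh(self.value_head(h))               # v in [-1, 1]
        return p, v
```

A convolutional trunk is more common for Go, but a fully connected net is enough to show the two-headed policy/value output.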
During self-play, at each time step we run a number of MCTS simulations from the current state, guided by the network $f$. In MCTS simulations, we build a search tree whose visit statistics yield a policy stronger than the raw network predictions. At each iteration of the simulation (a minimal code sketch follows the list below):
- If state $s$ is not in the tree, we evaluate it and add it as a node in the tree. We evaluate the state using $f$, which returns $P(s, a), v = f(s)$, where $P(s, a)$ is the probability distribution over all possible actions and $v$ is the estimated value of $s$ ($+v$ or $-v$ depending on the perspective of the player). In the new node $s$ we:
  - set $P(s, a)$ to the prior probabilities returned by $f$
  - initialize mean action values $Q(s, a) = 0$
  - initialize visit counts $N(s, a) = 0$.
- We then select the best move $a$ to play from the current state $s$ using the modified Upper Confidence Bound (UCB) formula $a = \text{argmax}_a \left[ Q(s, a) + \text{cpuct} \times P(s, a) \times \frac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a)} \right]$. The term added to the right of the action value ensures we continue exploring by favoring actions with high prior probability that have been visited relatively rarely.
- We then transition to the next state $s'$ from $s$ by taking action $a$, and recursively run the search over $s'$ until the game terminates.
- When we reach a terminal state $s_{\text{term}}$, we've done so by traversing a trajectory of edges $(s_0, a_0), (s_1, a_1), \ldots$. We now backpropagate the value $v$ ($+1$ or $-1$ depending on which player wins) up the trajectory until we hit our starting state $s$. So for every visited edge $(s_x, a_x)$, we:
  - update its action value $Q(s_x, a_x) = \frac{N(s_x, a_x) \times Q(s_x, a_x) + v}{N(s_x, a_x) + 1}$
  - increment the visit count of the edge $N(s_x, a_x)$ by 1.
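Here is a minimal sketch of one simulation in Python. It assumes a hypothetical `game` object with `terminal_value(s)` (the outcome for the player to move, or `None` if the game isn't over), `legal_actions(s)`, and `next_state(s, a)`, plus a network `f` returning priors indexed by action and a scalar value; none of these names come from this repo, and states are assumed hashable (e.g., board tuples).

```python
import math

CPUCT = 1.0  # exploration constant; this value is illustrative

class Node:
    """Per-state statistics: prior P(s, a), mean value Q(s, a), visit count N(s, a)."""
    def __init__(self, priors):
        self.P = dict(priors)                 # a -> prior probability
        self.Q = {a: 0.0 for a in priors}     # a -> mean action value
        self.N = {a: 0 for a in priors}       # a -> visit count

def simulate(s, tree, game, f):
    """One MCTS simulation; returns the value of s from the current player's view."""
    # Terminal state: return the game outcome directly.
    z = game.terminal_value(s)
    if z is not None:
        return z

    # Expansion: unseen state -> evaluate with f, store priors and zeroed stats.
    if s not in tree:
        priors, v = f(s)                      # P(s, .), v = f(s)
        tree[s] = Node({a: priors[a] for a in game.legal_actions(s)})
        return v

    # Selection: a = argmax_a [ Q(s, a) + cpuct * P(s, a) * sqrt(sum_b N(s, b)) / (1 + N(s, a)) ].
    node = tree[s]
    total_n = sum(node.N.values())
    def puct(a):
        u = CPUCT * node.P[a] * math.sqrt(total_n) / (1 + node.N[a])
        return node.Q[a] + u
    a = max(node.P, key=puct)

    # Recurse into the child; the sign flips because the opponent moves next.
    v = -simulate(game.next_state(s, a), tree, game, f)

    # Backup: update Q as a running mean and increment N along the visited edge.
    node.Q[a] = (node.N[a] * node.Q[a] + v) / (node.N[a] + 1)
    node.N[a] += 1
    return v
```

Calling `simulate` from the root many times builds up the visit counts $N(s, a)$ that define the improved policy described below.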
We can interpret each MCTS simulation as building a game tree using this search control strategy:
- Each simulation selects the action maximizing $Q(s, a) + U(s, a)$, where $U(s, a) = \text{cpuct} \times P(s, a) \times \frac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a)}$.
- $\text{cpuct}$ can be seen as a constant determining the level of exploration.
- Initially, the search control strategy prefers actions with high prior probability and low visit count, but asymptotically prefers actions with high action value, as the short numeric example below illustrates.
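To see the shift from prior-driven to value-driven selection numerically, here is a tiny example (the prior, cpuct, and visit counts are made up for illustration) of how the exploration bonus $U(s, a)$ decays as one action accumulates visits:

```python
import math

# Illustrative numbers only: prior P(s, a) = 0.4, cpuct = 1.0,
# and 1000 visits spread over the node's other actions.
cpuct, prior, other_visits = 1.0, 0.4, 1000

for n_a in (0, 10, 100, 1000):
    total_n = other_visits + n_a
    u = cpuct * prior * math.sqrt(total_n) / (1 + n_a)
    print(f"N(s, a) = {n_a:4d} -> U(s, a) = {u:.3f}")

# U(s, a) falls from ~12.6 to ~0.018, so Q(s, a) eventually dominates selection.
```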
Once we exit the simulation loop, we pick the next move to play at the root according to the improved policy $\pi(a \mid s) = \frac{N(s, a)^{\frac{1}{t}}}{\sum_b N(s, b)^{\frac{1}{t}}}$, where $t$ is a temperature parameter (picking greedily corresponds to $t \to 0$).
Notice that the policy's probability of choosing an action is proportional to its (temperature-scaled) visit count relative to all other actions' visit counts.
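A small sketch of turning root visit counts into $\pi$; the function name and the NumPy representation are illustrative choices, not taken from this repo:

```python
import numpy as np

def visit_count_policy(counts: np.ndarray, t: float = 1.0) -> np.ndarray:
    """pi(a) = N(s, a)^(1/t) / sum_b N(s, b)^(1/t)."""
    if t == 0:
        # Greedy limit: all probability mass on the most-visited action.
        pi = np.zeros_like(counts, dtype=np.float64)
        pi[np.argmax(counts)] = 1.0
        return pi
    scaled = counts.astype(np.float64) ** (1.0 / t)
    return scaled / scaled.sum()

# Example: visit counts over 4 actions after the simulation loop.
print(visit_count_policy(np.array([10, 30, 5, 55])))        # proportional to counts
print(visit_count_policy(np.array([10, 30, 5, 55]), t=0))   # greedy: [0, 0, 0, 1]
```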
After each move's MCTS search during self-play, we record the pair $(s, \pi)$; when the game ends, each record is paired with the final outcome $z \in \{+1, -1\}$ from the perspective of the player to move and added to an experience buffer. We then sample training data uniformly from this experience buffer. The value/policy network is updated so that its value prediction $v$ matches the game outcome $z$ and its policy prediction $P(s, a)$ matches the MCTS policy $\pi$.
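Below is a hedged sketch of what one training step might look like in MLX, assuming the buffer holds `(state, pi, z)` tuples and a model like the `PolicyValueNet` sketch above that returns `(P, v)`; the batch size, equal loss weighting, and helper names are assumptions, not the repo's actual settings.

```python
import random
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

def sample_batch(buffer, batch_size=64):
    """Uniformly sample (state, pi, z) tuples and stack them into arrays."""
    batch = random.sample(buffer, batch_size)
    states = mx.array([b[0] for b in batch])   # flattened board features
    pis = mx.array([b[1] for b in batch])      # MCTS policies (targets for P)
    zs = mx.array([[b[2]] for b in batch])     # game outcomes (targets for v)
    return states, pis, zs

def loss_fn(model, states, pis, zs):
    p, v = model(states)
    # Policy loss: cross-entropy between the MCTS policy pi and the predicted P(s, a).
    policy_loss = -mx.mean(mx.sum(pis * mx.log(p + 1e-8), axis=-1))
    # Value loss: mean squared error between the predicted v and the game outcome z.
    value_loss = mx.mean((v - zs) ** 2)
    return policy_loss + value_loss

def train_step(model, optimizer, buffer):
    states, pis, zs = sample_batch(buffer)
    loss_and_grad = nn.value_and_grad(model, loss_fn)
    loss, grads = loss_and_grad(model, states, pis, zs)
    optimizer.update(model, grads)
    mx.eval(model.parameters(), optimizer.state)
    return loss

# optimizer = optim.Adam(learning_rate=1e-3)
```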
```bash
git clone https://github.com/pythonlearner1025/MiniZero.git
cd MiniZero
pip install -r requirements.txt
python main.py
```
Saving the weights and playing against a human player are not yet implemented. The `main.py` file only trains the AlphaZero agent on a 5x5 board for now.