MuZero - a PyTorch implementation that plays CartPole

A PyTorch implementation of "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model", based on the pseudocode published with the paper. This implementation is intended to stay as close as possible to that pseudocode.

How does this implementation differ from the original paper?

The main difference is that this version samples data uniformly from the replay buffer, instead of using prioritized experience replay.
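As an illustration, here is a minimal sketch of uniform sampling from a replay buffer. The `ReplayBuffer` class below is hypothetical and not taken from this repository; it only shows the idea of drawing every stored game with equal probability.

```python
import random


class ReplayBuffer:
    """Hypothetical buffer that stores finished game histories."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.buffer = []

    def save_game(self, game):
        # Drop the oldest game once the buffer is full.
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
        self.buffer.append(game)

    def sample_batch(self, batch_size):
        # Uniform sampling: every stored game is equally likely to be drawn,
        # unlike prioritized experience replay, which weights games by priority.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```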

MuZero plays CartPole

To train your own MuZero agent to play CartPole, just run muzero_main.py.
To evaluate the average sum of rewards it obtains (i.e. the number of moves it performs before failing or finishing the game, in the case of CartPole), run test.py.
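For reference, here is a minimal sketch of what such an evaluation computes, assuming a classic gym-style environment and a hypothetical agent exposing a `select_action` method (the actual interface of test.py may differ):

```python
def evaluate(agent, env, num_episodes=20):
    """Average the sum of rewards over several episodes.

    For CartPole the per-step reward is 1, so the episode return equals the
    number of moves made before the pole falls (or the episode ends).
    """
    returns = []
    for _ in range(num_episodes):
        observation = env.reset()
        done = False
        episode_return = 0.0
        while not done:
            action = agent.select_action(observation)        # hypothetical agent API
            observation, reward, done, _ = env.step(action)  # classic gym step signature (< 0.26)
            episode_return += reward
        returns.append(episode_return)
    return sum(returns) / len(returns)
```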

Some metrics that can be tracked during training (using TensorBoard); a logging sketch is shown after this list:

mean_reward: mean reward of the last 50 games

policy_loss: cross-entropy between the predicted policy and the MCTS visit-count distribution

value_loss: loss between the predicted value and the value target (the n-step bootstrapped return)

reward_loss: loss between the predicted reward and the observed reward

total_loss: sum of the three losses above, the quantity actually minimized during training
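As an illustration, a minimal sketch of how these metrics could be logged with TensorBoard; the tag names mirror the list above, while the writer setup and the `log_metrics` helper are assumptions rather than the repository's actual code:

```python
from torch.utils.tensorboard import SummaryWriter

# Assumed log directory; not necessarily the one used by muzero_main.py.
writer = SummaryWriter(log_dir="runs/muzero_cartpole")


def log_metrics(step, mean_reward, policy_loss, value_loss, reward_loss, total_loss):
    """Write one scalar per metric; the tags mirror the list above."""
    writer.add_scalar("mean_reward", mean_reward, step)
    writer.add_scalar("policy_loss", policy_loss, step)
    writer.add_scalar("value_loss", value_loss, step)
    writer.add_scalar("reward_loss", reward_loss, step)
    writer.add_scalar("total_loss", total_loss, step)


# Example call from inside a training loop:
# log_metrics(step=100, mean_reward=120.0, policy_loss=0.7,
#             value_loss=0.3, reward_loss=0.1, total_loss=1.1)
```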

What scores can I expect to get with CartPole?

Getting a score of 200-250+ is very feasible without tweaking parameters.
The problem with CartPole is that, as the agent improves, the replay buffer contains fewer and fewer failed games; using prioritized experience replay could be a solution to this problem.
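For comparison with the uniform sampling used here, a minimal sketch of priority-weighted sampling in the spirit of prioritized experience replay; the choice of priority (for example the absolute value-prediction error per game) is an assumption and is not implemented in this repository:

```python
import numpy as np


def sample_prioritized(games, priorities, batch_size, alpha=1.0):
    """Sample games with probability proportional to priority**alpha.

    `priorities` could be, for example, the absolute value-prediction error of
    each stored game (an assumption; the MuZero paper uses per-position priorities).
    """
    weights = np.asarray(priorities, dtype=np.float64) ** alpha
    probs = weights / weights.sum()
    indices = np.random.choice(len(games), size=batch_size, p=probs, replace=True)
    return [games[i] for i in indices]
```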