General Python implementation of Monte Carlo Tree Search for use with OpenAI Gym environments.
The MCTS Algorithm is based on the one from muzero-general which is forked from here.
This code was part of my Bachelor Thesis:
The source code of the experiments covered by the thesis can be found here.
Python 3.8 is used. Dependencies are mainly numpy and gym. Simply run:
pip install -r requirements.txt
This implementation follows the common agent-environment scheme. The environment is wrapped by the Game class,
which ensures that the game's state can be deep copied. The main Game implementations for use with
OpenAI Gym environments are DiscreteGymGame and ContinuousGymGame.
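To see why a deep-copyable state matters, here is a minimal sketch. It is not the library's actual Game class: ToyEnv and CopyableGame (including its get_copy method) are hypothetical names used only for illustration. The idea is that a planner tries actions on a copy of the game, so the real game is never advanced during the search.

```python
import copy


class ToyEnv:
    """Hypothetical stand-in for a gym environment: a counter that ends at 3."""

    def __init__(self):
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        self.position += action
        done = self.position >= 3
        reward = 1.0 if done else 0.0
        return self.position, reward, done


class CopyableGame:
    """Hypothetical wrapper (not the library's Game class) exposing a deep copy."""

    def __init__(self, env):
        self.env = env

    def reset(self):
        return self.env.reset()

    def step(self, action):
        return self.env.step(action)

    def get_copy(self):
        # A planner simulates on the copy; the real game stays untouched.
        return copy.deepcopy(self)


game = CopyableGame(ToyEnv())
game.reset()
game.step(1)

simulation = game.get_copy()
sim_state, sim_reward, sim_done = simulation.step(2)  # look ahead on the copy

real_state = game.env.position  # the real game was not advanced by the look-ahead
```

Environments that hold non-copyable resources (for example an open rendering window) would need extra care before such a copy works; that case is not covered by this sketch.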
A simple example would be:
import gym
from mcts_general.agent import MCTSAgent
from mcts_general.config import MCTSAgentConfig
from mcts_general.game import DiscreteGymGame
# configure agent
config = MCTSAgentConfig()
config.num_simulations = 200
agent = MCTSAgent(config)
# init game
game = DiscreteGymGame(env=gym.make('CartPole-v0'))
state = game.reset()
done = False
reward = 0
# run a trajectory
while not done:
    action = agent.step(game, state, reward, done)
    state, reward, done = game.step(action)
    # game.render()  # uncomment for environment rendering
A continuous environment can be set up similarly. Note that you have to choose mu and sigma for the
Gaussian (normal) distribution from which actions are sampled. A good starting point is usually to set mu
to the middle of your action space and sigma to half its width. For example, for Pendulum-v0 the action
space is [-2, 2], so a good starting choice is mu = 0.0 and sigma = 2.0.
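The rule of thumb above can be sketched in a few lines, assuming a one-dimensional action space given by a lower and an upper bound. The helper gaussian_params and the clipping of samples to the bounds are illustrative assumptions, not part of this package.

```python
import random


def gaussian_params(low, high):
    """Hypothetical helper: mu at the middle, sigma at half the width."""
    mu = (low + high) / 2.0
    sigma = (high - low) / 2.0
    return mu, sigma


# Pendulum-v0 action bounds, as stated above
low, high = -2.0, 2.0
mu, sigma = gaussian_params(low, high)

# Draw a few candidate actions; clipping to the bounds is an assumption here
rng = random.Random(0)
actions = [min(max(rng.gauss(mu, sigma), low), high) for _ in range(5)]
```

With sigma this wide, a noticeable share of samples falls outside the bounds and ends up clipped to the edges, so it may be worth trying smaller values once the basic setup works.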
Example for Continuous Control:
import gym
from mcts_general.agent import ContinuousMCTSAgent
from mcts_general.config import MCTSContinuousAgentConfig
from mcts_general.game import ContinuousGymGame
# configure agent
config = MCTSContinuousAgentConfig()
agent = ContinuousMCTSAgent(config)
# init game
game = ContinuousGymGame(env=gym.make('Pendulum-v0'), mu=0., sigma=2.)
state = game.reset()
done = False
reward = 0
# run a trajectory
while not done:
    action = agent.step(game, state, reward, done)
    state, reward, done = game.step(action)
Please have a look at the game package to see how to use a different time discretization during planning,
and at the config class to see which hyperparameters can be chosen. The repository also includes some
useful gym.Wrapper subclasses. An extensive example of how to use this implementation for MCTS research
can be found in the thesis experiments.
If you have any questions regarding this code or want to contribute, mail me at: