This task benchmarks reinforcement learning for the 9x9 version of the board game Go. The model plays games against itself and uses those games to improve play.
All training data is generated during the selfplay phase of the RL loop.
The only data to be downloaded are the starting checkpoint and the target model. These are downloaded automatically before the training starts.
Launch configuration and system-specific hyperparameters for the NVIDIA DGX-1
single node submission are in the config_DGX1.sh
script.
Steps required to launch single node training on NVIDIA DGX-1:
docker build --pull -t mlperf-nvidia:minigo .
LOGDIR=<path/to/output/dir> CONT=mlperf-nvidia:minigo DGXSYSTEM=DGX1 ./run.sub
Launch configuration and system-specific hyperparameters for the NVIDIA DGX-1
multi node submission are in the config_DGX1_multi.sh
script.
Steps required to launch multi node training on NVIDIA DGX-1:
- Build the docker container and push to a docker registry
docker build --pull -t <docker/registry>/mlperf-nvidia:minigo .
docker push <docker/registry>/mlperf-nvidia:minigo
- Launch the training
source config_DGX1_multi.sh && CONT="<docker/registry>/mlperf-nvidia:minigo" LOGDIR=<path/to/output/dir> DGXSYSTEM=DGX1_multi sbatch -N $DGXNNODES -t $WALLTIME --ntasks-per-node $DGXNGPU run.sub
This benchmark is based on a fork of the minigo project (https://github.com/tensorflow/minigo), which is inspired by the work done by DeepMind with "Mastering the Game of Go with Deep Neural Networks and Tree Search", "Mastering the Game of Go without Human Knowledge", and "Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm". Note that minigo is an independent effort from AlphaGo, and that this fork of minigo is independent from minigo itself.
This benchmark includes both the environment and training for 9x9 go. There are three primary phases performed in each iteration:
- Selfplay: the current best model plays games against itself to produce board positions for training.
- Training: train the neural network on selfplay data from several recent models.
- Model Evaluation: the current best model and the most recently trained model play a series of games to determine whether the newly trained model should replace the current best.
Target evaluation is performed after completing the training (please see the Quality section below for more details).
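To make the structure of an iteration concrete, below is a minimal Python sketch of the loop, assuming a sliding window of selfplay data; run_selfplay, train_network, evaluate_models, and window_size are hypothetical placeholders, not minigo's actual entry points or hyperparameters.

def run_selfplay(model):
    ...  # placeholder: the current best model plays games against itself

def train_network(model, recent_data):
    ...  # placeholder: train the network on selfplay data from recent models

def evaluate_models(candidate, best):
    ...  # placeholder: play a series of games; True if the candidate should be promoted

def rl_loop(best_model, num_iterations, window_size=5):
    checkpoints, recent_data = [], []
    for _ in range(num_iterations):
        # Selfplay: generate board positions for training.
        recent_data.append(run_selfplay(best_model))
        recent_data = recent_data[-window_size:]  # keep data from several recent models
        # Training: produce a candidate model from the recent selfplay data.
        candidate = train_network(best_model, recent_data)
        checkpoints.append(candidate)
        # Model evaluation: promote the candidate if it beats the current best.
        if evaluate_models(candidate, best_model):
            best_model = candidate
    # Target evaluation against the fixed target model runs after training
    # and is not timed (see the Quality section below).
    return best_model, checkpoints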
This task has a non-trivial network structure, including a search tree. A good overview of the structure can be found here: https://medium.com/applied-data-science/alphago-zero-explained-in-one-diagram-365f5abf67e0.
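Inside that search tree, moves are typically selected with a PUCT rule that combines the network's policy prior with accumulated value and visit statistics. The snippet below is an illustrative sketch only; the dictionaries and the c_puct constant are assumptions, not the benchmark's actual search implementation.

import math

def select_move(priors, visit_counts, action_values, c_puct=1.5):
    # Illustrative PUCT selection: pick the move maximizing Q + U, where U
    # favours moves with a high prior probability and a low visit count.
    total_visits = sum(visit_counts.get(m, 0) for m in priors)
    def puct_score(move):
        q = action_values.get(move, 0.0)  # mean value observed for this move
        u = c_puct * priors[move] * math.sqrt(total_visits) / (1 + visit_counts.get(move, 0))
        return q + u
    return max(priors, key=puct_score)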
Network weights are initialized from a fixed checkpoint downloaded before the training starts. The loss function is described in "Mastering the Game of Go with Deep Neural Networks and Tree Search".
We use a MomentumOptimizer to train the primary network.
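As a rough sketch of how that objective can be wired up, the snippet below assumes the AlphaGo-Zero-style loss (policy cross-entropy plus value mean-squared error plus L2 regularization) and TF1-style graph execution; the learning rate, momentum, and regularization scale are illustrative values, not the benchmark's actual hyperparameters.

import tensorflow as tf

def build_train_op(policy_logits, value_output, policy_target, value_target,
                   learning_rate=0.01, momentum=0.9, l2_scale=1e-4):
    # Policy term: cross-entropy between the search visit distribution and the
    # network's move probabilities.
    policy_loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=policy_target,
                                                logits=policy_logits))
    # Value term: mean-squared error between predicted and actual game outcome.
    value_loss = tf.reduce_mean(tf.square(value_output - value_target))
    # L2 regularization over the trainable weights.
    l2_loss = l2_scale * tf.add_n(
        [tf.nn.l2_loss(v) for v in tf.compat.v1.trainable_variables()])
    total_loss = policy_loss + value_loss + l2_loss
    # The primary network is trained with a MomentumOptimizer
    # (assumes tf.compat.v1 graph-mode execution).
    optimizer = tf.compat.v1.train.MomentumOptimizer(learning_rate, momentum)
    return optimizer.minimize(total_loss)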
Quality is measured by the number of games won out of 100 against a fixed target model. The target model is downloaded automatically before the training starts.
The target is to win at least 50 out of 100 games against the target model.
Evaluations are performed after completing the training and are not timed. Checkpoints from every RL loop iteration are evaluated.
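For illustration, the pass/fail check reduces to a simple threshold on games won against the target model; meets_target below is a hypothetical helper, not part of the benchmark harness.

def meets_target(wins, games_played=100, wins_needed=50):
    # Quality target: win at least 50 of the 100 evaluation games played
    # against the fixed target model.
    return games_played == 100 and wins >= wins_needed

# Example: a checkpoint that wins 54 of its 100 evaluation games meets the target.
assert meets_target(54)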