This guide is meant to run KataGo with TensorRT in a container.
It may also work with other versions of KataGo (OpenCL, CUDA, Eigen), but you will have to edit the Dockerfile.
- Install Docker. If you are using an NVIDIA GPU, you also have to install the NVIDIA Container Runtime.
- Clone the repository.
- Build the Docker image, or use `darkness4/katago:cuda11.4.2-cudnn8-ubuntu20.04-trt8.2.0.6-ea`:

  ```sh
  docker build -t katago:tensorrt .
  ```

- Download a KataGo model from KataGo Training and name it `default_model.bin.gz`.
- Create an executable (shell script) to run KataGo:

  ```sh
  #!/bin/sh
  # katago.sh
  docker run --rm --gpus all -i \
    -v "$(pwd)/default_gtp.cfg:/app/default_gtp.cfg:ro" \
    -v "$(pwd)/default_model.bin.gz:/app/default_model.bin.gz" \
    katago:tensorrt \
    "$@"
  ```

- Use `katago.sh` as the main entrypoint:

  ```sh
  chmod +x katago.sh
  ./katago.sh --help
  ```
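If the wrapper runs, you can exercise the engine the same way you would a local `katago` binary. A minimal usage sketch, assuming the image's entrypoint is the KataGo binary and that the config and model are the files mounted at `/app` above:

```sh
# Benchmark the TensorRT backend, then start an interactive GTP session.
./katago.sh benchmark -config /app/default_gtp.cfg -model /app/default_model.bin.gz
./katago.sh gtp -config /app/default_gtp.cfg -model /app/default_model.bin.gz
```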
On the remote machine:
- Install the SSH server and push your SSH key to your user. The authentication must not use a password (see the key-setup sketch after this list).
- Install Docker on the remote machine. If you are using an NVIDIA GPU, you also have to install the NVIDIA Container Runtime.
- Clone the repository.
- Build the Docker image, or use `darkness4/katago:cuda11.4.2-cudnn8-ubuntu20.04-trt8.2.0.6-ea`:

  ```sh
  docker build -t katago:tensorrt .
  ```

- Download a KataGo model from KataGo Training and name it `default_model.bin.gz`.
- Create an executable (shell script) to run KataGo:

  ```sh
  #!/bin/sh
  # /home/remote-user/katago.sh
  docker run --rm --gpus all -i \
    -v "$(pwd)/default_gtp.cfg:/app/default_gtp.cfg:ro" \
    -v "$(pwd)/default_model.bin.gz:/app/default_model.bin.gz" \
    katago:tensorrt \
    "$@"
  ```

- Make it executable and test it:

  ```sh
  chmod +x katago.sh
  ./katago.sh --help
  ```
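The password-less SSH requirement from the first step can be met with a standard key pair. A minimal sketch, assuming the placeholder names `remote-user` and `remote-machine`:

```sh
# Generate an ed25519 key pair locally (skip if you already have one).
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519
# Install the public key on the remote account so key authentication is used.
ssh-copy-id -i ~/.ssh/id_ed25519.pub remote-user@remote-machine
# Verify that login now works without a password prompt.
ssh remote-user@remote-machine true
```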
On your local machine:
- Create an executable (shell script) that forwards the arguments over SSH:

  ```sh
  #!/bin/sh
  # katago-remote.sh
  ssh remote-user@remote-machine /home/remote-user/katago.sh "$@"
  ```

- Make it executable and test it:

  ```sh
  chmod +x katago-remote.sh
  ./katago-remote.sh --help
  ```
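Since the wrapper speaks plain GTP over stdin/stdout, you can drive it directly to check the end-to-end path. A sketch, again assuming the container entrypoint is the KataGo binary and the mounts shown above:

```sh
# Pipe a few GTP commands through SSH and the container; expect name/version replies.
printf 'name\nversion\nquit\n' | ./katago-remote.sh gtp \
  -config /app/default_gtp.cfg -model /app/default_model.bin.gz
```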
On any machine:
Skip this part if you prefer to use the Docker image `darkness4/katago:cuda11.4.2-cudnn8-ubuntu20.04-trt8.2.0.6-ea`.

- Clone the repository.
- Build the Docker image:

  ```sh
  docker build -t user/katago:tensorrt .
  ```

- Push it to a registry:

  ```sh
  docker push user/katago:tensorrt
  ```
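Pushing usually requires authenticating first and tagging the image under your registry account; a sketch assuming Docker Hub and a placeholder `user` namespace:

```sh
# Log in to the registry, retag a locally built image if needed, then push it.
docker login
docker tag katago:tensorrt user/katago:tensorrt
docker push user/katago:tensorrt
```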
On the remote machine:
- Install Slurm, Pyxis, Enroot, and the NVIDIA Container Runtime (a quick Pyxis sanity check follows this list).
- Download a KataGo model from KataGo Training and name it `default_model.bin.gz`. You also have to put the `default_gtp.cfg` file next to it.
- Create an executable (shell script) to run KataGo in a Slurm job:

  ```sh
  #!/bin/sh
  # /home/remote-user/katago.sh
  set -ex

  # Import the container image into a squashfs file once, then reuse it.
  if [ ! -f "$(pwd)/katago.sqsh" ]; then
    srun --ntasks=1 \
      --container-image=user/katago:tensorrt \
      --container-save="$(pwd)/katago.sqsh" \
      true
  fi

  # Wait until the image import is complete.
  tries=1
  while [ "$tries" -lt 10 ]; do
    if file "$(pwd)/katago.sqsh" | grep -q "Squashfs filesystem"; then
      break
    fi
    echo "Image is not complete. Wait a few seconds... ($tries/10)"
    sleep 10
    tries=$((tries+1))
  done
  if [ "$tries" -ge 10 ]; then
    echo "Image import failure. Please try again."
    exit 1
  fi

  srun --gpus=1 \
    --container-image="$(pwd)/katago.sqsh" \
    --container-mounts="$(pwd)/default_gtp.cfg:/app/default_gtp.cfg:ro,$(pwd)/default_model.bin.gz:/app/default_model.bin.gz:ro" \
    /app/katago "$@"
  ```

- Make it executable and test it:

  ```sh
  chmod +x katago.sh
  ./katago.sh --help
  ```
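Before relying on the script, it can help to confirm that Pyxis/Enroot and GPU scheduling work on their own. A minimal sanity-check sketch, assuming the image name used above:

```sh
# Pull the image through Pyxis/Enroot and print the KataGo version from inside the container.
srun --gpus=1 --container-image=user/katago:tensorrt /app/katago version
```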
On your local machine:
- Create an executable (shell script) that forwards the arguments over SSH:

  ```sh
  #!/bin/sh
  # katago-remote.sh
  ssh remote-user@remote-machine /home/remote-user/katago.sh "$@"
  ```

- Make it executable and test it:

  ```sh
  chmod +x katago-remote.sh
  ./katago-remote.sh --help
  ```
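To confirm that the Slurm path actually gets a GPU, a benchmark run through the remote wrapper is a reasonable check; a sketch assuming the mounts set up in the job script:

```sh
# Run the KataGo benchmark on the cluster via SSH, Slurm, and the container image.
./katago-remote.sh benchmark -config /app/default_gtp.cfg -model /app/default_model.bin.gz
```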