A simple PyTorch program that stress-tests a GPU. The default run-time is 5 minutes.
docker buildx build -t waggle/gpu-stress-test:latest --platform linux/amd64,linux/arm64 --push .
docker build -t waggle/gpu-stress-test:latest .
Note: the image is built automatically by CI and uploaded to Docker Hub (https://hub.docker.com/r/waggle/gpu-stress-test/tags)
Default run-time with Docker:
docker run -it --rm --runtime nvidia --network host waggle/gpu-stress-test:latest
Override the run-time to 2 minutes:
docker run -it --rm --runtime nvidia --network host waggle/gpu-stress-test:latest -m 2
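The `-m` option above bounds how long the test runs. As a rough illustration of how such a time-bounded loop could be structured, here is a minimal sketch. It is not the actual source of this image: the function names (`parse_args`, `run_for`, `cpu_workload`) and the long option `--minutes` are hypothetical, and a small CPU computation stands in for the PyTorch GPU workload so that the sketch runs without a GPU. Only the `-m` flag and the 5-minute default come from this README.

```python
import argparse
import time


def parse_args(argv=None):
    """Parse the run-time option. Only -m and its default of 5 are taken from the README."""
    parser = argparse.ArgumentParser(description="Time-bounded stress loop (sketch)")
    # --minutes is a hypothetical long form added for readability.
    parser.add_argument("-m", "--minutes", type=float, default=5.0,
                        help="how long to run the stress loop, in minutes")
    return parser.parse_args(argv)


def run_for(seconds, workload, clock=time.monotonic):
    """Call workload() repeatedly until `seconds` have elapsed; return the call count."""
    deadline = clock() + seconds
    iterations = 0
    while clock() < deadline:
        workload()
        iterations += 1
    return iterations


def cpu_workload(size=64):
    """Stand-in workload (assumption): the real test would run GPU tensor operations."""
    return sum(i * i for i in range(size))
```

For example, `run_for(parse_args(["-m", "2"]).minutes * 60, cpu_workload)` would keep the stand-in workload busy for two minutes. Using a monotonic clock for the deadline keeps the duration unaffected by system clock adjustments.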
Default run-time with Kubernetes:
kubectl run gpu-test --image=waggle/gpu-stress-test:1.0.0 --attach=true
Note: delete the running Kubernetes pod via: kubectl delete pod gpu-test
Default run-time with pluginctl:
pluginctl deploy --name gpu-test2 --selector resource.gpu=true waggle/gpu-stress-test:1.0.0
Override the run-time to 1 minute:
pluginctl deploy --name gpu-test2 --selector resource.gpu=true waggle/gpu-stress-test:1.0.0 -- -m 1
Note: the source code for the Waggle pluginctl tool can be found here: https://github.com/waggle-sensor/edge-scheduler
The cronjob runs the GPU stress test periodically. Create it with:
kubectl create -f cronjob.yaml
Check if it was created:
kubectl get cronjobs
Watch until a job is created:
kubectl get jobs --watch
Delete cronjob:
kubectl delete -f cronjob.yaml