gpu-stress-test

A simple PyTorch program that stress-tests a GPU. By default it runs for 5 minutes.
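As a rough illustration of how a timed stress loop with a `-m` minutes option could be structured, here is a minimal sketch. It is not the code shipped in the image: the function names (`parse_args`, `make_workload`, `run_for`, `main`), the matrix size, the use of `argparse`, and the CPU fallback when PyTorch or CUDA is missing are all assumptions made for the example.

```python
import argparse
import time


def parse_args(argv=None):
    """Parse the command line; -m sets the run-time in minutes (default 5)."""
    parser = argparse.ArgumentParser(description="GPU stress test (sketch)")
    # Only the -m flag appears in this README; accepting fractional
    # minutes is an assumption of this sketch.
    parser.add_argument("-m", dest="minutes", type=float, default=5.0,
                        help="how long to run, in minutes")
    return parser.parse_args(argv)


def make_workload(size=2048):
    """Return a function that performs one unit of work.

    Uses a PyTorch matrix multiply (on the GPU when CUDA is available);
    if PyTorch is not installed, falls back to a trivial CPU loop so the
    sketch still runs. The matrix size is an arbitrary choice.
    """
    try:
        import torch
    except ImportError:
        return lambda: sum(i * i for i in range(10_000))

    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.rand(size, size, device=device)
    b = torch.rand(size, size, device=device)

    def step():
        torch.matmul(a, b)
        if device == "cuda":
            torch.cuda.synchronize()  # wait for the queued kernel to finish

    return step


def run_for(seconds, step):
    """Call step() repeatedly until `seconds` have elapsed; return the count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        step()
        iterations += 1
    return iterations


def main(argv=None):
    args = parse_args(argv)
    count = run_for(args.minutes * 60, make_workload())
    print(f"completed {count} iterations in {args.minutes} minute(s)")
```

Calling `main(["-m", "2"])` would keep the device busy for about two minutes; with no arguments the loop runs for the 5-minute default.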

Build a multi-architecture image with Buildx and push it to Docker Hub:

docker buildx build -t waggle/gpu-stress-test:latest --platform linux/amd64,linux/arm64 --push .

Build locally without pushing:

docker build -t waggle/gpu-stress-test:latest .

Note: CI builds the image automatically and uploads it to Docker Hub (https://hub.docker.com/r/waggle/gpu-stress-test/tags)

Run with Docker using the default run-time:

docker run -it --rm --runtime nvidia --network host waggle/gpu-stress-test:latest

Override the run-time to 2 minutes:

docker run -it --rm --runtime nvidia --network host waggle/gpu-stress-test:latest -m 2

Run in Kubernetes using the default run-time:

kubectl run gpu-test --image=waggle/gpu-stress-test:1.0.0 --attach=true

Note: delete the running Kubernetes pod with: kubectl delete pod gpu-test

Deploy with pluginctl using the default run-time:

pluginctl deploy --name gpu-test2 --selector resource.gpu=true waggle/gpu-stress-test:1.0.0

Override the run-time to 1 minute:

pluginctl deploy --name gpu-test2 --selector resource.gpu=true waggle/gpu-stress-test:1.0.0 -- -m 1

Note: the source code for the Waggle pluginctl tool can be found here: https://github.com/waggle-sensor/edge-scheduler

The CronJob defined in cronjob.yaml runs the GPU stress test periodically. Create it:

kubectl create -f cronjob.yaml

Check that it was created:

kubectl get cronjobs

Watch until a job is created:

kubectl get jobs --watch

Delete the CronJob:

kubectl delete -f cronjob.yaml
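For orientation, the sketch below generates a CronJob manifest of the general kind that a file like cronjob.yaml might contain. It is not the file shipped in this repository: the helper name `build_cronjob`, the resource name, the hourly schedule, the `concurrencyPolicy`, and the 1-minute run-time are all assumptions, and any GPU runtime or resource settings a real cluster may need are left out. The output is JSON, which kubectl accepts in place of YAML.

```python
import json


def build_cronjob(name, image, schedule, minutes=None):
    """Build a batch/v1 CronJob manifest as a plain dict.

    `minutes`, if given, is passed to the container as `-m <minutes>`,
    mirroring the run-time override shown for docker run.
    """
    args = [] if minutes is None else ["-m", str(minutes)]
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,           # standard cron syntax
            "concurrencyPolicy": "Forbid",  # assumption: never overlap runs
            "jobTemplate": {
                "spec": {
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            "containers": [
                                {"name": name, "image": image, "args": args}
                            ],
                        }
                    }
                }
            },
        },
    }


# Hypothetical values: run at the top of every hour for 1 minute.
manifest = build_cronjob(
    "gpu-stress-test", "waggle/gpu-stress-test:1.0.0", "0 * * * *", minutes=1
)
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and passing it to `kubectl create -f` would create the CronJob in the same way as the YAML file above.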