Prometheus GPU Metrics Exporter (PGME)
PGME is a GPU metrics exporter that leverages the nvidia-smi binary. The initial work and key metric-gathering code are derived from:
The nvidia-smi command used to gather metrics:
nvidia-smi --query-gpu=name,index,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv,noheader,nounits
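As a rough illustration of how the CSV output of this command maps onto the metric lines shown later in this README, here is a minimal Python sketch. It is not the exporter's actual code: the function name `to_metric_lines` and the `FIELDS` list are hypothetical, and the sketch assumes every queried field is returned as a plain value, passing unusual values through unchanged.

```python
import csv
import io

# Field order mirrors the --query-gpu list in the command above.
FIELDS = ["name", "index", "temperature.gpu", "utilization.gpu",
          "utilization.memory", "memory.total", "memory.free", "memory.used"]


def to_metric_lines(csv_text):
    """Convert nvidia-smi CSV rows into Prometheus text-format lines.

    Hypothetical helper for illustration only.
    """
    lines = []
    # nvidia-smi separates values with ", ", so skip the leading space.
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        if len(row) != len(FIELDS):
            continue  # skip blank or malformed rows
        values = dict(zip(FIELDS, (cell.strip() for cell in row)))
        # The label combines the GPU name and index, e.g. TITAN X (Pascal)[0].
        label = "{}[{}]".format(values["name"], values["index"])
        for field in FIELDS[2:]:
            # temperature.gpu -> temperature_gpu, memory.total -> memory_total
            metric = field.replace(".", "_")
            lines.append('{}{{gpu="{}"}} {}'.format(metric, label, values[field]))
    return lines


sample = "TITAN X (Pascal), 0, 41, 0, 0, 12189, 12189, 0\n"
for line in to_metric_lines(sample):
    print(line)
```

Each input row yields six lines, one per numeric field, with the GPU name and index folded into a single `gpu` label.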
I have added the following in an attempt to make it a more robust service:
- configuration via environment variables
- Makefile for local build
- liveness HTTP request probe for Kubernetes (k8s)
- graceful shutdown of the HTTP server
- exporter details at http://[[ip of server]]:[[port]]/
- integration with AWS CodeBuild and publishing to Docker Hub or AWS ECR via different buildspec files
Working On:
- Kubernetes service and helm configuration
Building
Local Mac Build (Generates a binary that works on OS X-based systems)
git clone https://github.com/chhibber/pgme.git
cd pgme
make build-mac
Local Linux Build (Generates a binary that works on Linux systems)
git clone https://github.com/chhibber/pgme.git
cd pgme
make build
Local Docker Build (Generates a docker image)
git clone https://github.com/chhibber/pgme.git
cd pgme
make docker-build IMAGE_REPO_NAME=[[ repo_name/app_name ]] IMAGE_TAG=[[ version info ]]
# Example run
nvidia-docker run -p 9101:9101 chhibber/pgme
2018/01/05 21:32:31 Starting the service...
2018/01/05 21:32:31 - PORT set to 9101. If environment variable PORT is not set the default is 9101
2018/01/05 21:32:31 The service is listening on 9101
...
Running the binary directly
- The default port is 9101
You can change the port by setting the PORT environment variable when launching the binary.
> PORT=9101 ./pgme
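The port lookup described above (use PORT if set, otherwise fall back to 9101) can be sketched in Python as follows. This is only an illustration of the behaviour, not the exporter's implementation. The function name `resolve_port` is hypothetical, and the range check is an assumption added for the example.

```python
import os

DEFAULT_PORT = 9101  # default stated in this README


def resolve_port(environ=os.environ):
    """Return the port from PORT, or the default when PORT is unset or empty.

    Hypothetical helper; the validation below is an assumption.
    """
    raw = environ.get("PORT", "").strip()
    if not raw:
        return DEFAULT_PORT
    port = int(raw)  # raises ValueError for non-numeric input
    if not 0 < port < 65536:
        raise ValueError("PORT out of range: {}".format(port))
    return port


print(resolve_port({}))
print(resolve_port({"PORT": "9200"}))
```

Passing the environment as a parameter keeps the lookup easy to exercise without changing the real process environment.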
Running via Docker (nvidia-docker is needed to expose the GPU to the running container)
nvidia-docker run -p 9101:9101 chhibber/pgme:2017.01
Available Metrics at http://localhost:9101/metrics
temperature_gpu{gpu="TITAN X (Pascal)[0]"} 41
utilization_gpu{gpu="TITAN X (Pascal)[0]"} 0
utilization_memory{gpu="TITAN X (Pascal)[0]"} 0
memory_total{gpu="TITAN X (Pascal)[0]"} 12189
memory_free{gpu="TITAN X (Pascal)[0]"} 12189
memory_used{gpu="TITAN X (Pascal)[0]"} 0
temperature_gpu{gpu="TITAN X (Pascal)[1]"} 78
utilization_gpu{gpu="TITAN X (Pascal)[1]"} 95
utilization_memory{gpu="TITAN X (Pascal)[1]"} 59
memory_total{gpu="TITAN X (Pascal)[1]"} 12189
memory_free{gpu="TITAN X (Pascal)[1]"} 1738
memory_used{gpu="TITAN X (Pascal)[1]"} 10451
temperature_gpu{gpu="TITAN X (Pascal)[2]"} 83
utilization_gpu{gpu="TITAN X (Pascal)[2]"} 99
utilization_memory{gpu="TITAN X (Pascal)[2]"} 82
memory_total{gpu="TITAN X (Pascal)[2]"} 12189
memory_free{gpu="TITAN X (Pascal)[2]"} 190
memory_used{gpu="TITAN X (Pascal)[2]"} 11999
temperature_gpu{gpu="TITAN X (Pascal)[3]"} 84
utilization_gpu{gpu="TITAN X (Pascal)[3]"} 97
utilization_memory{gpu="TITAN X (Pascal)[3]"} 76
memory_total{gpu="TITAN X (Pascal)[3]"} 12189
memory_free{gpu="TITAN X (Pascal)[3]"} 536
memory_used{gpu="TITAN X (Pascal)[3]"} 11653
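To show how this output might be consumed outside Prometheus, the following Python sketch parses lines in the format above and derives a per-GPU memory usage percentage. The helper names `parse_metrics` and `memory_used_percent` are hypothetical, and the regular expression assumes each line has exactly one `gpu` label, as in the sample.

```python
import re

# Matches lines such as: memory_used{gpu="TITAN X (Pascal)[1]"} 10451
LINE_RE = re.compile(r'^(\w+)\{gpu="(.*)"\}\s+(\S+)$')


def parse_metrics(text):
    """Group metric values by GPU label: {gpu: {metric_name: value}}."""
    gpus = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # ignore blank lines and anything in another format
        name, gpu, value = match.groups()
        gpus.setdefault(gpu, {})[name] = float(value)
    return gpus


def memory_used_percent(gpus):
    """Compute memory_used / memory_total as a percentage for each GPU."""
    return {
        gpu: 100.0 * values["memory_used"] / values["memory_total"]
        for gpu, values in gpus.items()
        if values.get("memory_total") and "memory_used" in values
    }


sample = """\
memory_total{gpu="TITAN X (Pascal)[1]"} 12189
memory_free{gpu="TITAN X (Pascal)[1]"} 1738
memory_used{gpu="TITAN X (Pascal)[1]"} 10451
"""
for gpu, pct in memory_used_percent(parse_metrics(sample)).items():
    print("{}: {:.1f}% memory used".format(gpu, pct))
```

In a Prometheus setup the same ratio would more commonly be computed with a query over `memory_used` and `memory_total`; the sketch is only meant to make the line format concrete.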
Prometheus example config
- job_name: "gpu_exporter"
  static_configs:
  - targets: ['localhost:9101']