This app is designed to be used along with AlertManager (webhook POST payload) or direct GET requests to start or stop an EC2 instance.
It was inspired by the many dollars wasted on GPU instances that sat idle but were kept running in the hope that we'd need them in the next 30 minutes. We all know those 30 minutes easily become 3 hours, or even 3 days.
It supports 2 APIs:
operateInstance
- The request contains the InstanceId of the EC2 instance to start or stop.
operateHostname
- The request contains a resolvable hostname. The app looks up its IP address and uses it in the private-ip-address filter to get the InstanceId from the AWS EC2 APIs.
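The hostname lookup described above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the helper names `firstIPv4` and `describeArgs` are hypothetical, and it only prints the `aws ec2 describe-instances` command it would run rather than calling AWS.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"strings"
)

// firstIPv4 returns the first IPv4 address in addrs, or an error if none exists.
func firstIPv4(addrs []net.IP) (string, error) {
	for _, ip := range addrs {
		if v4 := ip.To4(); v4 != nil {
			return v4.String(), nil
		}
	}
	return "", fmt.Errorf("no IPv4 address found")
}

// describeArgs (hypothetical helper) builds the arguments for an
// `aws ec2 describe-instances` call that filters by private IP address
// and prints only the matching instance IDs.
func describeArgs(privateIP string) []string {
	return []string{
		"ec2", "describe-instances",
		"--filters", "Name=private-ip-address,Values=" + privateIP,
		"--query", "Reservations[].Instances[].InstanceId",
		"--output", "text",
	}
}

func main() {
	host := "localhost" // placeholder; pass a real hostname as the first argument
	if len(os.Args) > 1 {
		host = os.Args[1]
	}
	addrs, err := net.LookupIP(host)
	if err != nil {
		fmt.Fprintln(os.Stderr, "lookup failed:", err)
		return
	}
	ip, err := firstIPv4(addrs)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	// Print the command instead of executing it, so the sketch has no side effects.
	fmt.Println("aws " + strings.Join(describeArgs(ip), " "))
}
```

The private IP is used because an instance's hostname typically resolves to it inside the VPC; a hostname that resolves to a public address would need a different filter.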
It also supports 2 different implementations of the EC2 client:
native
- Instantiates the EC2 Go SDK client and uses session contexts. Useful when you have an access key and secret key, and an AWS profile configured on the host machine.
cli
- Uses aws ec2 CLI shell commands. Useful when your host machine has roles / permissions attached to start / stop the required EC2 instances.
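As an illustration of the cli approach, the sketch below builds the `aws ec2 start-instances` / `aws ec2 stop-instances` argument list for a given action. The function name `operateArgs` and the instance ID are made up for this example, and the app's real implementation may differ; `--instance-ids` and `--dry-run` are standard AWS CLI options.

```go
package main

import (
	"fmt"
	"strings"
)

// operateArgs (hypothetical helper) builds the arguments for
// `aws ec2 start-instances` or `aws ec2 stop-instances`.
// Any action other than "start" or "stop" is rejected.
func operateArgs(action, instanceID string, dryRun bool) ([]string, error) {
	var sub string
	switch action {
	case "start":
		sub = "start-instances"
	case "stop":
		sub = "stop-instances"
	default:
		return nil, fmt.Errorf("unsupported action %q", action)
	}
	args := []string{"ec2", sub, "--instance-ids", instanceID}
	if dryRun {
		// --dry-run asks AWS to check permissions without changing instance state.
		args = append(args, "--dry-run")
	}
	return args, nil
}

func main() {
	// The instance ID below is a placeholder, not a real instance.
	args, err := operateArgs("stop", "i-0123456789abcdef0", true)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("aws " + strings.Join(args, " "))
}
```

To actually run the command, the argument slice could be passed to `exec.Command("aws", args...)` from `os/exec`; the sketch only prints it.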
additionalPrometheusRulesMap:
  gpu.rules:
    groups:
      - name: gpu.rules
        rules:
          - alert: UnusedGPU
            expr: max_over_time(nvidia_smi_utilization_gpu_ratio{instance="your-hostname:9835"}[5m]) * 100 < 2
            labels:
              severity: critical
              hostname: your-hostname
              action: stop
              client: cli
              dryrun: "false"
            annotations:
              description: Instance {{$labels.instance}} has GPU usage of {{ $value }}%
              summary: Unused GPU
alertmanager:
  enabled: true
  config:
    ...
    route:
      ...
      group_by: ['namespace']
      routes:
        ...
        - receiver: 'ec2operator'
          matchers:
            - alertname =~ "UnusedGPU"
    receivers:
      - name: 'ec2operator'
        webhook_configs:
          - url: 'https://ec2op.your-domain.com/operateHostname'
            send_resolved: false
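With this configuration, AlertManager POSTs a JSON body containing an `alerts` array, and each alert carries the labels set in the rule (hostname, action, client, dryrun). The sketch below shows one way such a payload might be decoded. It is an assumption about how the labels could be consumed, not the app's actual handler; the `request` type and `parseWebhook` function are hypothetical, and the sample body is trimmed to the fields used here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// webhookPayload models only the parts of the AlertManager webhook body
// that this sketch needs; the real payload carries more fields.
type webhookPayload struct {
	Status string  `json:"status"`
	Alerts []alert `json:"alerts"`
}

type alert struct {
	Status string            `json:"status"`
	Labels map[string]string `json:"labels"`
}

// request is a hypothetical type holding what an operator would act on.
type request struct {
	Hostname, Action, Client string
	DryRun                   bool
}

// parseWebhook extracts one request per firing alert from the JSON body.
func parseWebhook(body []byte) ([]request, error) {
	var p webhookPayload
	if err := json.Unmarshal(body, &p); err != nil {
		return nil, err
	}
	var out []request
	for _, a := range p.Alerts {
		if a.Status != "firing" {
			continue // ignore resolved alerts defensively
		}
		out = append(out, request{
			Hostname: a.Labels["hostname"],
			Action:   a.Labels["action"],
			Client:   a.Labels["client"],
			DryRun:   a.Labels["dryrun"] == "true",
		})
	}
	return out, nil
}

// sampleBody is a trimmed, made-up payload using the labels from the rule above.
const sampleBody = `{"status":"firing","alerts":[{"status":"firing","labels":{"alertname":"UnusedGPU","hostname":"your-hostname","action":"stop","client":"cli","dryrun":"false"}}]}`

func main() {
	reqs, err := parseWebhook([]byte(sampleBody))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, r := range reqs {
		fmt.Printf("%+v\n", r)
	}
}
```

Label values always arrive as strings, which is why dryrun is quoted as "false" in the rule and compared as a string here.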
Usage - Local, Docker and Kubernetes
Default properties are picked up from ENV:
HOST
- 0.0.0.0 (IP on which the server listens; you can also change it to localhost for testing purposes)
PORT
- 8080
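The defaults above could be resolved along these lines. This is a sketch under the assumption that an unset or empty variable falls back to the default; `envOr` and `listenAddr` are hypothetical names, and the lookup function is passed in so the logic can be exercised without touching the real environment.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// envOr returns the value getenv reports for key, or def when it is empty.
func envOr(getenv func(string) string, key, def string) string {
	if v := getenv(key); v != "" {
		return v
	}
	return def
}

// listenAddr joins HOST and PORT (with the documented defaults) into a
// host:port listen address.
func listenAddr(getenv func(string) string) string {
	host := envOr(getenv, "HOST", "0.0.0.0")
	port := envOr(getenv, "PORT", "8080")
	return net.JoinHostPort(host, port)
}

func main() {
	// os.Getenv returns "" for unset variables, which triggers the defaults.
	fmt.Println("listening on", listenAddr(os.Getenv))
}
```

For example, running with `HOST=localhost` set would yield the address `localhost:8080`.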
Use Docker to run it locally or on any EC2 instance:
docker run -p 8080:8080 rahul67/ec2-operator:v0.7
Refer to examples/ for a Kubernetes deployment along with a service definition.