kubemonitor is a lightweight utility for monitoring Kubernetes clusters. It is deployed as a stateless service and built with Spring Boot and the fabric8 kubernetes-client library. It is similar to Bitnami's kubewatch, but with a more generic focus.
It was developed for use with GKE and is currently being adjusted to work with any Kubernetes deployment.
- Monitor Kubernetes clusters in GKE by comparing running StatefulSets and Deployments against a JSON config of expected workloads; detect missing workloads or workloads where not all replicas are ready
- Parse Kubernetes Events and notify on unhealthy conditions such as failing liveness/readiness probes, OOMKills, etc.
- Detect new deployments and optionally notify about them if your CI/CD pipeline doesn't do it already
- Schedule automated workload scale down/scale up, useful for weekend maintenance windows
- Interface via HTTP/REST or scheduled jobs, and send Slack notifications
Prometheus can be used to monitor clusters, but it doesn't help when the entire job that creates the cluster has silently failed; it also uses a trend/metric-based approach and can miss things like single liveness probe failures. Kubernetes Events contain important cluster information but only exist for an hour; even the official documentation recommends installing a third-party solution to retain them. I couldn't find a good free one (Stackdriver on GCP takes forever to configure and suffers from the same slowness/trend-based approach as Prometheus), so I built my own.
Currently kubemonitor is limited to GKE, but this will change in the near future. You will need to create a GCP service account and give it the proper permissions to access your cluster. This can be done with the following commands:
gcloud iam service-accounts create your-account-name

gcloud iam service-accounts keys create service-account-private-key.json \
    --iam-account=your-account-name@PROJECT_ID.iam.gserviceaccount.com

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member=serviceAccount:your-account-name@PROJECT_ID.iam.gserviceaccount.com \
    --role=roles/container.developer
The service has been tested with the container.developer
role, but this may give more access than you would like, so adjust
accordingly - read-only access should be enough if you don't use the maintenance scheduling feature.
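For monitoring-only setups, a narrower predefined role may be enough. As a sketch, binding the Kubernetes Engine Viewer role instead (this should cover read-only access, but it has not been verified with kubemonitor):

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member=serviceAccount:your-account-name@PROJECT_ID.iam.gserviceaccount.com \
    --role=roles/container.viewer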
GCP service accounts can access multiple projects; see
this link if you're having trouble
configuring cross-project access.
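Cross-project access boils down to granting the service account a role in the other project. A minimal sketch (OTHER_PROJECT_ID is a placeholder for the project hosting the cluster; the service account itself still lives in PROJECT_ID):

gcloud projects add-iam-policy-binding OTHER_PROJECT_ID \
    --member=serviceAccount:your-account-name@PROJECT_ID.iam.gserviceaccount.com \
    --role=roles/container.developer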
Once you've generated a private key, place it into the root directory as service-account-private-key.json
before
building the Docker container.
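For example, a typical build could look like this (the image name kubemonitor is arbitrary, and this assumes the repository's Dockerfile sits in the root directory next to the key):

docker build -t kubemonitor .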
You will also need to adjust monitoring-config.json
with your clusters.
After this you can call the GET /status?project=your-project&cluster=your-cluster
endpoint to check that everything works.
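For example, with curl (assuming the service is running locally on Spring Boot's default port 8080):

curl "http://localhost:8080/status?project=your-project&cluster=your-cluster"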
Currently the service is completely stateless and has no other dependencies.
Read more about service account authentication in the GCP documentation.
Example configuration:
{
  "clusters": [
    {
      "project": "my-gcp-project",
      "name": "my-gke-cluster",
      "zone": "europe-west4-b",
      "enabled": true,
      "events": {
        "enabled": true,
        "types": {
          "readiness-probe": false,
          "liveness-probe": true,
          "oom-kill": true
        }
      },
      "notifications": {
        "events": {
          "enabled": true,
          "slack-webhook": "my-channel"
        },
        "deployments": {
          "enabled": false
        },
        "maintenance": {
          "enabled": false
        }
      },
      "expected-workloads": {
        "deployments": [
          "example-service-1",
          "example-service-2"
        ],
        "stateful-sets": [
          "example-service-3",
          "example-service-4"
        ]
      }
    }
  ],
  "slack-webhooks": [
    {
      "name": "my-channel",
      "url": "https://hooks.slack.com/services/.."
    }
  ]
}
- clusters - controls which clusters are visible to the tool and can be queried via HTTP for status, monitored for events, used for maintenance requests, etc.
  - cluster.project - GCP project name.
  - cluster.name - GKE cluster name.
  - cluster.zone - GKE cluster zone.
  - cluster.enabled - if this is set to false, you can leave the cluster in the configuration, but it will be ignored entirely.
  - cluster.events - event configuration:
    - events.enabled - if this cluster should be watched for k8s Events. Events are used to detect expected deployments of new code, so turning this off might result in some false positives from monitoring.
    - events.types - configure which types of events are alerted on or ignored.
      - types.readiness-probe - alert on readiness probe failures. Container kills due to probe failure will always be alerted on if events are enabled.
      - types.liveness-probe - alert on liveness probe failures. Container kills due to probe failure will always be alerted on if events are enabled.
      - types.oom-kill - alert on out of memory kill events ("OOMKilling").
  - cluster.notifications - notification configuration:
    - notifications.events - send notifications for k8s Events.
    - notifications.deployments - send notifications for detected deployments.
    - notifications.maintenance - send notifications for maintenance.
  - cluster.expected-workloads - which workloads to "expect" when checking the cluster.
    - expected-workloads.deployments - array of Deployment names.
    - expected-workloads.stateful-sets - array of StatefulSet names.
- slack-webhooks - webhook objects that can be re-used throughout the configuration, to avoid pasting URLs everywhere.
GET /status?project=<project>&cluster=<cluster>
- get cluster status. Cluster must be in monitoring config.
Response:
{
  "project": "my-project",
  "cluster": "my-cluster",
  "state": "OK",
  "issues": [],
  "deploymentsActive": []
}
Possible states:
- OK - everything is fine
- DEGRADED - some things are fine, but others are not
- DOWN - the entire cluster appears to be missing

Other response fields:
- issues - list of strings describing detected issues, such as missing or not fully healthy workloads.
- deploymentsActive - deployments of new code detected via k8s Events - these will cancel out a missing or unhealthy workload.
Example of an unhealthy cluster status:
{
  "project": "my-project",
  "cluster": "my-cluster",
  "state": "DEGRADED",
  "issues": [
    "StatefulSet example-service-3 expected 1 replicas, but only 0 ready"
  ],
  "deploymentsActive": []
}
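The status endpoint is easy to wire into simple scripts. A minimal shell health-check sketch (assumes the service runs on localhost:8080 and that jq is installed; both are assumptions, not part of kubemonitor itself):

STATE=$(curl -s "http://localhost:8080/status?project=my-project&cluster=my-cluster" | jq -r '.state')
if [ "$STATE" != "OK" ]; then
  echo "Cluster is $STATE - check the issues array for details"
fi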
POST /maintenance
- create maintenance task that will scale down and then scale up the selected workloads. Cluster must be in monitoring config.
Payload:
{
  "project": "my-project",
  "cluster": "my-cluster",
  "startTime": "2020-10-23T14:42:11.355982Z",
  "endTime": "2020-10-23T18:00:00Z",
  "workloadsToScale": [
    "example-service-1",
    "example-service-2",
    "example-service-3"
  ]
}
Note that the service is currently stateless, so maintenance tasks are not persisted if the service is restarted.
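A maintenance request could be submitted with curl along these lines (assuming the service runs locally on port 8080; the GET /now endpoint below is handy for generating the timestamps):

curl -X POST "http://localhost:8080/maintenance" \
    -H "Content-Type: application/json" \
    -d '{
          "project": "my-project",
          "cluster": "my-cluster",
          "startTime": "2020-10-23T14:42:11.355982Z",
          "endTime": "2020-10-23T18:00:00Z",
          "workloadsToScale": ["example-service-1"]
        }'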
GET /now
- get the current time as an ISO timestamp (useful for creating a maintenance payload).
GET /config
- return current cluster config.
To build:
./gradlew clean build
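For local development, running the service through Gradle should also work, assuming the project applies the Spring Boot Gradle plugin (typical for Spring Boot services, but treat this as an assumption):

./gradlew bootRun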
Pull requests are welcome and encouraged.
- Strip out GKE/GCP specific code
- Explore converting to a public Helm chart, similar to Bitnami's kubewatch
- More reliable detection of deployments, especially for long-starting services
- Capture other unhealthy events - see this link
- Monitor other workload types - DaemonSets, etc.
- Add stateful storage for maintenance requests, etc.
- Move to k8s informer instead of watch