gke-prober runs inside your GKE cluster, gathering and exporting additional metrics not supplied natively by the GKE product. It targets four types of metrics:
- Metrics that describe the health of your nodes
- Metrics that describe the health of node-hosted GKE components (addons in the kube-system namespace, as annotated by components.gke.io/component-name)
- Metrics collected from probing of the compute environment (e.g. DNS, networking)
- Service level indicators, describing the health of your nodes and node-hosted GKE components in an aggregate fashion
Where relevant, the metrics carry additional labels (dimensions), such as node pool and zone.
Note: this is not an officially supported Google product.
Currently, you need to build your own container image using the provided Dockerfile. We recommend Google Cloud Build as the build tool:
# run from the repository root, where the Dockerfile resides
# IMG is your image name, e.g. gcr.io/your_project/gke-prober:latest
gcloud builds submit --tag $IMG
Manifests in the manifests directory specify the GCP resources (as KCC resources) and Kubernetes resources needed to run gke-prober in GKE clusters with Workload Identity enabled.
Before applying, set your GCP project in all the manifests by hand, or by executing a kpt function at the root of the repository:
mkdir -p ${HOME}/bin
curl -L https://github.com/GoogleContainerTools/kpt/releases/download/v1.0.0-beta.1/kpt_linux_amd64 --output ~/bin/kpt && chmod u+x ${HOME}/bin/kpt
export PATH=${HOME}/bin:${PATH}
kpt fn eval --truncate-output=false --image gcr.io/kpt-fn/apply-setters:v0.2 manifests -- project=my-gcp-project
kpt requires Docker to run. If you don't have Docker installed, execute the following sed command at the root of the repository to set your GCP project ID in all the YAML files:
# replace gcp-project-id with the real project id
find . -type f -name "*.yaml" -print0 | xargs -0 sed -i'' -e 's/my-gcp-project/gcp-project-id/g'
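After either replacement method, it can help to confirm that no manifest still contains the placeholder. The following is a minimal sketch, not part of the repository: the helper name find_placeholder is hypothetical, and it assumes the placeholder string is my-gcp-project as in the commands above.

```shell
# List YAML files under a directory that still contain a placeholder string.
# Prints nothing when every occurrence has been replaced.
# $1: directory to search; $2: placeholder (my-gcp-project in the commands above)
find_placeholder() {
  find "$1" -type f -name "*.yaml" -exec grep -l -- "$2" {} +
}

# Usage, from the root of the repository:
#   find_placeholder . my-gcp-project
```

An empty result suggests the project ID was set in every YAML file; any path printed points to a file that still needs editing.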
Then apply the manifests to your GKE clusters. Manifests in manifests/gcp need to be applied to a KCC-enabled cluster. Manifests in manifests/k8s should be applied to the clusters where you'd like gke-prober installed.
gke-prober runs in two modes. These modes are complementary: make sure you run both.
- Cluster Mode: Runs as a deployment with a single replica, gathering cluster-level data.
- Node Mode: Runs as a daemonset, gathering node-level data.
gke-prober has its own Kubernetes service account in the gke-prober-system namespace, which requires list and watch permissions on the following resources:
- nodes
- pods
- daemonsets
- deployments
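One way to check that these permissions are in place is kubectl auth can-i with impersonation. The sketch below only prints the checks so they can be reviewed first. The function name and the service account name gke-prober are assumptions, not taken from the manifests, so substitute the KSA name you actually deploy. The checks apply to your current namespace unless you add -n or --all-namespaces.

```shell
# Print one `kubectl auth can-i` check per verb/resource pair listed above.
# $1: Kubernetes service account name (hypothetical; use the one in your manifests)
prober_rbac_checks() {
  sa="system:serviceaccount:gke-prober-system:$1"
  for verb in list watch; do
    # daemonsets and deployments live in the apps API group
    for res in nodes pods daemonsets.apps deployments.apps; do
      echo "kubectl auth can-i $verb $res --as=$sa"
    done
  done
}

prober_rbac_checks gke-prober
```

Piping the output to sh runs the checks against the current kubectl context; each one answers yes or no.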
To expose metrics to Google Cloud Monitoring, gke-prober uses the Cloud client libraries for Monitoring.
gke-prober must have an IAM credential to authenticate to Google Cloud Monitoring APIs. Google Application Default Credentials (ADC) is used by the client libraries to automatically find credentials based on where you run gke-prober.
ADC finds credentials as follows:
- In a GKE cluster: ADC uses GKE Workload Identity, as provided through a KSA in the deployment YAML.
- In local development: ADC uses credentials you set up with the Google gcloud CLI. You need to provide credentials to ADC for use by the Cloud Monitoring client libraries. See the Development section.
gke-prober runs under an IAM service account with the following IAM permissions:
- monitoring.metricDescriptors.create
- monitoring.timeSeries.create
- monitoring.metricDescriptors.list (optional, for prom-to-sd sidecar used for golang process metrics)
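If you prefer a custom role limited to exactly these permissions, the sketch below prints a gcloud iam roles create command for review. The function name and the role ID gkeProberMetrics are hypothetical; the repository may provision permissions differently (for example through the KCC manifests).

```shell
# Print a `gcloud iam roles create` command granting only the permissions above.
# $1: project ID; $2: custom role ID (hypothetical, choose your own)
prober_role_command() {
  perms="monitoring.metricDescriptors.create,monitoring.timeSeries.create"
  # Optional permission, needed only for the prom-to-sd sidecar.
  perms="$perms,monitoring.metricDescriptors.list"
  echo "gcloud iam roles create $2 --project=$1 --permissions=$perms"
}

prober_role_command my-gcp-project gkeProberMetrics
```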
gke-prober uses the Kubernetes service account (KSA) within your cluster to authenticate to the cluster API server. It also runs as an IAM service account (GSA) to call Cloud Monitoring APIs, exporting the metrics it collects to the Cloud Monitoring/Stackdriver backend.
If you want to install gke-prober on GKE clusters, GKE Workload Identity is the recommended approach. Workload Identity binds a KSA to a GSA, giving workloads running on GKE (e.g. gke-prober) a secure and manageable way to access Google Cloud services.
If your GKE cluster is configured with Workload Identity, the gke-prober service account needs additional permissions and needs to be linked to the Kubernetes service account. See the documentation page for Workload Identity, and refer to hack/workload-identity.sh (setting PROJECT_ID) for a concrete example.
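As a rough illustration of what that linking involves, the sketch below prints the two standard Workload Identity commands: an IAM policy binding on the GSA and an annotation on the KSA. It is not a copy of hack/workload-identity.sh. The function name and the KSA name gke-prober are assumptions, and the GSA name gke-prober-sa is taken from the dashboard note in this README.

```shell
# Print the two commands that link a KSA to a GSA under Workload Identity.
# $1: project ID; $2: GSA name; $3: namespace; $4: KSA name (hypothetical)
workload_identity_commands() {
  gsa_email="${2}@${1}.iam.gserviceaccount.com"
  member="serviceAccount:${1}.svc.id.goog[${3}/${4}]"
  # Allow the KSA to impersonate the GSA.
  echo "gcloud iam service-accounts add-iam-policy-binding $gsa_email --role=roles/iam.workloadIdentityUser --member='$member'"
  # Tell GKE which GSA the KSA maps to.
  echo "kubectl annotate serviceaccount $4 --namespace=$3 iam.gke.io/gcp-service-account=$gsa_email"
}

workload_identity_commands my-gcp-project gke-prober-sa gke-prober-system gke-prober
```

Printing instead of executing keeps the example safe to run anywhere; review the output and then run the commands with credentials that may edit IAM policy.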
Once running, gke-prober
will emit a variety of metrics.
(work-in-progress!!) The following dashboards are available in dashboards/cloud-monitoring:
- gke-prober-fleet.json visualizes node and addon metrics at a fleet level (multiple GKE clusters).
- gke-prober-cluster.json visualizes node and addon metrics at a cluster level: please filter on your GKE cluster name.
- gke-prober-performance.json visualizes the performance of gke-prober itself (memory/CPU usage, Cloud Monitoring API usage). Note: in the "Consumed API" chart, modify the credential_id filter to match the ID of your gke-prober-sa service account.
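To import the dashboards, run the following command (replace "METRICS-PREFIX" with the prefix being used, for example "gke-prober"). The loop creates one dashboard per file, because --config-from-file accepts a single path:
for f in dashboards/cloud-monitoring/*.json; do gcloud monitoring dashboards create --config-from-file "$f"; done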
The dashboards look roughly like this (slightly out-of-date screenshot):
gke-prober emits the following metrics:
Metric Type | Prober Mode | Implemented | Metric | Source | Description |
---|---|---|---|---|---|
Addon | Cluster | - [x] | cluster/addons_expected | apiserver | Expected count of addons, labelled by addon:$name, controller:{DaemonSet,Deployment} and version:$version |
Node | Cluster | - [x] | cluster/node_available | apiserver | Count of nodes labelled by available:{True,False}, ready:{True,False}, schedulable:{True,False}, done_warming:{True,False}, nodepool:$nodepool, and zone:$zone. Available indicates nodes are healthy, schedulable, and done_warming |
Node | Cluster | - [x] | cluster/node_condition | apiserver | Count of nodes by NodeCondition, labelled by nodepool:$nodepool, zone:$zone, type:$type and status:$status |
Node | Node | - [x] | node/available | apiserver | Node availability, labelled by available:{True,False}, ready:{True,False}, schedulable:{True,False}, done_warming:{True,False}, nodepool:$nodepool, and zone:$zone (value 1) |
Node | Node | - [x] | node/condition | apiserver | Node conditions, labelled by nodepool:$nodepool, zone:$zone, type:$type and status:$status (value 1) |
Addon | Node | - [x] | addon/restart | apiserver | Count of addon restarts on node, labelled by reason:$reason, exit_code:$code, container_name:$container, addon:$name, controller:{DaemonSet,Deployment}, version:$version, nodepool:$nodepool, and zone:$zone |
Addon | Node | - [x] | addon/control_plane_available | apiserver/probe | Addon control plane availability (if all addons on node are scheduled, running, and haven't restarted), labelled by available:{True,False}, nodepool:$nodepool, and zone:$zone (value 1) |
Addon | Node | - [x] | addon/available | apiserver | Count of addon pods on node, labelled by addon:$name, controller:{DaemonSet,Deployment}, version:$version, nodepool:$nodepool, zone:$zone, available:{True,False}, node_available:{True,False}, running:{True,False}, and stable:{True,False} |
Addon | Node | - [] partial* | addon/available | probe | Addon probes will augment this metric with the label healthy:{True,False,Unknown,Error} |
Addon | Node | - [] partial** | addon/addon/* | probe | Addon probes will emit addon-specific metrics, labelled by addon:$name, controller:{DaemonSet,Deployment} and version:$version |
Probe | Node | - [] partial*** | probe/probe/* | probe | Probes will emit metrics gathered by probing the compute environment |
* Health probes implemented for gke-metadata-server
** Metrics emitted for ...
*** Probes for dns-lookup and http-get emit a metric called request_latency_microseconds
gke-prober will attempt to export metrics that can support the following SLIs:
- Node Availability (k8s_node: node/available)
- Addon Control Plane Availability (k8s_node: ``)
In Cloud Monitoring, metrics emitted by gke-prober will be visible under:
- the k8s_cluster monitored resource when running in cluster mode
- the k8s_node monitored resource when running in node mode
Addon probes will run from node mode, and probe the local pods representing the addon.
To list the metric descriptors that gke-prober creates, run:
go run cmd/listmd/main.go -project my-gcp-project
If you need to delete those metric descriptors, you can pass -delete to the above command.
Note that you need to authenticate to Google Cloud, either by using a GCP service account or your user credentials.
As some addons run in the node's network namespace and expose probe-able endpoints on localhost, the prober can only probe those addons when running with hostNetwork:true. Examples of those addons include the node-problem-detector and gke-metrics-agent.
No GKE components requiring hostNetwork mode have been implemented as of July 2021.
Note: most GKE addons in this category only export metrics endpoints which may not be a useful (or stable) proxy for health.
To build, export IMG=gcr.io/your_project/gke-prober:latest and run make docker-build docker-push. Use that IMG in the deployment manifests.
For local development, to expose metrics to Google Cloud Monitoring, you need to provide your user credentials to ADC using the gcloud CLI:
gcloud auth application-default login
After this, gke-prober will use the permissions of your user account, so make sure you have the requisite permissions. See Local Development Environment.
Then, to launch gke-prober, run this:
go run cmd/localdev/main.go -project=$PROJECT_ID -location=$LOCATION -cluster=$CLUSTER
Process metrics are exported on :8080/metrics
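To inspect those process metrics during local development, something like the following may help. It is a sketch under two assumptions not stated in this README: that the endpoint serves Prometheus text format, and that the samples use the standard Go client prefixes go_ and process_. The function name is hypothetical.

```shell
# Keep only Go runtime and process samples from Prometheus-format text on stdin.
# Assumes the standard client metric prefixes go_ and process_.
filter_process_metrics() {
  grep -E '^(go|process)_'
}

# Usage while gke-prober is running locally:
#   curl -s localhost:8080/metrics | filter_process_metrics
```

Comment lines (HELP and TYPE) are dropped because they start with #, which leaves one line per sample.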