gardener/machine-controller-manager

Contribute to `Gardener Node Agent` to expose metrics for better visibility into node joining timeouts.

ashwani2k opened this issue · 0 comments

How to categorize this issue?

/area monitoring
/kind enhancement
/priority 3

What would you like to be added:
With the introduction of gardener-node-agent, a controller-runtime based Go implementation of the cloud-config downloader, it might be possible to get more insight into what happens while a node is processed as it joins, or fails to join, the cluster.
This can help us isolate whether the timeouts happen at the infra layer or whether something goes wrong during node processing within the Kubernetes runtime.

This may require us to expose some metrics from the node-agent, or to enhance its logging so that directed queries against its logs can identify node joining issues.
This will make life easier for MCM operators by letting them identify such issues with more determinism than is possible today.

Why is this needed:
Currently we often struggle to analyze and identify why a node hasn't joined within the default timeout window of 20 minutes.
All we have in the logs is the following:
Machine shoot--<project-name>--<shoot-name>-<worker-pool>-<zone>-865f7-zggql failed to join the cluster in 20m0s minutes.
The current approach to identifying what has gone wrong, if the issue persists, requires you to follow the FAQ #my-machine-is-not-joining-the-cluster-why to begin with, and might also require you to explore the infra and check the status of the respective instance to ascertain whether it was created successfully but failed to join the cluster.

This is currently a time-consuming task that expects a fair knowledge of MCM internals to ascertain the root cause.