Cluster becomes unstable after 'a few days'
Hey,
Situation
After several days of running a cluster set up with hetzner-k3s, it becomes unstable: request processing times grow very large, pods are restarted constantly, errors like context deadline exceeded show up, and outgoing requests seem to fail, apparently because cilium-envoy is failing as well.
The cluster is set up with:
- Cilium as CNI and Ingress
- Istio as additional Mesh for exposing two APIs
- Grafana Cloud integration
- Around 30 pods across 10 services with essentially no load (frontend and backend services that are not in use)
I re-created the cluster multiple times with the same result. I initially suspected throttling, since I was using shared vCPUs, for which Hetzner has become more aggressive about limiting sustained usage. However, the exact same issue also happens with dedicated vCPUs.
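To rule out hypervisor throttling I look at CPU steal time on the affected node. This is only a rough sketch of what I run over SSH; nothing here is specific to hetzner-k3s:

# "st" is steal time; consistently high values would indicate the hypervisor is throttling the VM
top -bn1 | grep '%Cpu'
# the last column ("st") shows steal time per sample interval
vmstat 1 5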
Anomalies / Noticed errors
Apologies for the rather random mix of anomalies; since I don't know the root cause, it is hard to differentiate between causes and symptoms:
- context deadline exceeded errors and timeouts
- Multiple pods constantly failing and being restarted
- Failing outgoing requests / timeouts
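For reference, this is roughly how I collect the information above once the instability starts (a sketch; the Cilium check assumes the agent DaemonSet is named cilium in kube-system, which is the standard install, and <failing-pod> / <namespace> are placeholders):

# restart counts and node placement
kubectl get pods -A -o wide
# recent events (probe failures, OOM kills, restarts)
kubectl get events -A --sort-by=.lastTimestamp | tail -50
# last state and exit reason of a failing pod
kubectl describe pod <failing-pod> -n <namespace>
# datapath / connectivity health; on older agents the binary is "cilium" instead of "cilium-dbg"
kubectl -n kube-system exec ds/cilium -- cilium-dbg status --verbose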
Monitoring
CPU usage is very high, but this could just be a symptom of the pods being constantly restarted.
This is the top output on the worker (agent) node.
There is also a lot of swapping going on, but again this could be a symptom rather than the cause.
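These are the commands I use on the node to look at memory pressure and swap (again only a sketch, standard Linux tooling):

# overall memory and swap usage
free -h
# configured swap devices and how much of each is in use
swapon --show
# which processes hold the most memory
ps aux --sort=-%mem | head -15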
Cluster config
hetzner_token: [redacted]
cluster_name: mycluster
kubeconfig_path: "./kubeconfig"
k3s_version: v1.30.3+k3s1
networking:
  ssh:
    port: 22
    use_agent: false
    public_key_path: "mycluster_id_ed25519.pub"
    private_key_path: "mycluster_id_ed25519"
  allowed_networks:
    ssh:
      - 0.0.0.0/0
    api:
      - 0.0.0.0/0
  public_network:
    ipv4: true
    ipv6: true
  private_network:
    enabled: true
    subnet: 10.0.0.0/16
    existing_network_name: ""
  cni:
    enabled: true
    encryption: false
    mode: cilium
    cilium:
      chart_version: 1.16.1
datastore:
  mode: etcd # etcd (default) or external
  external_datastore_endpoint: postgres://....
schedule_workloads_on_masters: false
masters_pool:
  instance_type: cx22
  instance_count: 1
  location: fsn1
worker_node_pools:
  - name: worker-pool
    instance_type: ccx13
    instance_count: 1
    location: fsn1
    autoscaling:
      enabled: true
      min_instances: 0
      max_instances: 1
embedded_registry_mirror:
  enabled: true
api_server_hostname: mycluster.my-host.de
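To double-check that the deployed Cilium actually matches the chart_version above, I look at the agent image and version (a sketch, assuming the default cilium DaemonSet in kube-system):

# image tag of the running agent
kubectl -n kube-system get ds cilium -o jsonpath='{.spec.template.spec.containers[0].image}'
# version reported by the agent itself ("cilium version" on older releases)
kubectl -n kube-system exec ds/cilium -- cilium-dbg version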
Additional
What complicates matters further is that I'm using Grafana Cloud for observability, and during these periods of instability no metrics are forwarded to Grafana Cloud anymore. I assume this is caused by the same issue that produces the failing outgoing requests (i.e. the pushes to Grafana Cloud no longer get through either).
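When that happens I check the logs of the metrics shipper for outbound errors. A sketch below; the pod name and namespace depend entirely on how the Grafana Cloud integration was installed, so treat monitoring and <agent-pod> as placeholders:

# find the metrics agent pod
kubectl -n monitoring get pods
# look for remote_write / connection errors
kubectl -n monitoring logs <agent-pod> --tail=100 | grep -iE 'error|timeout|deadline'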
Weirdly, a similar or identical situation happens with clusters created by the terraform-hcloud-kube-hetzner project as well.
Lastly, there are some similarities to #424: the containers that crashed there also crash for me and exhibit similar errors (e.g. the context deadline exceeded error and failing outgoing requests). As the root cause remained undiscovered there, this may be related or even the same issue.