vitobotta/hetzner-k3s

Cluster becomes unstable after 'a few days'


Hey,

Situation

After several days of running, a cluster set up with hetzner-k3s becomes unstable: request processing times become very large, pods are restarted constantly, errors such as context deadline exceeded appear, and outgoing requests seem to fail because cilium-envoy is failing as well.

The cluster is set up with:

  • Cilium as CNI and Ingress
  • Istio as an additional mesh for exposing two APIs
  • Grafana Cloud integration
  • Around 30 pods across 10 services with essentially no load (FE and BE services that are not in use)

I re-created the cluster multiple times with the same result. I initially suspected throttling, since I was using shared vCPU instances, for which Hetzner has become more aggressive about limiting sustained usage. But the exact same issue also happens with dedicated vCPU instances.

Anomalies / Noticed errors

Apologies for the rather random mix of anomalies; since I don't know the root cause, it is hard to distinguish causes from symptoms:

context deadline exceeded errors and timeouts:
(screenshot attached)

Multiple pods constantly failing and being restarted:

(screenshot attached)

Failing outgoing requests / timeouts:

(screenshot attached)
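
In case it helps with reproduction, these are the kinds of checks that should surface the same restarts and errors. The pod/namespace names are placeholders, and the cilium / cilium-envoy DaemonSet names assume the default kube-system installation that hetzner-k3s sets up, so treat this as a sketch rather than exact commands:

# pods sorted by restart count of their first container
kubectl get pods -A --sort-by='.status.containerStatuses[0].restartCount'

# recent cluster events (crash loops, failed probes, etc.)
kubectl get events -A --sort-by='.lastTimestamp'

# logs of the previous, crashed instance of an affected pod
kubectl logs <pod-name> -n <namespace> --previous

# Cilium agent / Envoy logs
kubectl -n kube-system logs ds/cilium --tail=200
kubectl -n kube-system logs ds/cilium-envoy --tail=200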

Monitoring

CPU usage is very high, yet this could just be a symptom of pods being constantly restarted.
(screenshot attached)

This is the top output on the worker (agent) node:
(screenshot attached)
There is a lot of swapping going on, yet this too could be a symptom rather than the cause.
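
To quantify the swapping and CPU pressure beyond a single top snapshot, something like the following can be run over SSH on the worker plus via kubectl; these are standard tools, nothing specific to hetzner-k3s, and kubectl top relies on metrics-server, which k3s bundles by default:

# memory and swap usage at a glance
free -h

# swap in/out (si/so columns) and CPU steal (st column), sampled every 5 seconds
vmstat 5

# resource usage as reported by metrics-server
kubectl top nodes
kubectl top pods -A --sort-by=memory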

Cluster config

hetzner_token: [redacted]
cluster_name: mycluster
kubeconfig_path: "./kubeconfig"
k3s_version: v1.30.3+k3s1

networking:
  ssh:
    port: 22
    use_agent: false
    public_key_path: "mycluster_id_ed25519.pub"
    private_key_path: "mycluster_id_ed25519"
  allowed_networks:
    ssh:
      - 0.0.0.0/0
    api:
      - 0.0.0.0/0
  public_network:
    ipv4: true
    ipv6: true
  private_network:
    enabled: true
    subnet: 10.0.0.0/16
    existing_network_name: ""
  cni:
    enabled: true
    encryption: false
    mode: cilium
    cilium:
      chart_version: 1.16.1

datastore:
  mode: etcd # etcd (default) or external
  external_datastore_endpoint: postgres://....

schedule_workloads_on_masters: false

masters_pool:
  instance_type: cx22
  instance_count: 1
  location: fsn1

worker_node_pools:
- name: worker-pool
  instance_type: ccx13
  instance_count: 1
  location: fsn1
  autoscaling:
    enabled: true
    min_instances: 0
    max_instances: 1

embedded_registry_mirror:
  enabled: true

api_server_hostname: mycluster.my-host.de
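
One additional check that might be relevant given this config (etcd on a single cx22 master): context deadline exceeded in the control plane is often accompanied by a slow or unhealthy etcd, so the API server's view of etcd and the k3s server logs on the master could be worth a look. The systemd unit name below assumes a standard k3s install, which is what hetzner-k3s uses:

# etcd health as reported by the API server
kubectl get --raw='/readyz/etcd'

# k3s server logs on the master (apiserver, scheduler and embedded etcd all run in this unit)
journalctl -u k3s --since "1 hour ago" | grep -iE 'etcd|deadline|slow'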

Additional

What complicates matters further is that I'm using Grafana Cloud for observability, and while these issues are occurring, no metrics are forwarded to Grafana Cloud anymore. I assume this is related to whatever is causing the failing outgoing requests (i.e. the pushes no longer reach Grafana Cloud either).
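
To rule out an issue on the Grafana Cloud side, checking the logs of whatever ships the metrics (Grafana Agent/Alloy or similar, depending on how the integration was installed) during an unstable period shows whether the pushes themselves are timing out; the namespace and pod name here are placeholders:

# look for push errors in the metrics shipper during the unstable period
kubectl -n <monitoring-namespace> logs <metrics-shipper-pod> --tail=300 | grep -iE 'error|timeout|context deadline'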

Weirdly, a similar or identical situation happens with clusters created by the terraform-hcloud-kube-hetzner project as well.

Lastly, there are some similarities to #424: the containers that crashed there also crashed for me, with similar errors (e.g. the context deadline exceeded errors and failing outgoing requests). Since the root cause was never identified there, this may be related or even the same issue.