digitalocean/DOKS

Question about node taints with regard to doks-managed 'coredns' deployment

Opened this issue · 1 comment

Hello 👋

I have a bug report/question regarding the title:

I recently created a new cluster on DO with the following Terraform configuration:

resource "digitalocean_kubernetes_cluster" "main" {
  # ...
  node_pool {
    # ...
    labels     = {}
    tags       = []
    taint {
      key    = "x-resource-kind"
      value  = "apps"
      effect = "NoSchedule"
    }
  }
}

resource "digitalocean_kubernetes_node_pool" "pool-main-storages" {
  # ...
  labels     = {}
  tags       = []
  taint {
    key    = "x-resource-kind"
    value  = "storages"
    effect = "NoSchedule"
  }
}

Basically, I want newly spawned nodes to automatically be given a taint, since I want to control which nodes my current/future pods land on for internal usage. The clusters and node pools are created fine, and so is the taint:

captain@glados:~$ kubectl describe nodes pool-main-fv5zb
# ...
Taints:             x-resource-kind=apps:NoSchedule
# ...
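For completeness: workloads meant for these tainted pools then opt in with a matching toleration in their pod spec. A minimal sketch, mirroring the taint above:

tolerations:
  - key: x-resource-kind
    operator: Equal
    value: apps         # or "storages" for the storage pool
    effect: NoSchedule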

But I noticed that one of the deployments is not running (coredns):

captain@glados:~$ kubectl get deployment -n kube-system
NAME              READY   UP-TO-DATE   AVAILABLE   AGE
cilium-operator   1/1     1            1           10h
coredns           0/2     2            0           10h

captain@glados:~$ kubectl get pods -n kube-system
NAME                              READY   STATUS    RESTARTS   AGE
cilium-operator-98d97cdf6-phw2j   1/1     Running   0          10h
cilium-plbv2                      1/1     Running   0          10h
coredns-575d7877bb-9sxdl          0/1     Pending   0          10h
coredns-575d7877bb-pwjtl          0/1     Pending   0          10h
cpc-bridge-proxy-hl55s            1/1     Running   0          10h
konnectivity-agent-dcgsg          1/1     Running   0          10h
kube-proxy-zfn9p                  1/1     Running   0          10h

captain@glados:~$ kubectl describe pod/coredns-575d7877bb-9sxdl -n kube-system
# ...
Events:
  Type     Reason             Age                      From                Message
  ----     ------             ----                     ----                -------
  Warning  FailedScheduling   31m (x118 over 10h)      default-scheduler   0/1 nodes are available: 1 node(s) had untolerated taint {x-resource-kind: apps}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
  Normal   NotTriggerScaleUp  2m10s (x431 over 7h16m)  cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) had untolerated taint {x-resource-kind: apps}

Is this expected? From the logs I understand why it didn't trigger the scale-up; I just don't know whether this is the proper behaviour or not.

Also, the other kube-system pods/deployments are running fine, I think because their tolerations are set up to "always tolerate everything":

captain@glados:~$ kubectl describe pod/cilium-plbv2 -n kube-system
# ...
Tolerations:                 op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
# ...

versus

captain@glados:~$ kubectl describe pod/coredns-575d7877bb-9sxdl -n kube-system
# ...
Tolerations:                 CriticalAddonsOnly op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
# ...

As per the reference.
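For context, a toleration with operator: Exists and no key (like the first cilium entry above) matches every taint. A minimal sketch of such an entry in a pod spec, purely illustrative:

tolerations:
  - operator: Exists    # no key or effect given: tolerates all taints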

If this is expected, you can close this issue. If not, then maybe the default deployment needs to be adjusted? Though I don't know whether this would affect others.

Hey 👋

Most of the managed workloads deployed into the data plane are critical DaemonSets that must or should always run to provide core functionality. CoreDNS is in a bit of a mixed state: it provides core functionality, but it must also run on a worker node that is considered healthy (and it should be moved/evicted to a healthy one should its hosting node become unhealthy).

I think we haven't revisited the current tolerations in a while, so there's possibly an opportunity to improve here. That said, I'd be hesitant to give it a blanket toleration since, for instance, we wouldn't want CoreDNS to continue running on a node that's under memory pressure.

There have also been requests from customers to support extending the list of tolerations on CoreDNS, to make it better suit custom taints associated with node pools (as you did). This is something we're also considering.
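For illustration, such an extended toleration for CoreDNS matching the taints above might look like this. This is a hypothetical sketch, not a committed design, and manual edits to the managed coredns deployment may be reconciled away:

tolerations:
  - key: x-resource-kind
    operator: Exists    # would tolerate both the "apps" and "storages" values
    effect: NoSchedule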

With that background shared, I'd be curious to hear what people's preferences are (and why, if non-obvious). This would help us plan the right next move in this context.