netdata/helmchart

Netdata deployment issue: PersistentVolume provisioning failure and child pods not loading on k3s cluster

Closed this issue · 5 comments

Garahk commented

Hello team, I am deploying Netdata on two nodes, each running its own k3s cluster. I deployed Netdata with the Helm chart available on GitHub.

On the first node:

The PersistentVolume (PV) objects are not being created; only the PersistentVolumeClaim (PVC) objects are present.
The events for the PVCs show that they are waiting for the first consumer to be created before binding, and that the external provisioner is provisioning the volume. However, provisioning fails with a timeout error.
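
For context, the "waiting for first consumer" part comes from the StorageClass's volumeBindingMode. Something like the following shows what the cluster is using (assuming the k3s default local-path class):

# The k3s default "local-path" StorageClass uses volumeBindingMode: WaitForFirstConsumer,
# so a PVC stays Pending until a pod that consumes it is scheduled.
$ kubectl get storageclass local-path -o yaml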

On the second node:

The Netdata parent pod loads successfully, as does the k8s_state pod, but the child pods fail to load.
The events for the child pods indicate that there are no available ports on the node for the requested pod ports, and no preemption victims are found.

Is this something related to my setup, or is there anything you can shed light on here? Is Netdata ready for k3s?

Both nodes are almost identical, so I am puzzled as to why the PV/PVC are fine on one of them while they are not being created on the other.

Please let me know what logs I can provide; your help would be much appreciated.

Garahk commented

Wrong place to create the issue.

ilyam8 commented

Hi, @Garahk. I think this is the correct repository for the issue. It doesn't look like a Netdata issue, but something with your setup.

The events for the child pods indicate that there are no available ports on the node for the requested pod ports

Can you show the exact error?

Garahk commented

Hi, @Garahk. I think this is the correct repository for the issue. It doesn't look like a Netdata issue, but something with your setup.

The events for the child pods indicate that there are no available ports on the node for the requested pod ports

Can you show the exact error?

Sure,

1.- Below is the node 1 child pod description; see the events for more information:

$ kubectl describe pod netdata-child-5nf69 
Name:             netdata-child-5nf69
Namespace:        alo
Priority:         0
Service Account:  netdata
Node:             <none>
Labels:           app=netdata
                  controller-revision-hash=67c4f6d95f
                  pod-template-generation=1
                  release=netdata
                  role=child
Annotations:      checksum/config: 5c478d92bfbe2962128b0d7d8971d60598774fa52c598ce0bb212703b319e0e9
                  container.apparmor.security.beta.kubernetes.io/netdata: unconfined
Status:           Pending
IP:               
IPs:              <none>
Controlled By:    DaemonSet/netdata-child
Init Containers:
  init-persistence:
    Image:      alpine:3.14.2
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/sh
    Args:
      -c
       chmod 777 /persistencevarlibdir; 
    Requests:
      cpu:        10m
    Environment:  <none>
    Mounts:
      /persistencevarlibdir from persistencevarlibdir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-t9pmf (ro)
Containers:
  netdata:
    Image:      netdata/netdata:v1.38.1
    Port:       19999/TCP
    Host Port:  19999/TCP
    Liveness:   http-get http://:http/api/v1/info delay=0s timeout=1s period=30s #success=1 #failure=3
    Readiness:  http-get http://:http/api/v1/info delay=0s timeout=1s period=30s #success=1 #failure=3
    Environment:
      MY_POD_NAME:                     netdata-child-5nf69 (v1:metadata.name)
      MY_NODE_NAME:                     (v1:spec.nodeName)
      MY_POD_NAMESPACE:                alo (v1:metadata.namespace)
      NETDATA_LISTENER_PORT:           19999
      NETDATA_PLUGINS_GOD_WATCH_PATH:  /etc/netdata/go.d/sd/go.d.yml
      DO_NOT_TRACK:                    1
      HOME:                            /etc/netdata
    Mounts:
      /etc/netdata/go.d.conf from config (rw,path="go.d")
      /etc/netdata/go.d/k8s_kubelet.conf from config (rw,path="kubelet")
      /etc/netdata/go.d/k8s_kubeproxy.conf from config (rw,path="kubeproxy")
      /etc/netdata/go.d/sd/ from sd-shared (rw)
      /etc/netdata/netdata.conf from config (rw,path="netdata")
      /etc/netdata/stream.conf from config (rw,path="stream")
      /host/ from root (ro)
      /host/etc/os-release from os-release (rw)
      /host/proc from proc (ro)
      /host/sys from sys (rw)
      /var/lib/netdata from persistencevarlibdir (rw)
      /var/run/docker.sock from run (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-t9pmf (ro)
  sd:
    Image:      netdata/agent-sd:v0.2.8
    Port:       <none>
    Host Port:  <none>
    Limits:
      cpu:     50m
      memory:  150Mi
    Requests:
      cpu:     50m
      memory:  100Mi
    Environment:
      NETDATA_SD_CONFIG_MAP:  netdata-child-sd-config-map:config.yml
      MY_POD_NAMESPACE:       alo (v1:metadata.namespace)
      MY_NODE_NAME:            (v1:spec.nodeName)
    Mounts:
      /export/ from sd-shared (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-t9pmf (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  proc:
    Type:          HostPath (bare host directory volume)
    Path:          /proc
    HostPathType:  
  run:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/docker.sock
    HostPathType:  
  sys:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  
  os-release:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/os-release
    HostPathType:  
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      netdata-conf-child
    Optional:  false
  persistencevarlibdir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/netdata-k8s-child/var/lib/netdata
    HostPathType:  DirectoryOrCreate
  sd-shared:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  kube-api-access-t9pmf:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 :NoSchedule op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  7m40s (x410 over 22h)  default-scheduler  0/1 nodes are available: 1 node(s) didn't have free ports for the requested pod ports. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.

2.- Next is the parent pod, still on node 1, with the PV/PVC issue:

$ kubectl describe pod netdata-parent-868665b4dc-ftjb8  
Name:             netdata-parent-868665b4dc-ftjb8
Namespace:        alo
Priority:         0
Service Account:  netdata
Node:             <none>
Labels:           app=netdata
                  pod-template-hash=868665b4dc
                  release=netdata
                  role=parent
Annotations:      checksum/config: 5c478d92bfbe2962128b0d7d8971d60598774fa52c598ce0bb212703b319e0e9
Status:           Pending
IP:               
IPs:              <none>
Controlled By:    ReplicaSet/netdata-parent-868665b4dc
Containers:
  netdata:
    Image:      netdata/netdata:v1.38.1
    Port:       19999/TCP
    Host Port:  0/TCP
    Liveness:   http-get http://:http/api/v1/info delay=0s timeout=1s period=30s #success=1 #failure=3
    Readiness:  http-get http://:http/api/v1/info delay=0s timeout=1s period=30s #success=1 #failure=3
    Environment:
      MY_POD_NAME:            netdata-parent-868665b4dc-ftjb8 (v1:metadata.name)
      MY_POD_NAMESPACE:       alo (v1:metadata.namespace)
      NETDATA_LISTENER_PORT:  19999
      DO_NOT_TRACK:           1
      HOME:                   /etc/netdata
    Mounts:
      /etc/netdata/netdata.conf from config (rw,path="netdata")
      /etc/netdata/stream.conf from config (rw,path="stream")
      /host/etc/os-release from os-release (rw)
      /var/cache/netdata from database (rw)
      /var/lib/netdata from alarms (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-t9282 (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  os-release:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/os-release
    HostPathType:  
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      netdata-conf-parent
    Optional:  false
  database:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  netdata-parent-database
    ReadOnly:   false
  alarms:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  netdata-parent-alarms
    ReadOnly:   false
  kube-api-access-t9282:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  60m   default-scheduler  running PreBind plugin "VolumeBinding": binding volumes: timed out waiting for the condition
  Warning  FailedScheduling  50m   default-scheduler  running PreBind plugin "VolumeBinding": binding volumes: timed out waiting for the condition
  Warning  FailedScheduling  30m   default-scheduler  running PreBind plugin "VolumeBinding": binding volumes: timed out waiting for the condition
  Warning  FailedScheduling  20m   default-scheduler  running PreBind plugin "VolumeBinding": binding volumes: timed out waiting for the condition

3.- Here is an example of a PVC that is waiting for the PV to be created:

$ kubectl describe pvc netdata-parent-database
Name:          netdata-parent-database
Namespace:     nia
StorageClass:  local-path
Status:        Pending
Volume:        
Labels:        app=netdata
               app.kubernetes.io/managed-by=Helm
               chart=netdata-3.7.41
               heritage=Helm
               release=netdata
               role=parent
Annotations:   meta.helm.sh/release-name: netdata
               meta.helm.sh/release-namespace: nia
               volume.beta.kubernetes.io/storage-provisioner: rancher.io/local-path
               volume.kubernetes.io/selected-node: nia-datacollector-162.bete.ericy.com
               volume.kubernetes.io/storage-provisioner: rancher.io/local-path
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      
Access Modes:  
VolumeMode:    Filesystem
Used By:       netdata-parent-868665b4dc-ftjb8
Events:
  Type    Reason                Age                     From                         Message
  ----    ------                ----                    ----                         -------
  Normal  ExternalProvisioning  4m25s (x5399 over 22h)  persistentvolume-controller  waiting for a volume to be created, either by external provisioner "rancher.io/local-path" or manually created by system administrator

4.- And for node 2, the PV and PVC were created successfully, but the child pod shows the same event as the node 1 pod:

$ kubectl describe pod netdata-child-wk8bk
Name:             netdata-child-wk8bk
Namespace:        alo
Priority:         0
Service Account:  netdata
Node:             <none>
Labels:           app=netdata
                  controller-revision-hash=5c9c67f586
                  pod-template-generation=1
                  release=netdata
                  role=child
Annotations:      checksum/config: 5c478d92bfbe2962128b0d7d8971d60598774fa52c598ce0bb212703b319e0e9
                  container.apparmor.security.beta.kubernetes.io/netdata: unconfined
Status:           Pending
IP:               
IPs:              <none>
Controlled By:    DaemonSet/netdata-child
Init Containers:
  init-persistence:
    Image:      alpine:3.14.2
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/sh
    Args:
      -c
       chmod 777 /persistencevarlibdir; 
    Requests:
      cpu:        10m
    Environment:  <none>
    Mounts:
      /persistencevarlibdir from persistencevarlibdir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ctsqg (ro)
Containers:
  netdata:
    Image:      netdata/netdata:v1.38.1
    Port:       19999/TCP
    Host Port:  19999/TCP
    Liveness:   http-get http://:http/api/v1/info delay=0s timeout=1s period=30s #success=1 #failure=3
    Readiness:  http-get http://:http/api/v1/info delay=0s timeout=1s period=30s #success=1 #failure=3
    Environment:
      MY_POD_NAME:                     netdata-child-wk8bk (v1:metadata.name)
      MY_NODE_NAME:                     (v1:spec.nodeName)
      MY_POD_NAMESPACE:                alo (v1:metadata.namespace)
      NETDATA_LISTENER_PORT:           19999
      NETDATA_PLUGINS_GOD_WATCH_PATH:  /etc/netdata/go.d/sd/go.d.yml
      DO_NOT_TRACK:                    1
      HOME:                            /etc/netdata
    Mounts:
      /etc/netdata/go.d.conf from config (rw,path="go.d")
      /etc/netdata/go.d/k8s_kubelet.conf from config (rw,path="kubelet")
      /etc/netdata/go.d/k8s_kubeproxy.conf from config (rw,path="kubeproxy")
      /etc/netdata/go.d/sd/ from sd-shared (rw)
      /etc/netdata/netdata.conf from config (rw,path="netdata")
      /etc/netdata/ssl/cert from secret (ro)
      /etc/netdata/ssl/key from key (ro)
      /etc/netdata/stream.conf from config (rw,path="stream")
      /host/ from root (ro)
      /host/etc/os-release from os-release (rw)
      /host/proc from proc (ro)
      /host/sys from sys (rw)
      /var/lib/netdata from persistencevarlibdir (rw)
      /var/run/docker.sock from run (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ctsqg (ro)
  sd:
    Image:      netdata/agent-sd:v0.2.8
    Port:       <none>
    Host Port:  <none>
    Limits:
      cpu:     50m
      memory:  150Mi
    Requests:
      cpu:     50m
      memory:  100Mi
    Environment:
      NETDATA_SD_CONFIG_MAP:  netdata-child-sd-config-map:config.yml
      MY_POD_NAMESPACE:       alo (v1:metadata.namespace)
      MY_NODE_NAME:            (v1:spec.nodeName)
    Mounts:
      /export/ from sd-shared (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ctsqg (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  proc:
    Type:          HostPath (bare host directory volume)
    Path:          /proc
    HostPathType:  
  run:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/docker.sock
    HostPathType:  
  sys:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  
  os-release:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/os-release
    HostPathType:  
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      netdata-conf-child
    Optional:  false
  persistencevarlibdir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/netdata-k8s-child/var/lib/netdata
    HostPathType:  DirectoryOrCreate
  sd-shared:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  secret:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  eni-health-check-ssl-certificate-secret
    Optional:    false
  key:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  eni-health-check-ssl-certificate-key-secret
    Optional:    false
  kube-api-access-ctsqg:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 :NoSchedule op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  26m (x298 over 24h)  default-scheduler  0/1 nodes are available: 1 node(s) didn't have free ports for the requested pod ports. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.

Garahk commented

I found the cause behind the PV and PVC issue: it was a matter of not enough space, so our garbage collector was automatically cleaning up.
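
For anyone hitting the same symptom, a quick way to confirm it is to check free space under the provisioner's storage directory and to look at the provisioner's own logs. The path and deployment name below are the k3s defaults, so adjust them if your setup differs:

# k3s local-path volumes are created under /var/lib/rancher/k3s/storage by default
$ df -h /var/lib/rancher/k3s/storage

# The local-path provisioner logs failed provisioning attempts
$ kubectl -n kube-system logs deploy/local-path-provisioner --tail=50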

However, the issue with the ports remains; I still haven't found the cause or a solution.

ilyam8 commented

However, the issue with the ports remains; I still haven't found the cause or a solution.

Hello, check if ports 19999/8125 are already in use by some other application on your hosts.
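
Something like the following should show it (the jq filter is only an illustrative sketch, assuming jq is available where you run kubectl):

# On the host itself: is anything already listening on the Netdata ports?
$ sudo ss -tlnp | grep -E ':(19999|8125)\b'

# In the cluster: which pods already request hostPort 19999?
$ kubectl get pods -A -o json \
    | jq -r '.items[] | select(any(.spec.containers[].ports[]?; .hostPort == 19999)) | "\(.metadata.namespace)/\(.metadata.name)"'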