Netdata deployment issue: PersistentVolume provisioning failure and child pods not loading on k3s cluster
Closed this issue · 5 comments
Hello team, I am deploying Netdata on 2 nodes, each running a k3s cluster. I deployed Netdata with the Helm chart available on GitHub.
On the first node:
The PersistentVolume (PV) objects are not being created, only the PersistentVolumeClaim (PVC) objects are present.
The events for the PVCs show that they are waiting for the first consumer to be created before binding, and that the external provisioner is provisioning the volume. However, provisioning is failing with a timeout error (a sketch of the checks I can run is at the end of this post).
On the second node:
The Netdata parent pod loads successfully, as does the k8s_state pod, but the child pods fail to load.
The events for the child pods indicate that there are no available ports on the node for the requested pod ports, and no preemption victims are found.
Is this something related to my setup, or can you shed some light on this issue? Is Netdata ready for k3s?
Both nodes are almost identical, so I am puzzled as to why the PV/PVC are fine on one of them while they are not being created on the other.
Please let me know what logs I can provide; your help would be very much appreciated.
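In case it helps, these are the checks I can run and share output from (a sketch, assuming the stock k3s setup where the rancher.io/local-path provisioner runs as a deployment in kube-system; names may differ on a customized install):

$ kubectl -n <release-namespace> get pvc
$ kubectl -n <release-namespace> describe pvc netdata-parent-database
# Provisioner logs; k3s ships local-path-provisioner in kube-system by default
$ kubectl -n kube-system logs deploy/local-path-provisioner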
Wrong place to create issue
Hi, @Garahk. I think this is the correct repository for the issue. It doesn't look like a Netdata issue, but something with your setup.
The events for the child pods indicate that there are no available ports on the node for the requested pod ports
Can you show the exact error?
Sure,
1. Below is the node 1 child pod description; see the events for more information:
$ kubectl describe pod netdata-child-5nf69
Name: netdata-child-5nf69
Namespace: alo
Priority: 0
Service Account: netdata
Node: <none>
Labels: app=netdata
controller-revision-hash=67c4f6d95f
pod-template-generation=1
release=netdata
role=child
Annotations: checksum/config: 5c478d92bfbe2962128b0d7d8971d60598774fa52c598ce0bb212703b319e0e9
container.apparmor.security.beta.kubernetes.io/netdata: unconfined
Status: Pending
IP:
IPs: <none>
Controlled By: DaemonSet/netdata-child
Init Containers:
init-persistence:
Image: alpine:3.14.2
Port: <none>
Host Port: <none>
Command:
/bin/sh
Args:
-c
chmod 777 /persistencevarlibdir;
Requests:
cpu: 10m
Environment: <none>
Mounts:
/persistencevarlibdir from persistencevarlibdir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-t9pmf (ro)
Containers:
netdata:
Image: netdata/netdata:v1.38.1
Port: 19999/TCP
Host Port: 19999/TCP
Liveness: http-get http://:http/api/v1/info delay=0s timeout=1s period=30s #success=1 #failure=3
Readiness: http-get http://:http/api/v1/info delay=0s timeout=1s period=30s #success=1 #failure=3
Environment:
MY_POD_NAME: netdata-child-5nf69 (v1:metadata.name)
MY_NODE_NAME: (v1:spec.nodeName)
MY_POD_NAMESPACE: alo (v1:metadata.namespace)
NETDATA_LISTENER_PORT: 19999
NETDATA_PLUGINS_GOD_WATCH_PATH: /etc/netdata/go.d/sd/go.d.yml
DO_NOT_TRACK: 1
HOME: /etc/netdata
Mounts:
/etc/netdata/go.d.conf from config (rw,path="go.d")
/etc/netdata/go.d/k8s_kubelet.conf from config (rw,path="kubelet")
/etc/netdata/go.d/k8s_kubeproxy.conf from config (rw,path="kubeproxy")
/etc/netdata/go.d/sd/ from sd-shared (rw)
/etc/netdata/netdata.conf from config (rw,path="netdata")
/etc/netdata/stream.conf from config (rw,path="stream")
/host/ from root (ro)
/host/etc/os-release from os-release (rw)
/host/proc from proc (ro)
/host/sys from sys (rw)
/var/lib/netdata from persistencevarlibdir (rw)
/var/run/docker.sock from run (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-t9pmf (ro)
sd:
Image: netdata/agent-sd:v0.2.8
Port: <none>
Host Port: <none>
Limits:
cpu: 50m
memory: 150Mi
Requests:
cpu: 50m
memory: 100Mi
Environment:
NETDATA_SD_CONFIG_MAP: netdata-child-sd-config-map:config.yml
MY_POD_NAMESPACE: alo (v1:metadata.namespace)
MY_NODE_NAME: (v1:spec.nodeName)
Mounts:
/export/ from sd-shared (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-t9pmf (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
proc:
Type: HostPath (bare host directory volume)
Path: /proc
HostPathType:
run:
Type: HostPath (bare host directory volume)
Path: /var/run/docker.sock
HostPathType:
sys:
Type: HostPath (bare host directory volume)
Path: /sys
HostPathType:
os-release:
Type: HostPath (bare host directory volume)
Path: /etc/os-release
HostPathType:
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: netdata-conf-child
Optional: false
persistencevarlibdir:
Type: HostPath (bare host directory volume)
Path: /var/lib/netdata-k8s-child/var/lib/netdata
HostPathType: DirectoryOrCreate
sd-shared:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
kube-api-access-t9pmf:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: :NoSchedule op=Exists
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/network-unavailable:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 7m40s (x410 over 22h) default-scheduler 0/1 nodes are available: 1 node(s) didn't have free ports for the requested pod ports. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
2. Next is the parent pod, still on node 1, with the PV/PVC issue:
$ kubectl describe pod netdata-parent-868665b4dc-ftjb8
Name: netdata-parent-868665b4dc-ftjb8
Namespace: alo
Priority: 0
Service Account: netdata
Node: <none>
Labels: app=netdata
pod-template-hash=868665b4dc
release=netdata
role=parent
Annotations: checksum/config: 5c478d92bfbe2962128b0d7d8971d60598774fa52c598ce0bb212703b319e0e9
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/netdata-parent-868665b4dc
Containers:
netdata:
Image: netdata/netdata:v1.38.1
Port: 19999/TCP
Host Port: 0/TCP
Liveness: http-get http://:http/api/v1/info delay=0s timeout=1s period=30s #success=1 #failure=3
Readiness: http-get http://:http/api/v1/info delay=0s timeout=1s period=30s #success=1 #failure=3
Environment:
MY_POD_NAME: netdata-parent-868665b4dc-ftjb8 (v1:metadata.name)
MY_POD_NAMESPACE: alo (v1:metadata.namespace)
NETDATA_LISTENER_PORT: 19999
DO_NOT_TRACK: 1
HOME: /etc/netdata
Mounts:
/etc/netdata/netdata.conf from config (rw,path="netdata")
/etc/netdata/stream.conf from config (rw,path="stream")
/host/etc/os-release from os-release (rw)
/var/cache/netdata from database (rw)
/var/lib/netdata from alarms (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-t9282 (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
os-release:
Type: HostPath (bare host directory volume)
Path: /etc/os-release
HostPathType:
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: netdata-conf-parent
Optional: false
database:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: netdata-parent-database
ReadOnly: false
alarms:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: netdata-parent-alarms
ReadOnly: false
kube-api-access-t9282:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 60m default-scheduler running PreBind plugin "VolumeBinding": binding volumes: timed out waiting for the condition
Warning FailedScheduling 50m default-scheduler running PreBind plugin "VolumeBinding": binding volumes: timed out waiting for the condition
Warning FailedScheduling 30m default-scheduler running PreBind plugin "VolumeBinding": binding volumes: timed out waiting for the condition
Warning FailedScheduling 20m default-scheduler running PreBind plugin "VolumeBinding": binding volumes: timed out waiting for the condition
3. Here is an example of a PVC that is waiting for the PV to be created:
$ kubectl describe pvc netdata-parent-database
Name: netdata-parent-database
Namespace: nia
StorageClass: local-path
Status: Pending
Volume:
Labels: app=netdata
app.kubernetes.io/managed-by=Helm
chart=netdata-3.7.41
heritage=Helm
release=netdata
role=parent
Annotations: meta.helm.sh/release-name: netdata
meta.helm.sh/release-namespace: nia
volume.beta.kubernetes.io/storage-provisioner: rancher.io/local-path
volume.kubernetes.io/selected-node: nia-datacollector-162.bete.ericy.com
volume.kubernetes.io/storage-provisioner: rancher.io/local-path
Finalizers: [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode: Filesystem
Used By: netdata-parent-868665b4dc-ftjb8
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ExternalProvisioning 4m25s (x5399 over 22h) persistentvolume-controller waiting for a volume to be created, either by external provisioner "rancher.io/local-path" or manually created by system administrator
4. And on node 2, the PV and PVC were created successfully, but the child pod shows the same event as the node 1 pod:
$ kubectl describe pod netdata-child-wk8bk
Name: netdata-child-wk8bk
Namespace: alo
Priority: 0
Service Account: netdata
Node: <none>
Labels: app=netdata
controller-revision-hash=5c9c67f586
pod-template-generation=1
release=netdata
role=child
Annotations: checksum/config: 5c478d92bfbe2962128b0d7d8971d60598774fa52c598ce0bb212703b319e0e9
container.apparmor.security.beta.kubernetes.io/netdata: unconfined
Status: Pending
IP:
IPs: <none>
Controlled By: DaemonSet/netdata-child
Init Containers:
init-persistence:
Image: alpine:3.14.2
Port: <none>
Host Port: <none>
Command:
/bin/sh
Args:
-c
chmod 777 /persistencevarlibdir;
Requests:
cpu: 10m
Environment: <none>
Mounts:
/persistencevarlibdir from persistencevarlibdir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ctsqg (ro)
Containers:
netdata:
Image: netdata/netdata:v1.38.1
Port: 19999/TCP
Host Port: 19999/TCP
Liveness: http-get http://:http/api/v1/info delay=0s timeout=1s period=30s #success=1 #failure=3
Readiness: http-get http://:http/api/v1/info delay=0s timeout=1s period=30s #success=1 #failure=3
Environment:
MY_POD_NAME: netdata-child-wk8bk (v1:metadata.name)
MY_NODE_NAME: (v1:spec.nodeName)
MY_POD_NAMESPACE: alo (v1:metadata.namespace)
NETDATA_LISTENER_PORT: 19999
NETDATA_PLUGINS_GOD_WATCH_PATH: /etc/netdata/go.d/sd/go.d.yml
DO_NOT_TRACK: 1
HOME: /etc/netdata
Mounts:
/etc/netdata/go.d.conf from config (rw,path="go.d")
/etc/netdata/go.d/k8s_kubelet.conf from config (rw,path="kubelet")
/etc/netdata/go.d/k8s_kubeproxy.conf from config (rw,path="kubeproxy")
/etc/netdata/go.d/sd/ from sd-shared (rw)
/etc/netdata/netdata.conf from config (rw,path="netdata")
/etc/netdata/ssl/cert from secret (ro)
/etc/netdata/ssl/key from key (ro)
/etc/netdata/stream.conf from config (rw,path="stream")
/host/ from root (ro)
/host/etc/os-release from os-release (rw)
/host/proc from proc (ro)
/host/sys from sys (rw)
/var/lib/netdata from persistencevarlibdir (rw)
/var/run/docker.sock from run (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ctsqg (ro)
sd:
Image: netdata/agent-sd:v0.2.8
Port: <none>
Host Port: <none>
Limits:
cpu: 50m
memory: 150Mi
Requests:
cpu: 50m
memory: 100Mi
Environment:
NETDATA_SD_CONFIG_MAP: netdata-child-sd-config-map:config.yml
MY_POD_NAMESPACE: alo (v1:metadata.namespace)
MY_NODE_NAME: (v1:spec.nodeName)
Mounts:
/export/ from sd-shared (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ctsqg (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
proc:
Type: HostPath (bare host directory volume)
Path: /proc
HostPathType:
run:
Type: HostPath (bare host directory volume)
Path: /var/run/docker.sock
HostPathType:
sys:
Type: HostPath (bare host directory volume)
Path: /sys
HostPathType:
os-release:
Type: HostPath (bare host directory volume)
Path: /etc/os-release
HostPathType:
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: netdata-conf-child
Optional: false
persistencevarlibdir:
Type: HostPath (bare host directory volume)
Path: /var/lib/netdata-k8s-child/var/lib/netdata
HostPathType: DirectoryOrCreate
sd-shared:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
secret:
Type: Secret (a volume populated by a Secret)
SecretName: eni-health-check-ssl-certificate-secret
Optional: false
key:
Type: Secret (a volume populated by a Secret)
SecretName: eni-health-check-ssl-certificate-key-secret
Optional: false
kube-api-access-ctsqg:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: :NoSchedule op=Exists
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/network-unavailable:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 26m (x298 over 24h) default-scheduler 0/1 nodes are available: 1 node(s) didn't have free ports for the requested pod ports. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
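Since the child DaemonSet requests hostPort 19999 (see Port/Host Port in the pod specs above), one way to look for a conflicting pod is to list every pod that declares a hostPort (a sketch; assumes a recent kubectl with jsonpath support):

$ kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.containers[*].ports[*].hostPort}{"\n"}{end}' | grep 19999

Note this only catches pods; a plain host process holding the port would not show up here.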
I found the cause of the PV/PVC issue: there was not enough disk space, so our garbage collector was cleaning up automatically.
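For anyone hitting the same thing, the shortage can be spotted with checks like these (a sketch; /var/lib/rancher/k3s/storage is the default data directory of the k3s local-path provisioner, so adjust the path if yours is configured differently):

# Free space where local-path volumes are created (default k3s path, an assumption)
$ df -h /var/lib/rancher/k3s/storage
# DiskPressure shows up in the node conditions when the kubelet starts GC/eviction
$ kubectl describe node <node-name> | grep -A 6 'Conditions:'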
However, the issue with the ports remains; I still haven't found the cause or a solution.
Hello, check if ports 19999/8125 are already in use by some other application on your hosts.
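For example (a sketch; ss ships with iproute2, and netstat -tlnp works as a substitute):

# TCP listeners on the Netdata web port and the statsd port
$ sudo ss -tlnp | grep -E ':(19999|8125)'
# statsd traffic is often UDP, so check UDP sockets too
$ sudo ss -ulnp | grep ':8125'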