Azure/azure-container-networking

Cilium-cni keeps retrying an occupied pod IP

kuzhao opened this issue · 4 comments

kuzhao commented

What happened:
cilium-cni seems to be repeatedly requesting the same already-occupied pod IP (10.0.55.133) for new endpoints.

What you expected to happen:
The CNI should scan the pool, move on to the next available address, and retry, instead of repeatedly proposing the same occupied one; a sketch of that behavior follows.
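For illustration, here is a minimal Go sketch of that expected behavior. It is hypothetical and not the actual azure-container-networking or Cilium code; inUse and nextFreeAddr are made-up names standing in for the plugin's real occupancy state and allocator.

```go
package main

import (
	"errors"
	"fmt"
	"net/netip"
)

// inUse is a stand-in for whatever source of truth the plugin consults about
// occupied addresses (e.g. the agent's endpoint map); it is illustrative only.
var inUse = map[netip.Addr]bool{}

// nextFreeAddr walks the prefix and returns the first address that is not
// already occupied, instead of repeatedly proposing the same one.
func nextFreeAddr(prefix netip.Prefix) (netip.Addr, error) {
	for a := prefix.Addr(); prefix.Contains(a); a = a.Next() {
		if !inUse[a] {
			return a, nil
		}
	}
	return netip.Addr{}, errors.New("pool exhausted")
}

func main() {
	// podCIDR from the report; pretend .128 through .133 are taken,
	// including the stuck 10.0.55.133 seen in the logs.
	cidr := netip.MustParsePrefix("10.0.55.128/25")
	for a, n := cidr.Addr(), 0; n < 6; a, n = a.Next(), n+1 {
		inUse[a] = true
	}

	addr, err := nextFreeAddr(cidr)
	if err != nil {
		fmt.Println("allocation failed:", err)
		return
	}
	fmt.Println("allocated", addr) // prints: allocated 10.0.55.134
}
```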

How to reproduce it:
No explicit reproduction steps. The cluster has a single node with ~20 pods running and a podCIDR of /25.
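For scale: a /25 podCIDR contains 2^(32-25) = 128 addresses, so with ~20 pods the pool is nowhere near exhausted; the failures below involve one specific stuck address (10.0.55.133) rather than IP exhaustion.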

Orchestrator and Version (e.g. Kubernetes, Docker):
K8s v1.24.9

Operating System (Linux/Windows):
Linux

Kernel (e.g. uname -a for Linux or $(Get-ItemProperty -Path "C:\windows\system32\hal.dll").VersionInfo.FileVersion for Windows):
5.4.0-1104-azure

Anything else we need to know?:
Relevant logs from the cilium agent:

2023-03-30T07:57:21.089254816Z level=info msg="Create endpoint request" addressing="&{10.0.55.133/25   }" containerID=1afa583ef3137114a1eda4e76123b5c88216a1eeeb5a5327743ad05d09aa8b09 datapathConfiguration="&{false true true false true 0xc001867e68}" interface=lxc43ab54d9dcf2 k8sPodName=default/details-v1-7d4d9d5fcb-zd5k9 labels="[]" subsys=daemon sync-build=true
2023-03-30T07:57:21.089273190Z level=warning msg="Creation of endpoint failed due to invalid data" containerID= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=0 error="IP ipv4:10.0.55.133 is already in use" ipv4= ipv6= k8sPodName=/ subsys=daemon
2023-03-30T07:57:26.045821788Z level=info msg="Create endpoint request" addressing="&{10.0.55.133/25   }" containerID=f6d721759cdd2a7cc4e46b8de7d233c09c3172236b44ca2b25607366f6f6a970 datapathConfiguration="&{false true true false true 0xc0008c2aa8}" interface=lxcab4ec2801bd9 k8sPodName=default/details-v1-7d4d9d5fcb-7nvxc labels="[]" subsys=daemon sync-build=true
2023-03-30T07:57:26.045871090Z level=warning msg="Creation of endpoint failed due to invalid data" containerID= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=0 error="IP ipv4:10.0.55.133 is already in use" ipv4= ipv6= k8sPodName=/ subsys=daemon
2023-03-30T07:57:36.073875809Z level=info msg="Create endpoint request" addressing="&{10.0.55.133/25   }" containerID=4401f8135da825e6668c3c1f454625cf753eeb838d9d1c6886dfec27b3a936d7 datapathConfiguration="&{false true true false true 0xc000d80880}" interface=lxc86700fb898a9 k8sPodName=default/details-v1-7d4d9d5fcb-zd5k9 labels="[]" subsys=daemon sync-build=true
2023-03-30T07:57:36.073894814Z level=warning msg="Creation of endpoint failed due to invalid data" containerID= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=0 error="IP ipv4:10.0.55.133 is already in use" ipv4= ipv6= k8sPodName=/ subsys=daemon
2023-03-30T07:57:38.052279461Z level=info msg="Create endpoint request" addressing="&{10.0.55.133/25   }" containerID=76c88524340dd8d668dd3851806cac275b6b089455e41474f21c0a954be0281f datapathConfiguration="&{false true true false true 0xc002183408}" interface=lxc3795de5a5a8f k8sPodName=default/details-v1-7d4d9d5fcb-7nvxc labels="[]" subsys=daemon sync-build=true
2023-03-30T07:57:38.052337128Z level=warning msg="Creation of endpoint failed due to invalid data" containerID= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=0 error="IP ipv4:10.0.55.133 is already in use" ipv4= ipv6= k8sPodName=/ subsys=daemon
2023-03-30T07:57:50.115119160Z level=info msg="Create endpoint request" addressing="&{10.0.55.133/25   }" containerID=25f95864c55dc647d79ab9bf7d957fc7b6c35ba5fe26973e77c6bc82108b2554 datapathConfiguration="&{false true true false true 0xc0011d0e38}" interface=lxc40fd61ed981f k8sPodName=default/details-v1-7d4d9d5fcb-zd5k9 labels="[]" subsys=daemon sync-build=true
2023-03-30T07:57:50.115130631Z level=warning msg="Creation of endpoint failed due to invalid data" containerID= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=0 error="IP ipv4:10.0.55.133 is already in use" ipv4= ipv6= k8sPodName=/ subsys=daemon
2023-03-30T07:58:04.105159884Z level=info msg="Create endpoint request" addressing="&{10.0.55.133/25   }" containerID=17efd1d625e451ec81040ab99c6dc70e583d2a5e28ff1ed342a1b33252651933 datapathConfiguration="&{false true true false true 0xc00139e9d8}" interface=lxc31e034abeac8 k8sPodName=default/details-v1-7d4d9d5fcb-zd5k9 labels="[]" subsys=daemon sync-build=true
2023-03-30T07:58:04.105175894Z level=warning msg="Creation of endpoint failed due to invalid data" containerID= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=0 error="IP ipv4:10.0.55.133 is already in use" ipv4= ipv6= k8sPodName=/ subsys=daemon
2023-03-30T07:58:16.064801141Z level=info msg="Create endpoint request" addressing="&{10.0.55.133/25   }" containerID=b41a828ec0f791d479b3b3c72cfcb35167d654647a33e9c145775d41b2215d5b datapathConfiguration="&{false true true false true 0xc00260dd88}" interface=lxcaabe95a91d26 k8sPodName=default/details-v1-7d4d9d5fcb-zd5k9 labels="[]" subsys=daemon sync-build=true
2023-03-30T07:58:16.064819065Z level=warning msg="Creation of endpoint failed due to invalid data" containerID= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=0 error="IP ipv4:10.0.55.133 is already in use" ipv4= ipv6= k8sPodName=/ subsys=daemon
2023-03-30T07:58:17.063078792Z level=info msg="Create endpoint request" addressing="&{10.0.55.133/25   }" containerID=1e2f65a1dcbcea1fa412aaf7ce01cc940d374a57f09805e52ec6644ffe495676 datapathConfiguration="&{false true true false true 0xc00115d730}" interface=lxcbc2d4aec349e k8sPodName=default/details-v1-7d4d9d5fcb-7nvxc labels="[]" subsys=daemon sync-build=true
2023-03-30T07:58:17.063098679Z level=warning msg="Creation of endpoint failed due to invalid data" containerID= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=0 error="IP ipv4:10.0.55.133 is already in use" ipv4= ipv6= k8sPodName=/ subsys=daemon
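Note the pattern: every "Create endpoint request" proposes the same address 10.0.55.133/25, each time with a fresh containerID and lxc interface, and for two different pods (details-v1-7d4d9d5fcb-zd5k9 and details-v1-7d4d9d5fcb-7nvxc). That is consistent with kubelet tearing down each failed sandbox and the CNI handing out the same IP again on the next attempt. Running `cilium endpoint list` inside the cilium agent pod on the node should show which endpoint, if any, still owns 10.0.55.133.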

Corresponding kube event pattern:

(combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "95c524ebf566231186d5edfde3262c0bf11f921354040ee5044355ba8189d238": plugin type="cilium-cni" failed (add): Unable to create endpoint: [PUT /endpoint/{id}][400] putEndpointIdInvalid IP ipv4:10.0.55.133 is already in use
Source: kubelet aks-nodepool1-37139537-vmss000000 | Count: 2 | Last seen: 2023-03-30T07:58:59Z

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "cafec58e7d647c3900ae8165904ddc2e002a22838c7b72a2e8ed13b093217f26": plugin type="cilium-cni" failed (add): Unable to create endpoint: [PUT /endpoint/{id}][400] putEndpointIdInvalid IP ipv4:10.0.55.133 is already in use
Source: kubelet aks-nodepool1-37139537-vmss000000 | Count: 1 | Last seen: 2023-03-30T07:58:30Z

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "b41a828ec0f791d479b3b3c72cfcb35167d654647a33e9c145775d41b2215d5b": plugin type="cilium-cni" failed (add): Unable to create endpoint: [PUT /endpoint/{id}][400] putEndpointIdInvalid IP ipv4:10.0.55.133 is already in use
Source: kubelet aks-nodepool1-37139537-vmss000000 | Count: 1 | Last seen: 2023-03-30T07:58:16Z

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "17efd1d625e451ec81040ab99c6dc70e583d2a5e28ff1ed342a1b33252651933": plugin type="cilium-cni" failed (add): Unable to create endpoint: [PUT /endpoint/{id}][400] putEndpointIdInvalid IP ipv4:10.0.55.133 is already in use
Source: kubelet aks-nodepool1-37139537-vmss000000 | Count: 1 | Last seen: 2023-03-30T07:58:04Z

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "25f95864c55dc647d79ab9bf7d957fc7b6c35ba5fe26973e77c6bc82108b2554": plugin type="cilium-cni" failed (add): Unable to create endpoint: [PUT /endpoint/{id}][400] putEndpointIdInvalid IP ipv4:10.0.55.133 is already in use
Source: kubelet aks-nodepool1-37139537-vmss000000 | Count: 1 | Last seen: 2023-03-30T07:57:50Z
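Several of the sandbox IDs above match containerIDs in the agent log excerpt (e.g. b41a828e... at 07:58:16, 17efd1d6... at 07:58:04, and 25f95864... at 07:57:50), so the kube events and the agent warnings are the same repeated allocation attempts for 10.0.55.133 surfacing at two layers, not separate problems.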

Can you share AKS Cluster FQDN?

kuzhao commented

> Can you share AKS Cluster FQDN?

Just sent over a private channel.

@kuzhao we recently fixed this issue and started the rollout. Can you send me the FQDN as well to confirm?

kuzhao commented

> @kuzhao we recently fixed this issue and started the rollout. Can you send me the FQDN as well to confirm?

OK, let me double-check in my cluster with a few more deployments. Thanks!
Will close this once validated.