k8snetworkplumbingwg/whereabouts

[BUG] Duplicate IPs and empty IPPools

Opened this issue · 3 comments

Describe the bug
We are frequently ending up with pods being assigned the same IP, which results in many dueling ARP entries, and much consternation and gnashing of teeth.

We are using the kubernetes method for keeping track of allocations, not the etcd method.

A symptom of this is that the IPPool CRDs frequently end up empty as well.

Expected behavior
Each pod always has a unique IP address and the IPPool CRDs accurately reflect IP allocation.

To Reproduce
We have not managed to reproduce this behavior in a controlled environment. I've tried many different things to make it happen, and none of them have worked. :-(

Environment:

  • Whereabouts version: v0.6.1
  • Kubernetes version (use kubectl version): 1.26.1
  • Network-attachment-definition: Proprietary
  • Whereabouts configuration (on the host):
{
  "datastore": "kubernetes",
  "kubernetes": {
    "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
  },
  "reconciler_cron_expression": "30 4 * * *"
}
  • OS (e.g. from /etc/os-release): Talos
  • Kernel (e.g. uname -a): Linux talos-5i3-rhe 5.15.102-talos #1 SMP Mon Mar 13 18:10:38 UTC 2023 x86_64 Linux
  • Others: N/A

Additional info / context
Pod or node churn seems to be a factor, but we haven't been able to catch the problem 'in the act' yet, so...

wwcd commented

I had the same issue when a node powered down.

In this scenario, the pod is recreated on another node, so the same podref ends up appearing twice in the IPPool's allocations. When the reconciler then runs, there is a chance that the newly allocated address is the one that gets reclaimed, which later results in the same address being handed out twice.

The associated code is as follows.

func getPodRefsServedByWhereabouts(ipPools []storage.IPPool) map[string]void {
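
The "probability" here comes from the fact that the reservation list is built from a map (detailed in the next comment), and Go does not define map iteration order. A minimal standalone sketch of that effect (hypothetical data, not whereabouts code):

// Sketch: a list built by ranging over a map has no fixed order.
package main

import "fmt"

func main() {
	// Hypothetical allocations keyed by offset, mirroring the scenario below.
	allocations := map[string]string{
		"18278": "xxx/pod-1", // stale reservation leaked by the powered-off node
		"18279": "xxx/pod-1", // reservation of the recreated pod
	}
	var order []string
	for offset := range allocations { // Go does not define map iteration order
		order = append(order, offset)
	}
	fmt.Println(order) // may print [18278 18279] on one run and [18279 18278] on another
}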

wwcd commented

The problematic code path can be analyzed as follows:

  1. Assume that after the node is powered off, the pod is recreated and the IPPool contains the following allocations:
     "18278": <-- stale allocation leaked by the powered-off node
       id: 051a974db72a38d2c226aebbf46f596c62c95e2c527e7dae51d6df52d252bb51
       podref: xxx/pod-1
     "18279": <-- new allocation made for the recreated pod
       id: b2a431a89ae18c3132eb3708306a901ea5999da64f10f0a23f14ede25602c01c
       podref: xxx/pod-1
  2. The reconciler then decides what to reclaim. Since the list returned by pool.Allocations() is built from a map, the order of its members is not fixed: it may be [18278, 18279] or [18279, 18278]. Assume it returns [18278, 18279] here. isPodAlive returns false for 18278, because that address does not match the address of the currently running pod, so xxx/pod-1 is marked for reclamation:
func (rl *ReconcileLooper) findOrphanedIPsPerPool(ipPools []storage.IPPool) error {
     for _, pool := range ipPools {
         orphanIP := OrphanedIPReservations{
             Pool: pool,
         }
         for _, ipReservation := range pool.Allocations() {//The order of the returned list is not fixed
             logging.Debugf("the IP reservation: %s", ipReservation)
             if ipReservation.PodRef == "" {
                 _ = logging.Errorf("pod ref missing for Allocations: %s", ipReservation)
                 continue
             }
             if !rl.isPodAlive(ipReservation.PodRef, ipReservation.IP.String()) {
                 logging.Debugf("pod ref %s is not listed in the live pods list", ipReservation.PodRef)
                 orphanIP.Allocations = append(orphanIP.Allocations, ipReservation)
             }
         }
         if len(orphanIP.Allocations) > 0 {
             rl.orphanedIPs = append(rl.orphanedIPs, orphanIP)
         }
     }

     return nil
}
  3. When the reclamation is actually executed, orphanedIP.Pool.Allocations() is called again and may now return [18279, 18278] (the order is still not fixed), so allocate.IterateForDeallocation ends up reclaiming 18279, the address the live pod is currently using (a standalone sketch of this follows the code below):
func (rl ReconcileLooper) ReconcileIPPools(ctx context.Context) ([]net.IP, error) {
     matchByPodRef := func(reservations []types.IPReservation, podRef string) int {
         foundidx := -1
         for idx, v := range reservations {
             if v.PodRef == podRef {
                  return idx
             }
         }
         return foundidx
     }

     var err error
     var totalCleanedUpIps []net.IP
     for _, orphanedIP := range rl.orphanedIPs {
         currentIPReservations := orphanedIP.Pool.Allocations()//The order of the returned list is not fixed
         podRefsToDeallocate := findOutPodRefsToDeallocateIPsFrom(orphanedIP)
         var deallocatedIP net.IP
         for _, podRef := range podRefsToDeallocate {
             currentIPReservations, deallocatedIP, err = allocate.IterateForDeallocation(currentIPReservations, podRef, matchByPodRef)//may cause the correct address to be recycled
             if err != nil {
                 return nil, err
             }
         }

         logging.Debugf("Going to update the reserve list to: %+v", currentIPReservations)
         if err := orphanedIP.Pool.Update(ctx, currentIPReservations); err != nil {
             logging.Errorf("failed to update the reservation list: %v", err)
             continue
         }
         totalCleanedUpIps = append(totalCleanedUpIps, deallocatedIP)
     }

     return totalCleanedUpIps, nil
}
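
To make the failure mode concrete, here is a minimal standalone sketch (hypothetical reservation type and values, not whereabouts code) showing that matching by podref alone deallocates whichever reservation happens to come first when two reservations share the same podref:

// Hypothetical simulation of the match-by-podref logic above.
package main

import "fmt"

type reservation struct {
	Offset string
	IP     string
	PodRef string
}

// matchByPodRef mirrors the closure in ReconcileIPPools: it returns the index of
// the first reservation whose PodRef matches, without looking at the IP at all.
func matchByPodRef(reservations []reservation, podRef string) int {
	for idx, r := range reservations {
		if r.PodRef == podRef {
			return idx
		}
	}
	return -1
}

func main() {
	// This time the nondeterministic ordering puts the live reservation first.
	reservations := []reservation{
		{Offset: "18279", IP: "10.0.0.79", PodRef: "xxx/pod-1"}, // in use by the recreated pod
		{Offset: "18278", IP: "10.0.0.78", PodRef: "xxx/pod-1"}, // leaked by the powered-off node
	}
	idx := matchByPodRef(reservations, "xxx/pod-1")
	fmt.Printf("deallocating offset %s (IP %s), which the live pod is still using\n",
		reservations[idx].Offset, reservations[idx].IP)
}

Had the stale reservation come first, the cleanup would have been correct; the outcome depends entirely on the unspecified iteration order, which would explain why the problem is intermittent.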

This looks interesting, but I don't understand the code well enough to have a strong opinion on whether this is the cause. It does suggest some ideas for reproducing the issue, though.

I would also like to mention that we follow the standard practice of running a DaemonSet that installs the CNI and keeps a reconciler/garbage collector running.

We are using Multus to give each pod several interfaces. The primary interface uses the Flannel CNI and works fine. The other two use a macvlan CNI with IPs assigned by whereabouts. The ranges those IPs are assigned from do not overlap, and since we consistently use the same NetworkAttachmentDefinitions, we don't have many different ranges in play anyway.
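
For illustration only (the real NetworkAttachmentDefinition is proprietary), the macvlan attachments look roughly like the following CNI config; the master interface and range here are placeholders:

{
  "cniVersion": "0.3.1",
  "type": "macvlan",
  "master": "eth0",
  "mode": "bridge",
  "ipam": {
    "type": "whereabouts",
    "range": "192.168.2.0/24"
  }
}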

And we are seeing a number of odd log messages, though it's not clear that any particular one is related. One log message we see is this:

E1211 15:41:37.094624 24 leaderelection.go:330] error retrieving resource lock /whereabouts: an empty namespace may not be set when a resource name is provided