GoogleCloudPlatform/gke-autoneg-controller

Manually removed backend does not add itself back

Closed this issue · 6 comments

Hello,

After performing some tests, I've noticed that if a backend added to a backend service by the autoneg controller is manually removed, it is not added back. In other words, the controller does not continuously check that the actual state matches the desired state.

It would be desirable for the autoneg controller to periodically verify that the actual state of the backend service still matches the desired state.

Steps to reproduce:

  1. Run the autoneg controller and sync a NEG to a backend service.
  2. Manually remove the backend NEG from the backend service, and observe that the autoneg controller does not add it back.

The only way to get the autoneg controller to add the backends back is to delete the NEG services and re-create them.

There is an annotation on the service that autoneg-controller looks at to see if it's part of a backend. It looks like it takes that annotation as authoritative even if the state of the backend doesn't actually contain the NEG.

I've seen this happen where it adds the annotation, tries to add the NEG to the backend, appears to fail (the NEG isn't part of the backend), but the annotation remains and it never recovers.

@naseemkullah @rwkarg what would you think if we implemented a "continuous reconcile" where:

  1. autoneg would go through a reconcile loop every x seconds
  2. during that loop, autoneg would ensure that the cluster local NEGs were part of the backend service

Does that approach make sense?

How long should the loop interval be? I'd think something quite long, like 10m or 30m, since the loop will be a source of API quota burn and is only in place as a backstop against long-term skew.

10-15 minutes should be a good starting point. I've done a similar thing for a controller that reconciles regional LB backends and it's been working well. Every watch notification for a namespace/service resets an idle timer for that service, and the service is reconciled whenever the timer expires. Clearing the finalizer stops that periodic timer for the given namespace/service.

Another option is to apply the backend update first, then add the status annotation. I would assume that the annotation apply would be less likely to fail than adding the backend(s). The common case for the status annotation to fail is probably that there was another update to the service resource and autoneg is trying to update from a stale snapshot. In that case, it would be desirable to get the latest version of the service resource, reconcile again, and then apply the status annotation.

This would ensure that the status annotation only gets written after successfully updating backends.

I believe this has been fixed with the latest master. Could you check this?