argoproj/gitops-engine

syncContext.runTasks doesn't wait for pruning to complete and creates the same resource again

We recently encountered an issue when using tigera-operator that ends in an infinite sync loop:

time log
2023-08-13T20:00:30Z Adding resource result, status: 'SyncFailed', phase: '', message: 'the server is currently unable to handle the request'
2023-08-13T20:00:30Z Adding resource result, status: 'SyncFailed', phase: '', message: 'the server is currently unable to handle the request'
2023-08-13T20:00:40Z Adding resource result, status: 'SyncFailed', phase: '', message: 'the server is currently unable to handle the request'  
2023-08-13T20:00:40Z Adding resource result, status: 'SyncFailed', phase: '', message: 'the server is currently unable to handle the request'
2023-08-13T20:01:02Z Adding resource result, status: 'Pruned', phase: 'Succeeded', message: 'pruned'
2023-08-13T20:01:02Z Adding resource result, status: 'Synced', phase: 'Running', message: 'installation.operator.tigera.io/default serverside-applied'
2023-08-13T20:01:02Z Adding resource result, status: 'Synced', phase: 'Running', message: 'felixconfiguration.projectcalico.org/default serverside-applied. Warning: Detected changes to resource default which is currently being deleted.'

After pruning the CR felixconfiguration.projectcalico.org/default, the controller tries to sync that same CR again. However, the CR is still pending deletion at that point, so the API server returns a warning. The controller then assumes the sync succeeded, even though the CR is actually removed a short while later. As a result it tries to sync over and over again and keeps repeating this pattern:

time log
2023-08-13T20:02:09Z Adding resource result, status: 'Pruned', phase: 'Succeeded', message: 'pruned'
2023-08-13T20:02:09Z Adding resource result, status: 'Synced', phase: 'Running', message: 'felixconfiguration.projectcalico.org/default serverside-applied. Warning: Detected changes to resource default which is currently being deleted.'

This was resolved by restarting the controller itself:

time log
2023-08-14T04:41:14Z Refreshing app status (controller refresh requested), level (1)
2023-08-14T04:41:14Z Comparing app state (cluster: https://kubernetes.default.svc, namespace: calico-system)
2023-08-14T04:41:14Z getRepoObjs stats
2023-08-14T04:41:14Z Initiated automated sync to '0.1.6'
2023-08-14T04:41:14Z Initialized new operation: {&SyncOperation{Revision:0.1.6,Prune:true,DryRun:false,SyncStrategy:nil,Resources:[]SyncOperationResource{SyncOperationResource{Group:projectcalico.org,Kind:FelixConfiguration,Name:default,Namespace:,},},Source:nil,Manifests:[],SyncOptions:[ServerSideApply=true],} { true} [] {5 nil}}
2023-08-14T04:41:14Z Comparing app state (cluster: https://kubernetes.default.svc, namespace: calico-system)
2023-08-14T04:41:14Z Update successful
2023-08-14T04:41:14Z Reconciliation completed
2023-08-14T04:41:14Z getRepoObjs stats
2023-08-14T04:41:14Z Syncing
2023-08-14T04:41:14Z Tasks (dry-run)
2023-08-14T04:41:15Z Refreshing app status (controller refresh requested), level (1)
2023-08-14T04:41:15Z Comparing app state (cluster: https://kubernetes.default.svc, namespace: calico-system)
2023-08-14T04:41:15Z Updating operation state. phase: Running -> Running, message: '' -> 'one or more tasks are running'
2023-08-14T04:41:15Z Adding resource result, status: 'Synced', phase: 'Running', message: 'felixconfiguration.projectcalico.org/default serverside-applied'
2023-08-14T04:41:15Z Updating operation state. phase: Running -> Succeeded, message: 'one or more tasks are running' -> 'successfully synced (all tasks run)'
2023-08-14T04:41:15Z sync/terminate complete

This time the controller did not prune first and went straight to syncing the resource.