syncContext.runTasks doesn't wait for pruning to complete and creates the same resource again
ashi009 commented
We recently encountered an issue when using tigera-operator that ends up syncing in an infinite loop:
time | log |
---|---|
2023-08-13T20:00:30Z | Adding resource result, status: 'SyncFailed', phase: '', message: 'the server is currently unable to handle the request' |
2023-08-13T20:00:30Z | Adding resource result, status: 'SyncFailed', phase: '', message: 'the server is currently unable to handle the request' |
2023-08-13T20:00:40Z | Adding resource result, status: 'SyncFailed', phase: '', message: 'the server is currently unable to handle the request' |
2023-08-13T20:00:40Z | Adding resource result, status: 'SyncFailed', phase: '', message: 'the server is currently unable to handle the request' |
2023-08-13T20:01:02Z | Adding resource result, status: 'Pruned', phase: 'Succeeded', message: 'pruned' |
2023-08-13T20:01:02Z | Adding resource result, status: 'Synced', phase: 'Running', message: 'installation.operator.tigera.io/default serverside-applied' |
2023-08-13T20:01:02Z | Adding resource result, status: 'Synced', phase: 'Running', message: 'felixconfiguration.projectcalico.org/default serverside-applied. Warning: Detected changes to resource default which is currently being deleted.' |
After pruning the CR felixconfiguration.projectcalico.org/default, the controller tries to sync that CR again. However, the CR is still pending deletion, so the API server returns a warning. The controller nevertheless assumes the sync succeeded, even though the CR is actually removed a short while later. It then tries to sync over and over again, repeating this pattern:
time | log |
---|---|
2023-08-13T20:02:09Z | Adding resource result, status: 'Pruned', phase: 'Succeeded', message: 'pruned' |
2023-08-13T20:02:09Z | Adding resource result, status: 'Synced', phase: 'Running', message: 'felixconfiguration.projectcalico.org/default serverside-applied. Warning: Detected changes to resource default which is currently being deleted.' |
This was resolved by restarting the controller itself:
time | log |
---|---|
2023-08-14T04:41:14Z | Refreshing app status (controller refresh requested), level (1) |
2023-08-14T04:41:14Z | Comparing app state (cluster: https://kubernetes.default.svc, namespace: calico-system) |
2023-08-14T04:41:14Z | getRepoObjs stats |
2023-08-14T04:41:14Z | Initiated automated sync to '0.1.6' |
2023-08-14T04:41:14Z | Initialized new operation: {&SyncOperation{Revision:0.1.6,Prune:true,DryRun:false,SyncStrategy:nil,Resources:[]SyncOperationResource{SyncOperationResource{Group:projectcalico.org,Kind:FelixConfiguration,Name:default,Namespace:,},},Source:nil,Manifests:[],SyncOptions:[ServerSideApply=true],} { true} [] {5 nil}} |
2023-08-14T04:41:14Z | Comparing app state (cluster: https://kubernetes.default.svc, namespace: calico-system) |
2023-08-14T04:41:14Z | Update successful |
2023-08-14T04:41:14Z | Reconciliation completed |
2023-08-14T04:41:14Z | getRepoObjs stats |
2023-08-14T04:41:14Z | Syncing |
2023-08-14T04:41:14Z | Tasks (dry-run) |
2023-08-14T04:41:15Z | Refreshing app status (controller refresh requested), level (1) |
2023-08-14T04:41:15Z | Comparing app state (cluster: https://kubernetes.default.svc, namespace: calico-system) |
2023-08-14T04:41:15Z | Updating operation state. phase: Running -> Running, message: '' -> 'one or more tasks are running' |
2023-08-14T04:41:15Z | Adding resource result, status: 'Synced', phase: 'Running', message: 'felixconfiguration.projectcalico.org/default serverside-applied' |
2023-08-14T04:41:15Z | Updating operation state. phase: Running -> Succeeded, message: 'one or more tasks are running' -> 'successfully synced (all tasks run)' |
2023-08-14T04:41:15Z | sync/terminate complete |
This time it did not prune first and went straight to syncing the resource.