akuity/kargo

Infinite loop when promoting image

stephaneetje opened this issue · 25 comments

Hello,

I've been trying to use kargo to only promote image tags (using helm-image-update) but on all my stages, i get an inifnite loop at the argocd-update step.

Here is my stage:

      steps:
        - config:
            checkout:
              - branch: main
                path: ./src
            repoURL: https://github.com/test/test-helm.git
          uses: git-clone
        - as: update-image
          config:
            images:
              - image: europe-docker.pkg.dev/test-001/test/scim
                key: image.tag
                value: Tag
            path: ./src/helm/scim/values-scim-staging-image.yaml
          uses: helm-update-image
        - as: commit
          config:
            messageFromSteps:
              - update-image
            path: ./src
          uses: git-commit
        - config:
            path: ./src
          uses: git-push
        - config:
            apps:
              - name: scim-staging
                sources:
                  - desiredCommitFromStep: commit
                    repoURL: https://github.com/test/test-helm.git
          uses: argocd-update

I also tried removing desiredCommitFromStep but didn't change anything.
My warehouse is only subscribing an image repository:

apiVersion: kargo.akuity.io/v1alpha1
kind: Warehouse
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "3"
    kargo.akuity.io/color: black
  name: scim-prod
  namespace: scim
spec:
  subscriptions:
  - image:
      imageSelectionStrategy: SemVer
      repoURL: europe-docker.pkg.dev/test-01/test/scim
      discoveryLimit: 10
      allowTags: ^v(\d+\.)?(\d+\.)?(\*|\d+)$

You can see in the Argocd app's envents that it is looping
Capture d’écran 2024-11-20 à 12 38 54

Discussion has began on this issue, opening another one as asked by @krancour

@stephaneetje my apologies, as I seem to have misunderstood the nature of your loop. With the previous discussion having taken place in a thread about desiredCommit, I was misled into believing that to have been an actual factor in your problem. If you leave that field alone, as you are currently, AND if there is no commit in your Freight, then when syncing an App, Kargo has no notion of what specific revision that App should observably be synced to once complete.

This means the cause of your loop has nothing to do with a mismatch of the desired revision.

The continuous syncing is probably due to the argocd-update step syncing, then when checking to see if the sync completed, discovering that the most recently completely sync wasn't the one it asked for. It's waiting for its request to complete and seeing a sync completed, but that operation must have been initiated by someone or something else, so it tries again.

Do you happen to have auto-sync enabled on your Apps? It is possible that is what the update step is fighting a losing battle against. We do generally suggest disabling autosync oh Apps when you use Kargo. I will address it in the docs to make that more clear.

If this ends up not being the issue, enabling debug logging on the controller should provide more insight into what conditions are causing the update step to be caught in this sync loop.

I did deactivate autosync.
I already enabled DEBUG logs, but it didn't help:

time="2024-11-20T11:44:58Z" level=debug msg="enqueued Promotion for reconciliation" app=scim-staging namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a
time="2024-11-20T11:44:58Z" level=debug msg="reconciling Promotion" namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a
time="2024-11-20T11:44:58Z" level=debug msg="continuing Promotion" freight=176d90ae395fe87b5d0117b96ce24cf135601a11 namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a stage=staging
time="2024-11-20T11:44:58Z" level=debug msg="executing argocd-update promotion step" freight=176d90ae395fe87b5d0117b96ce24cf135601a11 namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a stage=staging
time="2024-11-20T11:44:58Z" level=debug msg="patched Argo CD Application" app=scim-staging freight=176d90ae395fe87b5d0117b96ce24cf135601a11 namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a stage=staging
time="2024-11-20T11:44:58Z" level=debug msg="done executing argocd-update promotion step" freight=176d90ae395fe87b5d0117b96ce24cf135601a11 namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a stage=staging
time="2024-11-20T11:44:58Z" level=debug msg=promotion freight=176d90ae395fe87b5d0117b96ce24cf135601a11 namespace=scim phase="\"Running\"" promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a stage=staging targetFreight=176d90ae395fe87b5d0117b96ce24cf135601a11
time="2024-11-20T11:45:35Z" level=debug msg="enqueued Promotion for reconciliation" app=scim-staging namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a
time="2024-11-20T11:45:35Z" level=debug msg="reconciling Promotion" namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a
time="2024-11-20T11:45:35Z" level=debug msg="continuing Promotion" freight=176d90ae395fe87b5d0117b96ce24cf135601a11 namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a stage=staging
time="2024-11-20T11:45:35Z" level=debug msg="executing argocd-update promotion step" freight=176d90ae395fe87b5d0117b96ce24cf135601a11 namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a stage=staging
time="2024-11-20T11:45:35Z" level=debug msg="patched Argo CD Application" app=scim-staging freight=176d90ae395fe87b5d0117b96ce24cf135601a11 namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a stage=staging
time="2024-11-20T11:45:35Z" level=debug msg="done executing argocd-update promotion step" freight=176d90ae395fe87b5d0117b96ce24cf135601a11 namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a stage=staging
time="2024-11-20T11:45:35Z" level=debug msg=promotion freight=176d90ae395fe87b5d0117b96ce24cf135601a11 namespace=scim phase="\"Running\"" promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a stage=staging targetFreight=176d90ae395fe87b5d0117b96ce24cf135601a11
time="2024-11-20T11:45:50Z" level=debug msg="enqueued Promotion for reconciliation" app=scim-staging namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a
time="2024-11-20T11:45:50Z" level=debug msg="reconciling Promotion" namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a
time="2024-11-20T11:45:50Z" level=debug msg="continuing Promotion" freight=176d90ae395fe87b5d0117b96ce24cf135601a11 namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a stage=staging
time="2024-11-20T11:45:50Z" level=debug msg="executing argocd-update promotion step" freight=176d90ae395fe87b5d0117b96ce24cf135601a11 namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a stage=staging
time="2024-11-20T11:45:50Z" level=debug msg="patched Argo CD Application" app=scim-staging freight=176d90ae395fe87b5d0117b96ce24cf135601a11 namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a stage=staging
time="2024-11-20T11:45:50Z" level=debug msg="done executing argocd-update promotion step" freight=176d90ae395fe87b5d0117b96ce24cf135601a11 namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a stage=staging
time="2024-11-20T11:45:50Z" level=debug msg=promotion freight=176d90ae395fe87b5d0117b96ce24cf135601a11 namespace=scim phase="\"Running\"" promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a stage=staging targetFreight=176d90ae395fe87b5d0117b96ce24cf135601a11

EDIT:
I verified and sync is comming from Kargo (INITIATED BY: automated sync policy)

Capture d’écran 2024-11-20 à 14 30 50

I verified and sync is comming from Kargo

Oh. I had no doubt of that.

The mystery is why Kargo is syncing over and over again -- which would usually be because:

  1. It initiates a sync operation
  2. It follows up on the status of the last operation
  3. Finds something unsatisfactory about the status of the last operation
  4. Tries again

I'm trying to piece together what's going wrong at step 3 and it was conjecture that something besides Kargo might (additionally) be initiating sync operations.

I am attempting to replicate this.

Running a build with #2936 in it should make it much easier to trace the reason without having to try to reproduce it.

Also, I have tried quite hard to replicate this.

I have 13 Projects x 3 Stages per project, each with an associated Argo CD App.

All 39 Stages are reading from and writing to just one single branch.

Here's a gist of what one Project looks like:

https://gist.github.com/krancour/dafa7ded1b4db4c0683c05e7f01129e8

There are 12 more just like it.

I've been unable to trigger any kind of infinite sync loop.

I'll note the problem I did run into:

With lots and lots of Promos running concurrently, all racing to push to the same remote branch, unsurprisingly, some of those failed because in the space between a Promotion checking out the branch and pushing to it, some other Promotion pushed its own commit. Re-trying any Promotion that failed for these reasons put things back on track. This will be a problem for anyone who has many Stages making updates to a single branch and a large volume of Promotions. (Add this to the list of problems that are solved for free by embracing the notion of branches as env-specific config storage.)

Not to get too sidetracked here, but there's a limited amount we can do about the above because Promotion processes are user-defined and the actions carried out by each step are opaque to the rest of Kargo. This means there's no possibility of "try again starting from step n," but I will open a separate issue with a proposal for a "FailurePolicy" that could have, among other options, the option to start over from step 0.

Circling back to the subject at hand, as I said, I couldn't reproduce that issue, so it would be interesting @stephaneetje if you are able to identify what you and I have done differently. I attempted to replicate your configuration as best I could, but there there's clearly some consequential difference.

Running a build with #2936 in it should make it much easier to trace the reason without having to try to reproduce it.

ghcr.io/hiddeco/kargo:1.0.4-rc.66550612 built from our release-1.0 branch should log the reason for the infinite loop, if any.

I have the same problem and this is what I found in the logs:

time="2024-11-20T18:21:15Z" level=debug msg="sync result revisions [0b8559a96d3441da09b2c84f6bd85f8fa06e56d1] do not match desired revisions [62429f4247599ff1746023c29e4514e24dee322e]" freight=d82c676db847dce25b54c3328801ec10fae9592b namespace=kargo-poc-tags promotion=teststable-tags.01jd5cw76w9waj367f69rsk1ek.d82c676 stage=teststable-tags

While 0b8559 being the head of the branch (with new manifests in a separate folder) and 62429f being the commit the freight originated from.

Is it by any chance related to using the same source branch to merge the results back to?

@krancour i just tried with only the name :

      - uses: argocd-update
        config:
          apps:
            - name: scim-staging

Step passes well, but i end up with the application being synced successfully, then outofsync (image has not been updated). I checked the revision used by the sync, it is the good one with the image change.

Capture d’écran 2024-11-20 à 20 24 42 Capture d’écran 2024-11-20 à 20 24 57

I have to do another manual sync

Also kargo documentation sais that apps[].sources[].repoURL is mandatory, thats why i had it before, and because i thought that would make sure the synced commit is the one from precedent step.

Step passes well, but i end up with the application being synced successfully, then outofsync (image has not been updated). I checked the revision used by the sync, it is the good one with the image change.

This has the appearance of an Argo CD quirk/issue and not a Kargo issue.

Also kargo documentation sais that apps[].sources[].repoURL is mandatory, thats why i had it before, and because i thought that would make sure the synced commit is the one from precedent step.

Ahhhhh. This explains everything. 😄 I had not noticed that source update.

sources isn't mandatory. repoURL is mandatory on each source, if you specify any, that is. That is for updating individual sources of the Application -- which is something you have no need to do. All you want to do is trigger a sync.

When you do perform any kind of updates on a source, there's a whole additional layer of stuff that happens when checking if the sync was successful. It compares what the sources look like to what you want them to look like. The discrepancy was here (somehow?) and because of it, the sync was being re-attempted over and over again because it kept looking like it hadn't worked.

Step passes well, but i end up with the application being synced successfully, then outofsync (image has not been updated). I checked the revision used by the sync, it is the good one with the image change.

This has the appearance of an Argo CD quirk/issue and not a Kargo issue.

Also kargo documentation sais that apps[].sources[].repoURL is mandatory, thats why i had it before, and because i thought that would make sure the synced commit is the one from precedent step.

Ahhhhh. This explains everything. 😄 I had not noticed that source update.

sources isn't mandatory. repoURL is mandatory on each source, if you specify any, that is. That is for updating individual sources of the Application -- which is something you have no need to do. All you want to do is trigger a sync.

When you do perform any kind of updates on a source, there's a whole additional layer of stuff that happens when checking if the sync was successful. It compares what the sources look like to what you want them to look like. The discrepancy was here (somehow?) and because of it, the sync was being re-attempted over and over again because it kept looking like it hadn't worked.

I opened an Argocd issue, but i feel like kargo is syncing before refreshing when i look at the application avents

Capture d’écran 2024-11-21 à 09 19 14

Does Kargo wait for refresh to finish before sending a sync signal?

I opened an Argocd issue, but i feel like kargo is syncing before refreshing when i look at the application avents

Does Kargo wait for refresh to finish before sending a sync signal?

We patch the Application with a sync operation. When Argo CD syncs, it is smart about not relying on its cache. So the short answer is no. The longer answer is, "no because there's no need."

After the sync is initiated, Kargo follows up on it as many times as necessary to find evidence that the exact sync operation it initiated has since completed.

Seems like app starts showing out of sync only when sync is already running, indicating there can be some race. Idk about exact reasons though.

We patch the Application with a sync operation.

Do you mean you patch the application sync operation directly instead of calling the server endpoint to do so? That might explain what we saw when trying to repro using ArgoCD CLI and being unable to. I'm pretty sure some steps are missing to do sync properly. How hard would it be to just trigger sync via endpoint/calling the code instead of patching the application?

In particular, the observed behavior in ArgoCD UI is that Sync is clickable, shows a new duration, but shows old sync's data for a few seconds. This doesn't look intended.

This doesn't sound like a Kargo problem.

Note that #2980 should probably fix the original problem described by this thread.

@krancour, I don't think ArgoCD is expected to work correctly with surgical replacements of things that are supposed to be done by using more high-level APIs. The evidence of showing both old and new data in the Sync info for a bit only supports that, at least that Kargo doesn't update the sync operation on the app properly. But operation state seems like an internal field managed by ArgoCD, so manually updating it doesn't seem like it should be done. It's like using library's internal function.

I don't think ArgoCD is expected to work correctly with surgical replacements of things that are supposed to be done by using more high-level APIs

I'm afraid this is incorrect. When you sync an App via Argo CD API, it's merely a request to sync and is handled the same way we do it. i.e. It patches the application with a sync operation:

https://github.com/argoproj/argo-cd/blob/32cc6638f78e15652c24eb2cd6403e6e7ec26ab0/server/application/application.go#L1968-L1993

In either case, it is the Application controller that is doing all the actual work.

The original issue here should be closed by #2980.

How does Kargo patch the manifest? Can you point me to the code, please? What if ArgoCD controller changes the way it patches?

Kargo is indeed setting the .operation field in the App object. While this might seem potentially fragile, this was an intentional design choice so that Kargo could avoid interfacing with Argo CD API server entirely, which would require supplying Kargo with Argo CD auth credentials, as well as other tradeoffs.

What if ArgoCD controller changes the way it patches?

Kargo would have to adapt. We do copy the implementation:

However, I'm comfortable with this design choice, having wrote the original Argo CD code and understanding that the sync operation logic has not changed in incompatible way since basically its inception.

As Kent already pointed out, we are merely requesting the sync operation by setting the initial .operation field but the controller is the one updating that field thereafter. What Kargo is doing is pretty safe, and relatively future proof. Any future Argo CD changes we would be aware of ahead of time and prepared to deal with.

@jessesuen, the Kargo link points to ArgoCD repo. I think I've found the relevant code though

func (a *argocdUpdater) syncApplication(
.

When comparing it with SetAppOperation, I don't see an analog to clearing app.Status.OperationState https://github.com/argoproj/argo-cd/blob/32cc6638f78e15652c24eb2cd6403e6e7ec26ab0/util/argo/argo.go#L820. Maybe that what's leading to previous sync metadata being shown in a case @stephaneetje described.

I've seen different approaches for constructing the operation in server and app controller https://github.com/argoproj/argo-cd/blob/32cc6638f78e15652c24eb2cd6403e6e7ec26ab0/server/application/application.go#L1968-L1985, https://github.com/argoproj/argo-cd/blob/32cc6638f78e15652c24eb2cd6403e6e7ec26ab0/controller/appcontroller.go#L1987-L2051. Comparing with manual sync in server, Kargo doesn't set prune for example. Comparing with automated sync in controller, it doesn't do checks for already attempted syncs, tho some of it is specific to autosync's options. There are more differences, but I'm not sure they'd cause the behavior described.

Overall, I think clearing the app.Status.OperationState needs to be added. As for Prune etc., not sure if Kargo has controls for this, but should be updates as well if yes.

Comparing with manual sync in server, Kargo doesn't set prune for example.

This is correct. There are options it doesn't set. There is no reason that certain high-value options cannot be surfaced in the argocd-update step. Please open a separate issue please if there is a setting you mean to ask for.

Overall, I think clearing the app.Status.OperationState needs to be added.

It's not clear what you think this will accomplish.

Great care is taken already to verify that a completed operation is the exact operation we requested. This is accomplished by comparing the details of app.Status.OperationState (set by the Application controller) against the details of the Promotion. If it doesn't match, we request a new sync. (This is actually how the infinite loop, which was the original subject of this thread, was happening. There was a discrepancy in the details and that has been fixed.)

Here is the relevant code:

func (a *argocdUpdater) mustPerformUpdate(
stepCtx *PromotionStepContext,
update *ArgoCDAppUpdate,
app *argocd.Application,
desiredSources argocd.ApplicationSources,
) (phase argocd.OperationPhase, mustUpdate bool, err error) {
status := app.Status.OperationState
if status == nil {
// The application has no operation.
return "", true, nil
}
// Deal with the possibility that the operation was not initiated by the
// expected user.
if status.Operation.InitiatedBy.Username != applicationOperationInitiator {
// The operation was not initiated by the expected user.
if !status.Phase.Completed() {
// We should wait for the operation to complete before attempting to
// apply an update ourselves.
// NB: We return the current phase here because we want the caller
// to know that an operation is still running.
return status.Phase, false, fmt.Errorf(
"current operation was not initiated by %q and not by %q: waiting for operation to complete",
applicationOperationInitiator, status.Operation.InitiatedBy.Username,
)
}
// Initiate our own operation.
return "", true, nil
}
// Deal with the possibility that the operation was not initiated for the
// current freight collection. i.e. Not related to the current promotion.
var correctFreightColIDFound bool
for _, info := range status.Operation.Info {
if info.Name == freightCollectionInfoKey {
correctFreightColIDFound = info.Value == stepCtx.Freight.ID
break
}
}
if !correctFreightColIDFound {
// The operation was not initiated for the current freight collection.
if !status.Phase.Completed() {
// We should wait for the operation to complete before attempting to
// apply an update ourselves.
// NB: We return the current phase here because we want the caller
// to know that an operation is still running.
return status.Phase, false, fmt.Errorf(
"current operation was not initiated for freight collection %q: waiting for operation to complete",
stepCtx.Freight.ID,
)
}
// Initiate our own operation.
return "", true, nil
}
if !status.Phase.Completed() {
// The operation is still running.
return status.Phase, false, nil
}
if status.SyncResult == nil {
// We do not have a sync result, so we cannot determine if the operation
// was successful. The best recourse is to retry the operation.
return "", true, errors.New("operation completed without a sync result")
}
// Check if the desired sources were applied.
if len(update.Sources) > 0 {
if (status.SyncResult.Source.RepoURL != "" && !status.SyncResult.Source.Equals(&desiredSources[0])) ||
(status.SyncResult.Source.RepoURL == "" && !status.SyncResult.Sources.Equals(desiredSources)) {
// The operation did not result in the desired sources being applied. We
// should attempt to retry the operation.
return "", true, fmt.Errorf(
"operation result source does not match desired source",
)
}
}
// Check if the desired revisions were applied.
desiredRevisions, err := a.getDesiredRevisions(stepCtx, update, app)
if err != nil {
return "", true, fmt.Errorf("error determining desired revisions: %w", err)
}
if len(desiredRevisions) == 0 {
// We do not have any desired revisions, so we cannot determine if the
// operation was successful.
return status.Phase, false, nil
}
observedRevisions := status.SyncResult.Revisions
if len(observedRevisions) == 0 {
observedRevisions = []string{status.SyncResult.Revision}
}
for i, observedRevision := range observedRevisions {
desiredRevision := desiredRevisions[i]
if desiredRevision == "" {
continue
}
if observedRevision != desiredRevision {
return "", true, fmt.Errorf(
"sync result revisions %v do not match desired revisions %v",
observedRevisions, desiredRevisions,
)
}
}
// The operation has completed.
return status.Phase, false, nil
}

Edit: It's also not correct to set status on a resource we're not the controller for.

This is correct. There are options it doesn't set. There is no reason that certain high-value options cannot be surfaced in the argocd-update step. Please open a separate issue please if there is a setting you mean to ask for.

Here we go #2984.

It's not clear what you think this will accomplish.

Parity with how ArgoCD does things, as well as consistency of the state. If the previous sync has finished and we add a new operation, keeping old status's operation state would not reflect the operation and probably results in the UI inconsistency showing new sync started, but state of the previous sync.

I'll open a separate issue for this though.