akuity/kargo

Removing `desiredCommitFromStep: commit` from argocd-update causes infinite sync loop

Closed this issue · 18 comments

Description

I have multiple Stages that are committing to the same branch (e.g. HEAD). Per @hiddeco suggestion, I removed the desiredCommitFromStep: commit option, since a stage writing to the same HEAD would cause the other to have an broken heart

        - uses: argocd-update
          config:
            apps:
              - name: kargo-helm-ff-dev
                sources:
                # desiredCommitFromStep: commit     # << removing this causes an infinite sync loop

However, when syncing, Kargo went into an infinite loop sync-ing the application over and over again

Screenshots

image

Steps to Reproduce

kind: Stage
metadata:
  name: dev
  namespace: kargo-helm-ff
  labels:
    kargo.akuity.io/shard: kargo-agent
spec:
  shard: kargo-agent
  requestedFreight:
    - origin:
        kind: Warehouse
        name: guestbook
      sources:
        direct: true
    - origin:
        kind: Warehouse
        name: features
      sources:
        direct: true
  promotionTemplate:
    spec:
      steps:
        - uses: git-clone
          config:
            checkout:
              - fromFreight: true
                path: ./src
              - branch: main
                path: ./out
            repoURL: https://github.com/dhpup/kargo-helm-ff.git
        - uses: helm-update-image
          as: update-image
          config:
            images:
              - image: ghcr.io/dhpup/guestbook
                key: image.tag
                value: Tag
            path: ./out/env/dev/values.yaml
        - uses: copy
          config:
            inPath: ./src/base/feature-flags.yaml
            outPath: ./out/env/dev/feature-flags.yaml
        - uses: git-commit
          as: commit
          config:
            messageFromSteps:
              - update-image
            path: ./out
        - uses: git-push
          config:
            path: ./out
        - uses: argocd-update
          config:
            apps:
              - name: kargo-helm-ff-dev
                sources:
                  - repoURL: https://github.com/dhpup/kargo-helm-ff.git

Version

v1.0.3

Logs

Paste any relevant application logs here.

This actually matches the behavior described in the docs:

https://docs.kargo.io/references/promotion-steps#argocd-update

If this is left undefined, the desired revision will be determined by Freight (if possible).

Typically, you'd let Kargo figure this out all on its own, but the field is useful for the scenario where you've written back to Git and the commit you ought to be synced to therefor doesn't match the commit that's in the freight. (In the legacy promo mechanisms world, this is what "health check commit" was for.)

So by not specifying that, Kargo is trying to figure out what the App should be synced to all on its own (and probably coming up with the wrong answer).

The plot thickens: desiredCommitFromStep is being deprecated in 1.1 and removed in 1.2. It is not needed because there is a new desiredRevision field and you can set its value however you like using an expression. (You can see that desiredCommitFromStep was always a stop-gap to hold us over until we had EL support.)

This being said, we will now need to grapple with how we want undefined desiredRevision to be handled.

  1. Intuitive: If it's left blank, set no expectation for what the App should be synced to But this is a breaking change in behavior. That said, since there's already a small, advertised-in-advance breakage going on here (removal of desiredCommitFromStep) perhaps the timing is right for this other breakage?

  2. Non-breaking: Add a separate boolean field that can be set to indicate you don't care about verifying what revision the App is synced to.

I think both are easily doable, so @jessesuen just let me know which direction you'd prefer to head in.

Intuitive: If it's left blank, set no expectation for what the App should be synced to

This is the behavior we were after during the meeting (and what I remembered to be possible at some point).

@hiddeco - checked logs per your request but nothing interesting:

time="2024-11-18T14:43:21Z" level=info msg="Starting Controller" controller=promotion controllerGroup=kargo.akuity.io controllerKind=Promotion
time="2024-11-18T14:43:21Z" level=info msg="Starting workers" controller=warehouse controllerGroup=kargo.akuity.io controllerKind=Warehouse worker count=1
time="2024-11-18T14:43:21Z" level=info msg="Starting workers" controller=stage controllerGroup=kargo.akuity.io controllerKind=Stage worker count=1
time="2024-11-18T14:43:21Z" level=info msg="Starting workers" controller=promotion controllerGroup=kargo.akuity.io controllerKind=Promotion worker count=1
time="2024-11-18T14:48:07Z" level=info msg="began promotion" freight=21ba6e3806d2fd4e246a1732ccc9776a48d17228 namespace=kargo-helm-ff promotion=dev.01jczvxh65kb0wkscmygtfjy8y.21ba6e3 stage=dev
time="2024-11-18T14:48:12Z" level=info msg=promotion freight=21ba6e3806d2fd4e246a1732ccc9776a48d17228 namespace=kargo-helm-ff phase="\"Succeeded\"" promotion=dev.01jczvxh65kb0wkscmygtfjy8y.21ba6e3 stage=dev
time="2024-11-18T16:18:50Z" level=info msg="began promotion" freight=93cfe8602013c108db2863e6f854c38717f9d666 namespace=kargo-helm-ff promotion=staging.01jd013macepb5qnez6syhqrq6.93cfe86 stage=staging
time="2024-11-18T16:18:52Z" level=info msg="began promotion" freight=93cfe8602013c108db2863e6f854c38717f9d666 namespace=kargo-helm-ff promotion=prod.01jd013pera23sj3rf994fn6nw.93cfe86 stage=prod
time="2024-11-18T16:18:54Z" level=info msg=promotion freight=93cfe8602013c108db2863e6f854c38717f9d666 namespace=kargo-helm-ff phase="\"Succeeded\"" promotion=staging.01jd013macepb5qnez6syhqrq6.93cfe86 stage=staging
time="2024-11-18T16:18:55Z" level=info msg=promotion freight=93cfe8602013c108db2863e6f854c38717f9d666 namespace=kargo-helm-ff phase="\"Succeeded\"" promotion=prod.01jd013pera23sj3rf994fn6nw.93cfe86 stage=prod
time="2024-11-18T16:19:22Z" level=info msg="began promotion" freight=826f87b0d40b68d9ff8234f91862788929381e9a namespace=kargo-helm-ff promotion=dev.01jd014k59p2617kdza6a9rdn7.826f87b stage=dev
time="2024-11-18T16:19:26Z" level=info msg=promotion freight=826f87b0d40b68d9ff8234f91862788929381e9a namespace=kargo-helm-ff phase="\"Succeeded\"" promotion=dev.01jd014k59p2617kdza6a9rdn7.826f87b stage=dev
time="2024-11-18T16:21:47Z" level=info msg="began promotion" freight=826f87b0d40b68d9ff8234f91862788929381e9a namespace=kargo-helm-ff promotion=dev.01jd01918f2srjjczx52vmzcvb.826f87b stage=dev
time="2024-11-18T16:21:49Z" level=info msg=promotion freight=826f87b0d40b68d9ff8234f91862788929381e9a namespace=kargo-helm-ff phase="\"Succeeded\"" promotion=dev.01jd01918f2srjjczx52vmzcvb.826f87b stage=dev
time="2024-11-18T16:40:52Z" level=info msg="began promotion" freight=21ba6e3806d2fd4e246a1732ccc9776a48d17228 namespace=kargo-helm-ff promotion=dev.01jd02byj4g79417wgybwzh1j3.21ba6e3 stage=dev
time="2024-11-18T16:42:00Z" level=info msg="terminating Promotion" namespace=kargo-helm-ff promotion=dev.01jd02byj4g79417wgybwzh1j3.21ba6e3

Intuitive: If it's left blank, set no expectation for what the App should be synced to

We should make this the behavior

Hello,
I also have the infinite loop problem with all my pipelines.
Here is an example of one of my stages:

      steps:
        - config:
            checkout:
              - branch: main
                path: ./src
            repoURL: https://github.com/test/test-helm.git
          uses: git-clone
        - as: update-image
          config:
            images:
              - image: europe-docker.pkg.dev/test-001/test/scim
                key: image.tag
                value: Tag
            path: ./src/helm/scim/values-scim-staging-image.yaml
          uses: helm-update-image
        - as: commit
          config:
            messageFromSteps:
              - update-image
            path: ./src
          uses: git-commit
        - config:
            path: ./src
          uses: git-push
        - config:
            apps:
              - name: scim-staging
                sources:
                  - desiredCommitFromStep: commit
                    repoURL: https://github.com/test/test-helm.git
          uses: argocd-update

@stephaneetje what branch is the ArgoCD Application configured to sync to?

It's on 'main' branch, here is my appset:

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: scim
  annotations:
spec:
  generators:
  - list:
      elements:
      - env: scim-integ
        url: {{ .Values.cluster.test.url }}
        project: scim-integ
        stage: integ
        targetRevision: main
      - env: scim-staging
        url: {{ .Values.cluster.testProd.url }}
        project: scim-staging
        stage: staging
        targetRevision: main
      - env: scim-prod
        url: {{ .Values.cluster.testProd.url }}
        project: scim-prod
        stage: prod
        targetRevision: main
  template:
    metadata:
      name: '{{`{{env}}`}}'
      annotations:
        kargo.akuity.io/authorized-stage: scim:{{`{{stage}}`}}
      labels:
        argocd.argoproj.io/instance: '{{`{{env}}`}}'
    spec:
      source:
        repoURL: {{ .Values.spec.source.githubappRepoURL }}
        targetRevision: '{{`{{targetRevision}}`}}'
        path: helm/scim
        helm:
          valueFiles:
          - 'secrets://values-{{`{{env}}`}}.yaml'
          - 'values-{{`{{env}}`}}-image.yaml'
      project: '{{`{{project}}`}}'
      syncPolicy:
        syncOptions:
          - CreateNamespace=true
      destination:
        server: '{{`{{url}}`}}'
        namespace: '{{`{{env}}`}}'

And here is my warehouse:

apiVersion: kargo.akuity.io/v1alpha1
kind: Warehouse
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "3"
    kargo.akuity.io/color: black
  name: scim-prod
  namespace: scim
spec:
  subscriptions:
  - image:
      imageSelectionStrategy: SemVer
      repoURL: europe-docker.pkg.dev/test-001/test/scim
      discoveryLimit: 10
      allowTags: ^v(\d+\.)?(\d+\.)?(\*|\d+)$

I've also tried adding updateTargetRevision: false , didn't change anything.
Capture d’écran 2024-11-19 à 16 54 57

@stephaneetje it's not generally a good idea to write back to a branch you subscribe. Unless you've taken great care with other settings, you're asking for a loop.

Similarly, you need to take great care if you want to sync two Apps to the same branch.

If you have Stage A that puts main in state 1 and syncs App A, then Stage A will be expecting App A to be synced to state 1. If Stage B then puts main in state 2 and syncs App B, then Stage B will be expecting A to be synced to state 2. If App A now syncs again for any other reason, Stage A will notice that App A is now synced to state 2, which defies its expectations and will make things appear unhealthy.

Our usual recommendation is that all Stages have their own branch. If you insist on deviating from that, you need to explore other settings that will make that more tenable, for instance, using path filters to prevent Warehouses from picking up changes written by a Stage and/or using updateTargetRevision: true (not false) so that Apps don't point at the head of a branch, but point always at a specific revision. (But I don't generally recommend that, because "what revision should I currently be pointed at" isn't stored anywhere in Git, which makes that scenario "not real gitops."

Stage per branch is the surefire way to avoid problems.

@krancour On kargo-simple repo there are exemples using the main branch, is that outdated ?
https://github.com/akuity/kargo-simple/blob/main/kargo/stages.yaml

On kargo-simple repo there are exemples using the main branch, is that outdated ?

It's not, but it's also not consistent with what we usually tell people... going to have to look into that.

@hiddeco and @jessesuen any insight? Shouldn't our "simple" example probably use Stage-specific branches since that's our common recommendation?

On kargo-simple repo there are exemples using the main branch, is that outdated ?

It's not, but it's also not consistent with what we usually tell people... going to have to look into that.

@hiddeco and @jessesuen any insight? Shouldn't our "simple" example probably use Stage-specific branches since that's our common recommendation?

So you are saying we can't use a monorepo argocd targeting HEAD or main with Kargo ?
Many argocd exemples (including official ones) target HEAD for every apps.
i don't want the repo to have hundreds of branch (apps*envs = ...rapidly growing number + must push changes to each branches of an app for each changes...).

Also, i can't find out why it was working one week ago, this problem appeared few days ago (i couldn't link it with an argocd or kargo upgrade)

I have to add, it is widly said we should NOT use branch per env for gitops : https://codefresh.io/blog/stop-using-branches-deploying-different-gitops-environments/

So you are saying we can't use a monorepo argocd targeting HEAD or main with Kargo ?

tldr; What I am saying is Kargo is merely a tool for implementing user-defined promotion processes. If your process doesn't make sense, Kargo won't save you.

Many argocd exemples (including official ones) target HEAD for every apps.

Multiple apps all tracking the head of a single branch can make perfect sense in a world that doesn't include Kargo. That's a simple world where you are taking direct responsibility for the one branch that is the source of truth for everything.

But things weren't perfect in that world, either.

I'll assume there is some kind of config management tool like Kustomize or Helm at play. If you want to make a change to dev, then qa, then prod, you make a dev-specific change to the config. When you're satisfied with the result, you make a qa-specific change. When you're satisfied with the result, you might apply the same change to prod-specific config -- or you might revert the dev-specific and qa-specific changes and apply the change instead to the common or "base" configuration used by all environments. Three commits to move one change from one end of the pipeline to the other. Hopefully you've automated all that, but it's not easy.

Now if you were asked to return dev to the state it was in last Monday, you might have your work cut out for you. If you're very lucky, you might only have to revert a few dev-specific changes that were made since that time, but if configuration shared by multiple environments also changed in that interval, and because all your Apps are syncing to the head of the same branch, you'll have to figure out how to selectively revert those changes for dev only while not affecting any other environment. That's a highly manual and error-prone exercise.

So before even bringing Kargo into the picture, the cracks in that approach are already starting to show. Let's be honest with ourselves -- gitops has always been tedious.

Continuing to leave Kargo out of this, what would improve these processes? Without saying "branch" let's consider the idea of separate storage of some kind for each environment's complete config (common config + env-specific config).

If you want to make a change to dev, then qa, then prod, you make a single change to the common configuration used by all environments. To get it into dev, you overwrite everything in dev's storage with whatever's at the head of main (or a subset of it, like common config + dev-specific config). To get it into qa, it's the same thing. To get it into prod, again, the same thing.

Now going in the other direction: You are asked to return "dev" to the state it was in last Monday. You grab whatever commit was at the head of main on Monday and apply the same process described above for moving "forward" in order to move "backward" instead. Overwrite dev's storage with whatever was at the head of main last Monday. No other environment is affected and you didn't have to put any effort at all into figuring out how to roll dev back without affecting the other environments.

The key point here is that having dedicated storage of some kind for each environment allows each environment to easily be moved into any new or previous state without any possibility of impact to other environments.

What would be our options now for this environment-specific storage?

  • Argo CD's looking at using OCI for this.
  • If Argo CD supported it, the storage could be any object or block storage. (Who cares? 🤷‍♂️)
  • Git-based options are the most accessible ones at present:
    • Dedicated directories within one branch. This could be a separate branch from main. It could even be in a separate repo.
    • Dedicated branches

Now let's finally insert Kargo into the conversation. Whether moving "forward" or "backward" Kargo takes the desired state (a specific commit from main, for example, as specified by the Freight being promoted) and applies some user-defined process to it. That's typically going to look like cloning the repo, checking out the commit in question, applying some configuration management tool, and writing its results somewhere. So now we're back to grappling with all of the above...

You can write back to whatever branch it was you were subscribed to, but as I said, when you do that, you need to take great care to prevent the formation of a loop. That's on you. Your ability to get stuck in an infinite loop is not a bug in your programming language; it's a flaw in your code.

Coming back to our recommendation of environment-specific branches: Writing to env-specific branches is not so fundamentally different from writing to env-specific directories in one branch, but if anything, it comes with a few perks:

  • Avoid any kind of loop for "free."
  • Every App can sync to the head of the applicable env-specific branch, but you (or Kargo) can verify that the exact revision it is synced to is the exact desired revision.
  • You can apply branch protections. Want to limit who can merge a PR (opened by Kargo) against branches named like *-prod? No problem at all.

I have to add, it is widly said we should NOT use branch per env for gitops : https://codefresh.io/blog/stop-using-branches-deploying-different-gitops-environments/

Apart from the fact that the advice comes from a competitor, I consider this to be a post that has not aged well. I also don't agree with the premise upon which all the arguments are predicated. This looks like what people do with code and does not at all resemble the branches-as-storage model I have described:

branch-per-environment-1-1024x731

i don't want the repo to have hundreds of branch (apps*envs = ...rapidly growing number + must push changes to each branches of an app for each changes...).

Many people feel this way, and in the absence of something like Kargo managing it all for me, I think I might feel that way as well. But when viewing branches as storage, this statement becomes comparable to saying "I don't want too many buckets in my S3 account."

I know this has been a long-winded reply, so thanks for hanging in there. In summary, I never meant to promote the notion that Kargo can't work as you had been hoping to use it, but it's only a tool for implementing user-defined promotion processes and it won't save you from any flaws inherent in those processes. If anything, it may force previously undetected process flaws to the surface. The branch per env pattern is one we promote only because in our experience it works particularly well with Kargo (and in general).

@hiddeco and @jessesuen any insight? Shouldn't our "simple" example probably use Stage-specific branches since that's our common recommendation?

I did not want to introduce branch rendering into the simple example (which is arguably a more advanced technique). So as soon as this bug is fixed, I will update kargo-simple to remove desiredCommitFromStep: commit.

I have to add, it is widly said we should NOT use branch per env for gitops : https://codefresh.io/blog/stop-using-branches-deploying-different-gitops-environments/

This is actually talking about a different problem. I agree that no one should be doing git flow with Kubernetes manifests. In other words, don't manually maintain manifests in an env-specific branches, and merge changes to and from main and env branches, which is what I think the blog is also against.

Kent explained it in detail, but what we propose is different: env branches should always be a direct reflection of main, just in it's fully rendered form. No human should ever be responsible for updating these env branch, but rather automation (e.g. kargo, kargo-render). If the env specific branch is nothing but just a rendered output of main, then there is no "drift" to be concerned about. There are other benefits to this pattern (branch protection, eliminating obfuscation, argo cd performance), which you can read about here: https://akuity.io/blog/the-rendered-manifests-pattern. But it's not just us, argo cd hydrator feature is another implementation of the pattern: argoproj/argo-cd#17755.

In any case, using rendered branches is not for everyone, and so it's important that we fix this bug so that Kargo will work for Argo CD users who deploy everything from HEAD.

@hiddeco and @jessesuen any insight? Shouldn't our "simple" example probably use Stage-specific branches since that's our common recommendation?

I did not want to introduce branch rendering into the simple example (which is arguably a more advanced technique). So as soon as this bug is fixed, I will update kargo-simple to remove desiredCommitFromStep: commit.

I have to add, it is widly said we should NOT use branch per env for gitops : https://codefresh.io/blog/stop-using-branches-deploying-different-gitops-environments/

This is actually talking about a different problem. I agree that no one should be doing git flow with Kubernetes manifests. In other words, don't manually maintain manifests in an env-specific branches, and merge changes to and from main and env branches, which is what I think the blog is also against.

Kent explained it in detail, but what we propose is different: env branches should always be a direct reflection of main, just in it's fully rendered form. No human should ever be responsible for updating these env branch, but rather automation (e.g. kargo, kargo-render). If the env specific branch is nothing but just a rendered output of main, then there is no "drift" to be concerned about. There are other benefits to this pattern (branch protection, eliminating obfuscation, argo cd performance), which you can read about here: https://akuity.io/blog/the-rendered-manifests-pattern. But it's not just us, argo cd hydrator feature is another implementation of the pattern: argoproj/argo-cd#17755.

In any case, using rendered branches is not for everyone, and so it's important that we fix this bug so that Kargo will work for Argo CD users who deploy everything from HEAD.

Thanks for the detailed replies.

The rendered pattern isn't fitting for us.
Also, you seem to successfully use Kargo targeting HEAD (while i don't). You say it loops only when desiredCommitFromStep isn't filled, but i get the loop either it is set or not. I would like to understand what you are doing that i'm not to make it work.

We just want a very simple task to be executed : update image tag in the values (helm-image-update) and then sync that commit. I wouldn't expect that to be that hard.

The rendered pattern isn't fitting for us.

Although they often do go hand-in-hand and complement each other well, pre-rendering to plain YAML and using env-specific branches as a sort of storage are not inseparable and I think this is too often overlooked.

Your promotion process can simply copy the latest common + env-specific config from main to the env-specific branch.

We just want a very simple task to be executed : update image tag in the values (helm-image-update) and then sync that commit. I wouldn't expect that to be that hard.

It occurs to me that perhaps you're using more of Kargo than you really meant to. You don't need to subscribe your Warehouses to your git repos. You can just subscribe to images. Your Freight would contain only images and not commits. Moving such Freight through the pipeline involves writing to your git repo (main if you want!), but because you're not also subscribing to the repo, there is no worry of accidentally creating a loop. Further, with no commit information in the Freight, it would also prevent the argocd-update step (or the health checks they create) from having any notion that Apps ought to be observably synced to some specific commit.

In the configuration I've just described, you'd be using Kargo in a manner akin to "Argo Image Update on Steroids." What you've attempted up to this point, by comparison, is a much more sophisticated setup than I think you had likely intended.

The rendered pattern isn't fitting for us.

Although they often do go hand-in-hand and complement each other well, pre-rendering to plain YAML and using env-specific branches as a sort of storage are not inseparable and I think this is too often overlooked.

Your promotion process can simply copy the latest common + env-specific config from main to the env-specific branch.

We just want a very simple task to be executed : update image tag in the values (helm-image-update) and then sync that commit. I wouldn't expect that to be that hard.

It occurs to me that perhaps you're using more of Kargo than you really meant to. You don't need to subscribe your Warehouses to your git repos. You can just subscribe to images. Your Freight would contain only images and not commits. Moving such Freight through the pipeline involves writing to your git repo (main if you want!), but because you're not also subscribing to the repo, there is no worry of accidentally creating a loop. Further, with no commit information in the Freight, it would also prevent the argocd-update step (or the health checks they create) from having any notion that Apps ought to be observably synced to some specific commit.

In the configuration I've just described, you'd be using Kargo in a manner akin to "Argo Image Update on Steroids." What you've attempted up to this point, by comparison, is a much more sophisticated setup than I think you had likely intended.

I'm not sure i understood correctly. Are you saying i should skip the helm-image-update + git commit, to just argocd-update with helm parameters ? That would not be a solution as i still need it to be pushed on the repo (need the repo to be the source of truth) to avoid outofsync...
Furthermore, there is no gitrepos subscribed in my warehouse, so not sure i got it right.

apiVersion: kargo.akuity.io/v1alpha1
kind: Warehouse
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "3"
    kargo.akuity.io/color: black
  name: scim-prod
  namespace: scim
spec:
  subscriptions:
  - image:
      imageSelectionStrategy: SemVer
      repoURL: europe-docker.pkg.dev/test-01/test/scim
      discoveryLimit: 10
      allowTags: ^v(\d+\.)?(\d+\.)?(\*|\d+)$

@stephaneetje do what you're doing now, but drop your warehouse's subscription to the git repo.

If you need more assistance, please open a separate discussion. This issue was about a specific and narrowly scoped improvement and we've since gotten quite off topic.

@stephaneetje do what you're doing now, but drop your warehouse's subscription to the git repo.

If you need more assistance, please open a separate discussion. This issue was about a specific and narrowly scoped improvement and we've since gotten quite off topic.

As i said, there is no git repo subscription in my warehouse, only image repo (i pasted the warehouse in the previous answer)

Edit: Opened #2968