crossplane/docs

[Web Bug] - Configuring Crossplane with Argo CD

patpicos opened this issue · 3 comments

The health check provided has a few blind spots when updating resources that were Ready in the past but have since moved to a Degraded state.
Here's a sample set of resource conditions that incorrectly returned Healthy. This happens because the provided logic shortcuts and returns Healthy as soon as it finds type: Ready with status: 'True'.

Example 1 - AWS Nodegroup

  conditions:
    - lastTransitionTime: '2024-10-10T19:44:11Z'
      reason: Available
      status: 'True'
      type: Ready
    - lastTransitionTime: '2024-11-15T14:17:38Z'
      message: >-
        update failed: async update failed: refuse to update the external
        resource because the following update requires replacing it: cannot
        change the value of the argument "capacity_type" from "SPOT" to
        "ON_DEMAND"
      reason: ReconcileError
      status: 'False'
      type: Synced
    - lastTransitionTime: '2024-11-15T14:17:38Z'
      message: >-
        async update failed: refuse to update the external resource because the
        following update requires replacing it: cannot change the value of the
        argument "capacity_type" from "SPOT" to "ON_DEMAND"
      reason: AsyncUpdateFailure
      status: 'False'
      type: LastAsyncOperation

Example 2 - EC2 LaunchTemplate

  conditions:
    - lastTransitionTime: '2024-10-10T16:12:09Z'
      reason: Available
      status: 'True'
      type: Ready
    - lastTransitionTime: '2024-11-18T19:52:41Z'
      message: >-
        cannot patch the managed resource via server-side apply: failed to
        create typed patch object (/; ec2.aws.upbound.io/v1beta1,
        Kind=LaunchTemplate): .spec.forProvider.vpcSecurityGroupIds: element 0:
        associative list without keys has an element that's an explicit null
      reason: ReconcileError
      status: 'False'
      type: Synced
    - lastTransitionTime: '2024-10-10T16:12:07Z'
      reason: Success
      status: 'True'
      type: LastAsyncOperation

To address this, I recommend not shortcutting and instead processing each condition in chronological order. Below is what I've come up with so far. Also, marking resources as Degraded while they are waiting for input from other resources seems incorrect. As such, I have softened the status when condition.type == "Synced" and condition.status == "False" to include a check on the message "cannot resolve references" (though this may fail if the message changes in the future, in which case it would revert to Degraded).
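
As a minimal sketch of the difference between the two approaches, the snippet below replays a shortened version of the Example 1 conditions through a shortcutting check and a chronological one. The function names and the "Progressing" default are assumptions made for illustration, not part of the documented health check. It also compares the timestamps as plain strings, which works here only because they are fixed-width UTC values in the same format.

```lua
-- Conditions taken from Example 1, with the messages omitted.
local conditions = {
  { type = "Ready", status = "True", reason = "Available",
    lastTransitionTime = "2024-10-10T19:44:11Z" },
  { type = "Synced", status = "False", reason = "ReconcileError",
    lastTransitionTime = "2024-11-15T14:17:38Z" },
}

-- Shortcut evaluation: returns as soon as Ready=True is seen.
local function shortcut_health(conds)
  for _, c in ipairs(conds) do
    if c.type == "Ready" and c.status == "True" then
      return "Healthy"
    end
  end
  return "Progressing" -- assumed default, for illustration only
end

-- Chronological evaluation: later transitions overwrite earlier ones.
local function chronological_health(conds)
  local sorted = {}
  for i, c in ipairs(conds) do
    sorted[i] = c -- copy so the input order is left untouched
  end
  table.sort(sorted, function(a, b)
    return a.lastTransitionTime < b.lastTransitionTime
  end)

  local status = "Progressing" -- assumed default, for illustration only
  for _, c in ipairs(sorted) do
    if c.type == "Ready" and c.status == "True" then
      status = "Healthy"
    elseif c.type == "Synced" and c.status == "False" then
      status = "Degraded"
    end
  end
  return status
end

print(shortcut_health(conditions))      -- Healthy
print(chronological_health(conditions)) -- Degraded
```

The newer Synced=False transition is what a user would expect to see, and only the chronological pass surfaces it.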

Example 3 - Softening Degraded State

status:
  atProvider: {}
  conditions:
    - lastTransitionTime: '2024-11-26T17:39:54Z'
      message: >-
        cannot resolve references: mg.Spec.ForProvider.FileSystemID: referenced
        field was empty (referenced resource may not yet be ready)
      reason: ReconcileError
      status: 'False'
      type: Synced

Lastly, the health check does not provide user feedback when a resource is paused, such as:

  Type:                  Synced
  Status:                False
  Reason:                ReconcilePaused

I believe we should return the Suspended status, as described in the Argo CD documentation:
Suspended - the resource is suspended and waiting for some external event to resume (e.g. suspended CronJob or paused Deployment)
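
The Synced handling described above can be sketched as a small standalone function. This is only an illustration under assumptions: `synced_health` is a hypothetical name, and the sample conditions are trimmed-down versions of the ones shown in this issue. It treats a missing message as an empty string and uses a plain-text `string.find`, so a paused condition without a message does not raise an error.

```lua
-- Hypothetical helper: maps a Synced condition to an Argo CD health status.
-- Returns nil when the condition is not a failing Synced condition.
local function synced_health(condition)
  if condition.type ~= "Synced" or condition.status ~= "False" then
    return nil
  end
  if condition.reason == "ReconcilePaused" then
    return "Suspended"
  end
  -- Guard against conditions that carry no message at all.
  local message = condition.message or ""
  -- The fourth argument (true) disables pattern matching: plain substring search.
  if string.find(message, "cannot resolve references", 1, true) then
    return "Progressing"
  end
  return "Degraded"
end

local paused = { type = "Synced", status = "False", reason = "ReconcilePaused" }
local waiting = { type = "Synced", status = "False", reason = "ReconcileError",
                  message = "cannot resolve references: referenced field was empty" }
local failing = { type = "Synced", status = "False", reason = "ReconcileError",
                  message = "update failed: async update failed" }

print(synced_health(paused))  -- Suspended
print(synced_health(waiting)) -- Progressing
print(synced_health(failing)) -- Degraded
```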

URL: https://docs.crossplane.io/latest/guides/crossplane-with-argo-cd/

local health_status = {}

local function contains(list, val)
  for _, v in ipairs(list) do
    if v == val then
      return true
    end
  end
  return false
end

local function to_timestamp(date_str)
  return os.time({year = string.sub(date_str, 1, 4),
                  month = string.sub(date_str, 6, 7),
                  day = string.sub(date_str, 9, 10),
                  hour = string.sub(date_str, 12, 13),
                  min = string.sub(date_str, 15, 16),
                  sec = string.sub(date_str, 18, 19),
                  isdst = false})
end

local has_no_status = {
  "ProviderConfig",
  "ProviderConfigUsage",
  "Composition",
  "CompositionRevision",
  "DeploymentRuntimeConfig",
  "ControllerConfig",
}

-- Parentheses are required: "and" binds tighter than "or" in Lua.
if (obj.status == nil or next(obj.status) == nil) and contains(has_no_status, obj.kind) then
  health_status.status = "Healthy"
  health_status.message = "Resource is up-to-date."
  return health_status
end

if obj.status == nil or next(obj.status) == nil or obj.status.conditions == nil then
  -- obj.status may be nil here, so check it before reading obj.status.users
  if obj.kind == "ProviderConfig" and obj.status ~= nil and obj.status.users ~= nil then
    health_status.status = "Healthy"
    health_status.message = "Resource is in use."
    return health_status
  end
  return health_status
end

-- Shortcut for resources with atProvider state such as repositories.argocd.crossplane.io
if obj.status.atProvider then
  if obj.status.atProvider.connectionState then
    if obj.status.atProvider.connectionState.status == "Failed" then
      health_status.status = "Degraded"
      health_status.message = obj.status.atProvider.connectionState.message
      return health_status
    end
  end
end

-- Custom sorting function based on the 'lastTransitionTime' field
if obj.status ~= nil and obj.status.conditions then
  table.sort(obj.status.conditions, function(a, b)
    local time_a = to_timestamp(a.lastTransitionTime)
    local time_b = to_timestamp(b.lastTransitionTime)
    return time_a < time_b  -- Sort in ascending order (earliest first)
  end)
end

-- Process all the conditions from oldest to newest (sorted above).
for i, condition in ipairs(obj.status.conditions) do
  if condition.type == "LastAsyncOperation" then
    if condition.status == "False" then
      health_status.status = "Degraded"
      health_status.message = condition.message
    end
  end

  if condition.type == "Synced" then
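    -- Guard against a missing message (e.g. paused resources) before matching
    if condition.status == "False" and condition.message ~= nil and string.match(condition.message, "cannot resolve references") then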
      health_status.status = "Progressing"
      health_status.message = condition.message
    elseif condition.status == "False" and condition.reason == "ReconcilePaused" then
      health_status.status = "Suspended"
      health_status.message = condition.message
    elseif condition.status == "False" then
      health_status.status = "Degraded"
      health_status.message = condition.message
    end
  end

  if contains({"Ready", "Healthy", "Offered", "Established"}, condition.type) then
    if condition.status == "True" then
      health_status.status = "Healthy"
      health_status.message = "Resource is up-to-date."
    elseif condition.status == "False" and condition.reason == "Creating" then
      health_status.status = "Progressing"
      health_status.message = condition.message
    end
  end
end
return health_status


@negz I would love your feedback on this. We discussed the health state and transitions of resources in a past thread.

I've added my examples in a repo to help demonstrate. (This is effectively a fork of the Argo CD repository, cleaned up so that it only has my test cases.) See the README:
https://github.com/patpicos/crossplane-health-checks

I've updated the logic in the original post based on a few edge cases we encountered after deploying against a broader set of resources.

  • More defensive checks
  • Exposing cases that should show the status as Suspended
  • More accurately representing when the resource is progressing or waiting for information from a previous resource before progressing (the original health check in the Crossplane documentation was erroneously showing Degraded right off the bat)
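
One way to keep defensive checks like these compact is a small lookup helper that returns nil when any intermediate table is missing. The sketch below is an assumption for illustration: `dig` is a hypothetical helper that is not part of the posted script, and the sample objects and the "connection refused" message are made up.

```lua
-- Hypothetical helper: walks nested keys and returns nil instead of raising
-- an error when an intermediate value is missing or is not a table.
local function dig(tbl, ...)
  local current = tbl
  for _, key in ipairs({...}) do
    if type(current) ~= "table" then
      return nil
    end
    current = current[key]
  end
  return current
end

-- Made-up sample objects: one without atProvider, one with a failed connection.
local no_provider = { status = { conditions = {} } }
local failed = {
  status = {
    atProvider = {
      connectionState = { status = "Failed", message = "connection refused" },
    },
  },
}

print(dig(no_provider, "status", "atProvider", "connectionState", "status")) -- nil
print(dig(failed, "status", "atProvider", "connectionState", "status"))      -- Failed
```

With a helper like this, a check such as the atProvider shortcut can be written as a single comparison without nesting three `if` statements.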