[Web Bug] - Configuring Crossplane with Argo CD
patpicos opened this issue · 3 comments
The provided health check has a few blind spots when resources that were Ready in the past later move to a Degraded state.
Here's a sample set of resource conditions for which it incorrectly returned Healthy. This is because the provided logic short-circuits and returns Healthy as soon as it finds a condition with type: Ready and status: True.
Example 1 - AWS Nodegroup
conditions:
  - lastTransitionTime: '2024-10-10T19:44:11Z'
    reason: Available
    status: 'True'
    type: Ready
  - lastTransitionTime: '2024-11-15T14:17:38Z'
    message: >-
      update failed: async update failed: refuse to update the external
      resource because the following update requires replacing it: cannot
      change the value of the argument "capacity_type" from "SPOT" to
      "ON_DEMAND"
    reason: ReconcileError
    status: 'False'
    type: Synced
  - lastTransitionTime: '2024-11-15T14:17:38Z'
    message: >-
      async update failed: refuse to update the external resource because the
      following update requires replacing it: cannot change the value of the
      argument "capacity_type" from "SPOT" to "ON_DEMAND"
    reason: AsyncUpdateFailure
    status: 'False'
    type: LastAsyncOperation
Example 2 - EC2 LaunchTemplate
conditions:
  - lastTransitionTime: '2024-10-10T16:12:09Z'
    reason: Available
    status: 'True'
    type: Ready
  - lastTransitionTime: '2024-11-18T19:52:41Z'
    message: >-
      cannot patch the managed resource via server-side apply: failed to
      create typed patch object (/; ec2.aws.upbound.io/v1beta1,
      Kind=LaunchTemplate): .spec.forProvider.vpcSecurityGroupIds: element 0:
      associative list without keys has an element that's an explicit null
    reason: ReconcileError
    status: 'False'
    type: Synced
  - lastTransitionTime: '2024-10-10T16:12:07Z'
    reason: Success
    status: 'True'
    type: LastAsyncOperation
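To make the blind spot concrete, here is a minimal Lua sketch using the conditions from Example 2. Both functions are hypothetical: shortcut_health is a reduction of the short-circuiting behavior described above, not the documented script itself, and full_health is a simplified variant that lets a failed Synced condition override Ready instead of sorting by time.

```lua
-- Hypothetical reduction: returns on the first decisive condition it meets.
local function shortcut_health(conditions)
  for _, c in ipairs(conditions) do
    if c.type == "Ready" and c.status == "True" then
      return "Healthy" -- later conditions are never inspected
    end
    if c.type == "Synced" and c.status == "False" then
      return "Degraded"
    end
  end
  return "Progressing"
end

-- Simplified alternative: inspects every condition before deciding.
local function full_health(conditions)
  local status = "Progressing"
  for _, c in ipairs(conditions) do
    if c.type == "Ready" and c.status == "True" then
      status = "Healthy"
    end
  end
  for _, c in ipairs(conditions) do
    if c.type == "Synced" and c.status == "False" then
      status = "Degraded" -- a sync failure overrides an earlier Ready
    end
  end
  return status
end

-- Conditions from Example 2, in the order they appear on the resource.
local launch_template = {
  { type = "Ready", status = "True", lastTransitionTime = "2024-10-10T16:12:09Z" },
  { type = "Synced", status = "False", lastTransitionTime = "2024-11-18T19:52:41Z" },
  { type = "LastAsyncOperation", status = "True", lastTransitionTime = "2024-10-10T16:12:07Z" },
}

print(shortcut_health(launch_template)) -- Healthy
print(full_health(launch_template))     -- Degraded
```

Because Ready is listed first, the short-circuiting version never reaches the failed Synced condition.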
To address this, I would recommend not short-circuiting and instead processing each condition in chronological order. Below is what I've come up with so far. Also, marking resources as Degraded while they are waiting for input from other resources seems incorrect. As such, I have softened the status when condition.type == "Synced" and condition.status == "False" by checking whether the message contains "cannot resolve references", in which case the resource is reported as Progressing. (This check may stop matching if the message changes in the future, but the result would then simply revert to Degraded.)
Example 3 - Softening Degraded State
status:
  atProvider: {}
  conditions:
    - lastTransitionTime: '2024-11-26T17:39:54Z'
      message: >-
        cannot resolve references: mg.Spec.ForProvider.FileSystemID: referenced
        field was empty (referenced resource may not yet be ready)
      reason: ReconcileError
      status: 'False'
      type: Synced
Lastly, the health check gives the user no feedback when a resource is paused, for example:
Type: Synced
Status: False
Reason: ReconcilePaused
I believe we should return the Suspended status in this case, as per the Argo CD documentation:
Suspended - the resource is suspended and waiting for some external event to resume (e.g. suspended CronJob or paused Deployment)
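The two special cases for a failed Synced condition (waiting on references, and paused) can be sketched in isolation. This is a minimal illustration, not the full health check; classify_synced is a hypothetical helper name, and it assumes a paused condition may arrive without a message.

```lua
-- Hypothetical helper: maps a Synced=False condition to an Argo CD health status.
-- Returns nil for any condition that is not a failed Synced condition.
local function classify_synced(condition)
  if condition.type ~= "Synced" or condition.status ~= "False" then
    return nil
  end
  if condition.reason == "ReconcilePaused" then
    return "Suspended"
  end
  -- Treat a missing message as empty so string.find never receives nil.
  local message = condition.message or ""
  -- Plain-text search (4th argument true), so no Lua pattern characters apply.
  if string.find(message, "cannot resolve references", 1, true) then
    return "Progressing"
  end
  return "Degraded"
end

print(classify_synced({ type = "Synced", status = "False", reason = "ReconcilePaused" }))
-- Suspended
print(classify_synced({
  type = "Synced", status = "False", reason = "ReconcileError",
  message = "cannot resolve references: mg.Spec.ForProvider.FileSystemID: referenced field was empty",
}))
-- Progressing
print(classify_synced({
  type = "Synced", status = "False", reason = "ReconcileError",
  message = "update failed: async update failed",
}))
-- Degraded
```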
URL: https://docs.crossplane.io/latest/guides/crossplane-with-argo-cd/
local health_status = {}

-- Returns true if val is one of the values in list.
local function contains(list, val)
  for _, v in ipairs(list) do
    if v == val then
      return true
    end
  end
  return false
end
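-- Converts a timestamp such as '2024-10-10T19:44:11Z' into a number for sorting.
-- os.time interprets the fields as local time; that is fine here because the
-- values are only compared with each other.
local function to_timestamp(date_str)
  if type(date_str) ~= "string" then
    return 0 -- conditions without a timestamp sort first
  end
  return os.time({year = tonumber(string.sub(date_str, 1, 4)),
                  month = tonumber(string.sub(date_str, 6, 7)),
                  day = tonumber(string.sub(date_str, 9, 10)),
                  hour = tonumber(string.sub(date_str, 12, 13)),
                  min = tonumber(string.sub(date_str, 15, 16)),
                  sec = tonumber(string.sub(date_str, 18, 19)),
                  isdst = false})
end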
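-- Kinds that are not expected to report a status.
local has_no_status = {
  "ProviderConfig",
  "ProviderConfigUsage",
  "Composition",
  "CompositionRevision",
  "DeploymentRuntimeConfig",
  "ControllerConfig",
}

-- Parentheses are required: 'and' binds tighter than 'or' in Lua, so without
-- them any kind with a nil status would be reported as Healthy.
if (obj.status == nil or next(obj.status) == nil) and contains(has_no_status, obj.kind) then
  health_status.status = "Healthy"
  health_status.message = "Resource is up-to-date."
  return health_status
end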
-- No conditions to evaluate: only a ProviderConfig with users counts as Healthy.
if obj.status == nil or next(obj.status) == nil or obj.status.conditions == nil then
  if obj.kind == "ProviderConfig" and obj.status ~= nil and obj.status.users ~= nil then
    health_status.status = "Healthy"
    health_status.message = "Resource is in use."
    return health_status
  end
  return health_status
end
-- Shortcut for resources with atProvider state such as repositories.argocd.crossplane.io
if obj.status.atProvider then
  if obj.status.atProvider.connectionState then
    if obj.status.atProvider.connectionState.status == "Failed" then
      health_status.status = "Degraded"
      health_status.message = obj.status.atProvider.connectionState.message
      return health_status
    end
  end
end
-- Custom sorting function based on the 'lastTransitionTime' field
if obj.status ~= nil and obj.status.conditions then
  table.sort(obj.status.conditions, function(a, b)
    local time_a = to_timestamp(a.lastTransitionTime)
    local time_b = to_timestamp(b.lastTransitionTime)
    return time_a < time_b -- Sort in ascending order (earliest first)
  end)
end
-- Process all the conditions from oldest to newest (sorted above), so that
-- the most recent transition determines the final status.
for _, condition in ipairs(obj.status.conditions) do
  if condition.type == "LastAsyncOperation" then
    if condition.status == "False" then
      health_status.status = "Degraded"
      health_status.message = condition.message
    end
  end

  if condition.type == "Synced" then
    -- A condition may carry no message (see the ReconcilePaused sample), so
    -- guard against nil and use a plain-text find instead of a pattern match.
    local unresolved = condition.message ~= nil
      and string.find(condition.message, "cannot resolve references", 1, true) ~= nil
    if condition.status == "False" and unresolved then
      health_status.status = "Progressing"
      health_status.message = condition.message
    elseif condition.status == "False" and condition.reason == "ReconcilePaused" then
      health_status.status = "Suspended"
      health_status.message = condition.message
    elseif condition.status == "False" then
      health_status.status = "Degraded"
      health_status.message = condition.message
    end
  end

  if contains({"Ready", "Healthy", "Offered", "Established"}, condition.type) then
    if condition.status == "True" then
      health_status.status = "Healthy"
      health_status.message = "Resource is up-to-date."
    elseif condition.status == "False" and condition.reason == "Creating" then
      health_status.status = "Progressing"
      health_status.message = condition.message
    end
  end
end

return health_status
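As a possible simplification of the sorting step, timestamps in one fixed-width UTC format compare correctly as plain strings, so the conversion through os.time could be skipped. The sketch below assumes every lastTransitionTime has the YYYY-MM-DDTHH:MM:SSZ form shown in the examples above; that is an assumption about the input, not something the script verifies.

```lua
-- Conditions with timestamps taken from the examples above, deliberately out of order.
local conditions = {
  { type = "Ready", lastTransitionTime = "2024-10-10T19:44:11Z" },
  { type = "Synced", lastTransitionTime = "2024-11-15T14:17:38Z" },
  { type = "LastAsyncOperation", lastTransitionTime = "2024-10-10T16:12:07Z" },
}

-- Fixed-width timestamps in the same timezone sort lexicographically,
-- so a string comparison orders them oldest first.
table.sort(conditions, function(a, b)
  return (a.lastTransitionTime or "") < (b.lastTransitionTime or "")
end)

local order = {}
for i, c in ipairs(conditions) do
  order[i] = c.type
end
print(table.concat(order, ", "))
-- LastAsyncOperation, Ready, Synced
```

Either way, table.sort is not a stable sort, so conditions that share the same timestamp (as the Synced and LastAsyncOperation conditions do in Example 1) may be processed in either order.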
@negz I would love your feedback on this. We discussed the health states and transitions of resources in a past thread.
I've added my examples to a repo to help demonstrate. (It is effectively a fork of the Argo CD repository, cleaned up so it only contains my test cases.) See the README:
https://github.com/patpicos/crossplane-health-checks
I've updated the logic in the original post based on a few edge cases we encountered after deploying against a broader set of resources.
- More defensive checks
- Exposing cases that should show the status as Suspended
- More accurately representing when the resource is progressing or waiting for information from a previous resource before progressing (the original health check in the Crossplane documentation erroneously showed these as Degraded right off the bat)