Wait for a CRD type to deploy before deploying resources that use the type
ringerc opened this issue · 12 comments
Presently, when kubectl is used to apply a large manifest that defines new custom resource definitions (CRDs) as well as resources that use the new kinds, a race condition can cause the deployment to fail. Assuming you're using kubectl apply -f - with an external kustomize, you might see an error like:
unable to recognize "STDIN": no matches for kind "Alertmanager" in version "monitoring.coreos.com/v1"
(the exact resource "kind" and API "version" will vary depending on what you're deploying).
This appears to be a race between the cluster registering the new CRD types and kubectl sending requests that use them, but there's no indication of that in the command's output. It's confusing for new users, and it's a hassle operationally, since deployments fail and then succeed when retried. This is something the kubectl tool could help users with.
The --server-side option does not help, as the same race occurs there, and --wait=true only affects resource removal, not creation.
This can often be reproduced with a kind cluster, though it varies since it's a race. For example:
kind create cluster
git clone -b v0.8.0 https://github.com/prometheus-operator/kube-prometheus
kubectl apply -k kube-prometheus/
... which will often fail with:
daemonset.apps/node-exporter created
unable to recognize "kube-prometheus/": no matches for kind "Alertmanager" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "Prometheus" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "PrometheusRule" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "PrometheusRule" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "PrometheusRule" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "PrometheusRule" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "PrometheusRule" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "PrometheusRule" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "PrometheusRule" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
but when the same command is repeated, it will succeed:
daemonset.apps/node-exporter unchanged
alertmanager.monitoring.coreos.com/main created
prometheus.monitoring.coreos.com/k8s created
prometheusrule.monitoring.coreos.com/alertmanager-main-rules created
prometheusrule.monitoring.coreos.com/kube-prometheus-rules created
...
There doesn't seem to be any (obvious) kubectl flag to impose a delay between requests, wait for a new resource to become visible before continuing, or retry a request if it fails because of a server-side error indicating something was missing.
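For comparison, the kind of client-side retry people end up scripting themselves looks roughly like this (a sketch only, reusing the kube-prometheus repro above; the retry count and delay are arbitrary):
for attempt in 1 2 3 4 5; do
  kubectl apply -k kube-prometheus/ && break
  echo "apply failed (attempt $attempt), retrying in 5s..." >&2
  sleep 5
done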
The error message is confusing for new users and definitely does not help. A wording change and some context would help a lot. I raised that separately: #1118
@ringerc: This issue is currently awaiting triage.
SIG CLI takes a lead on issue triage for this repo, but any Kubernetes member can accept issues by applying the triage/accepted label.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
See related issue discussing retries: kubernetes/kubernetes#5762 (comment)
The server does not issue HTTP status code 429 here, presumably because it doesn't know there's a pending deployment that will create this resource.
See the related issue in kube-prometheus, but note that this is far from specific to kube-prometheus; it can affect anything where a race exists between resource creation and resource use.
A workaround is to use kfilt to deploy the CRDs first, then wait for them to become visible, then deploy the rest:
kustomize build somedir | kfilt -i kind=CustomResourceDefinition | kubectl apply -f -
kustomize build somedir | kfilt -i kind=CustomResourceDefinition | kubectl wait --for condition=established --timeout=60s -f -
kustomize build somedir | kubectl apply -f -
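If you'd rather not pipe the build output back into kubectl wait, you can also wait on the CRDs by name; a sketch, where the names are the plural forms implied by the kinds in the errors above, so double-check them against kubectl get crd:
kubectl wait --for condition=established --timeout=60s \
  crd/alertmanagers.monitoring.coreos.com \
  crd/prometheuses.monitoring.coreos.com \
  crd/prometheusrules.monitoring.coreos.com \
  crd/servicemonitors.monitoring.coreos.com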
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Wait for a CRD type to deploy before deploying resources that use the type
This isn't something that we would implement for kubectl but we will investigate which part of apply can provide a better error. We'll handle that in #1118.
/close
@eddiezane: Closing this issue.
In response to this:
Wait for a CRD type to deploy before deploying resources that use the type
This isn't something that we would implement for kubectl but we will investigate which part of apply can provide a better error. We'll handle that in #1118.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@KnVerey do you know if Kustomize has any hooks authors can use similar to how Helm handles CRD installation separately?
It does not. Kustomize is purely client-side, so it has no deploy-related features.
That's unfortunate, because it means basically every user who wants a robust deployment has to implement repetitive logic like:
kfilt -i kind=CustomResourceDefinition -f myconfig.yaml | kubectl apply -f -
kfilt -i kind=CustomResourceDefinition -f myconfig.yaml | kubectl wait --for condition=established --timeout=60s -f -
kfilt -i kind=Namespace -f myconfig.yaml | kubectl apply -f -
kfilt -i kind=Namespace -f myconfig.yaml | kubectl wait --for condition=established --timeout=60s -f -
kubectl apply -f myconfig.yaml
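A sketch of what that repetitive logic tends to look like once factored into a small helper; the function name apply_kind is made up for illustration, and it assumes kfilt and kubectl are on the PATH:
apply_kind() {
  # Apply only resources of the given kind, optionally waiting for a condition on them.
  local kind="$1" condition="${2:-}"
  kfilt -i "kind=${kind}" -f myconfig.yaml | kubectl apply -f -
  if [ -n "$condition" ]; then
    kfilt -i "kind=${kind}" -f myconfig.yaml | kubectl wait --for "condition=${condition}" --timeout=60s -f -
  fi
}
apply_kind CustomResourceDefinition established
apply_kind Namespace
kubectl apply -f myconfig.yaml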
It's reasonable to expect the client sending data to kubectl to order it correctly, with CRDs first, then namespaces, then other resources, so that dependencies are sensible. But it's a pity that it's seemingly not practical for kubectl to ensure the requests apply correctly.
If waiting isn't viable, what about a --retry-delay '1s' --max-retry-count 5 for retrying individual requests?
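Purely as an illustration of the proposed syntax (neither flag exists in kubectl today):
kubectl apply -f myconfig.yaml --retry-delay 1s --max-retry-count 5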
Thank you so much for the kfilt example above. I am running into this in another way.
My Vagrant provisioning step fails because k3s is installed but has not yet finished setting up Traefik, so applying my resources sometimes fails with:
default: error: unable to recognize "/vagrant/traefik_certificate.yaml": no matches for kind "TLSStore" in version "traefik.containo.us/v1alpha1"
So I can now fix this with
$ kubectl wait --for condition=established crd tlsstores.traefik.containo.us
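In the Vagrant provisioning script that presumably means placing the wait just before the apply; a sketch, with the path taken from the error above and a timeout added so it can't hang forever:
kubectl wait --for condition=established --timeout=60s crd/tlsstores.traefik.containo.us
kubectl apply -f /vagrant/traefik_certificate.yaml
Note that kubectl wait errors out if the CRD object doesn't exist at all yet, so the wait itself may need a retry loop in that case.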
Thanks! I agree that a more generic solution in kubectl would be great.