kubernetes/kubectl

Wait for a CRD type to deploy before deploying resources that use the type

ringerc opened this issue · 12 comments

Presently, when kubectl is used to apply a large manifest that defines new custom resource definitions (CRDs) as well as resources that use the new resource kinds, race conditions can cause the deployment to fail. Assuming you're using kubectl apply -f - with an external kustomize, you might see an error like:

unable to recognize "STDIN": no matches for kind "Alertmanager" in version "monitoring.coreos.com/v1"

(the exact resource "kind" and API "version" will vary depending on what you're deploying).

This appears to be a race between the k8s cluster registering the new CRD types and kubectl sending requests that use the new types, but there's no indication of that in the command's output. It's confusing for new users, and it's a hassle operationally since deployments fail, then succeed when retried. This is something the kubectl tool could help users with.

The --server-side option does not help, as the same race occurs then. And --wait=true only affects resource removal, not creation.

This can often be reproduced with a kind cluster, though it varies since it's a race. For example:

kind create cluster
git clone -b v0.8.0 https://github.com/prometheus-operator/kube-prometheus
kubectl apply -k kube-prometheus/

... which will often fail with:

daemonset.apps/node-exporter created
unable to recognize "kube-prometheus/": no matches for kind "Alertmanager" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "Prometheus" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "PrometheusRule" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "PrometheusRule" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "PrometheusRule" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "PrometheusRule" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "PrometheusRule" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "PrometheusRule" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "PrometheusRule" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
unable to recognize "kube-prometheus/": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"

but when the same command is repeated, it will succeed:

daemonset.apps/node-exporter unchanged
alertmanager.monitoring.coreos.com/main created
prometheus.monitoring.coreos.com/k8s created
prometheusrule.monitoring.coreos.com/alertmanager-main-rules created
prometheusrule.monitoring.coreos.com/kube-prometheus-rules created
...

There doesn't seem to be any (obvious) kubectl flag to impose a delay between requests, wait for a new resource to become visible before continuing, or retry a request if it fails because of a server-side error indicating something was missing.

The error message is confusing for new users and definitely does not help. A wording change and some context would help a lot. I raised that separately: #1118

See related issue discussing retries: kubernetes/kubernetes#5762 (comment)

The server does not issue HTTP status code 429 here, presumably because it doesn't know there's a pending deployment that will create this resource.

See the related issue in kube-prometheus, but note that this is far from specific to kube-prometheus: it can affect anything where races exist between resource creation and resource use.

prometheus-operator/prometheus-operator#1866

A workaround is to use kfilt to deploy the CRDs first, then wait for them to become visible, then deploy the rest:

kustomize build somedir | kfilt -i kind=CustomResourceDefinition | kubectl apply -f -
kustomize build somedir | kfilt -i kind=CustomResourceDefinition | kubectl wait --for condition=established --timeout=60s -f -
kustomize build somedir | kubectl apply -f - 
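
For anyone scripting this, here is a rough sketch of the same three steps wrapped in a function. It renders the manifest once, so every phase sees identical input, and it aborts if the wait times out. The name deploy_in_phases is made up for illustration. It assumes kustomize, kfilt and kubectl are on the PATH and uses only the flags already shown above. The empty-file check is my own addition for manifests that contain no CRDs.

```shell
#!/usr/bin/env bash

# deploy_in_phases DIR
# Hypothetical helper (not part of kubectl or kustomize).
# The body runs in a subshell so that `set -e` and the EXIT trap stay local.
deploy_in_phases() (
  set -euo pipefail
  dir=$1
  manifest=$(mktemp)
  crds=$(mktemp)
  trap 'rm -f "$manifest" "$crds"' EXIT

  # Render once so every phase works from exactly the same manifest.
  kustomize build "$dir" > "$manifest"
  kfilt -i kind=CustomResourceDefinition < "$manifest" > "$crds"

  # Phase 1: create the CRDs and block until the API server serves them.
  if [ -s "$crds" ]; then
    kubectl apply -f "$crds"
    kubectl wait --for condition=established --timeout=60s -f "$crds"
  fi

  # Phase 2: apply everything; the CRDs are simply reported as unchanged.
  kubectl apply -f "$manifest"
)
```

Call it as `deploy_in_phases somedir`. Note that `set -e` is suppressed if the function is called as part of an `if` or `||` expression, so call it as a plain command.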

This isn't something that we would implement for kubectl but we will investigate which part of apply can provide a better error. We'll handle that in #1118.

/close

@eddiezane: Closing this issue.

@KnVerey do you know if Kustomize has any hooks authors can use similar to how Helm handles CRD installation separately?

It does not. Kustomize is purely client-side, so it has no deploy-related features.

That's unfortunate, because it means basically every different user who wants robust deployment has to implement repetitive logic like:

kfilt -i kind=CustomResourceDefinition myconfig.yaml | kubectl apply -f -
kfilt -i kind=CustomResourceDefinition myconfig.yaml | kubectl wait --for condition=established --timeout=60s -f -
kfilt -i kind=Namespace myconfig.yaml | kubectl apply -f -
kfilt -i kind=Namespace myconfig.yaml | kubectl wait --for jsonpath='{.status.phase}'=Active --timeout=60s -f -
kubectl apply -f myconfig.yaml

It's reasonable to expect the client feeding manifests to kubectl to order them correctly: CRDs first, then namespaces, then everything else, so that dependencies come before the resources that need them.

But it's a pity that it's seemingly not practical for kubectl to ensure the requests apply correctly.

If waiting isn't viable, what about a --retry-delay '1s' --max-retry-count 5 for retrying individual requests?
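
In the meantime, something close to that can be approximated outside kubectl. This is a minimal sketch of a generic retry wrapper. The function name retry and its argument order are my own invention, and kubectl has no such flags today. It retries on any non-zero exit status, not only on the "no matches for kind" error, so it is cruder than a real client-side implementation would be.

```shell
#!/usr/bin/env bash

# retry MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it succeeds or MAX_TRIES is reached.
retry() {
  local max_tries=$1 delay=$2 attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_tries" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

Usage would look like `retry 5 1 kubectl apply -f myconfig.yaml`. The manifest has to come from a file rather than a pipe, because stdin is already consumed by the first attempt.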

Thank you so much for the kfilt example above. I am running into this in another way.

My vagrant provisioning step fails because k3s is installed but has not finished setting up traefik. So applying my resources sometimes fails with

default: error: unable to recognize "/vagrant/traefik_certificate.yaml": no matches for kind "TLSStore" in version "traefik.containo.us/v1alpha1"

So I can now fix this with

$ kubectl wait --for condition=established crd tlsstores.traefik.containo.us

Thanks! I agree that a more generic solution in kubectl would be great.
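
One caveat with the one-liner above: as far as I can tell, kubectl wait with a condition returns a NotFound error straight away if the CRD object has not been created yet, which can happen while k3s is still installing traefik. Here is a sketch that first polls until the CRD exists and then waits for it to be established. The name wait_for_crd and the 60 second default are assumptions for illustration, not anything kubectl provides.

```shell
#!/usr/bin/env bash

# wait_for_crd NAME [TIMEOUT_SECONDS]
# Hypothetical helper: polls once per second until the CRD object exists,
# then lets kubectl wait for the Established condition.
wait_for_crd() {
  local name=$1 timeout=${2:-60} waited=0
  until kubectl get crd "$name" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out waiting for CRD $name to be created" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  kubectl wait --for condition=established --timeout="${timeout}s" "crd/$name"
}
```

For the case above that would be `wait_for_crd tlsstores.traefik.containo.us` before applying traefik_certificate.yaml.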