ray-project/kuberay

[Bug] Issues with RayCluster CRD and kubectl apply

DmitriGekhtman opened this issue · 11 comments

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

Others

What happened + What you expected to happen

kubectl apply -k manifests/cluster-scope-resources yields the error
The CustomResourceDefinition "rayclusters.ray.io" is invalid: metadata.annotations: Too long: must have at most 262144 bytes.

Reason:
After re-generating the KubeRay CRD in #268, some pod template fields from recent versions of K8s were generated. Now the CRD is too big to fit in the metadata.lastAppliedConfiguration field used by kubectl apply.

The solution I'd propose is to move the CRD out of the kustomization file and advise users to kubectl create the CRD before installing the rest of the cluster-scoped resources.
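A sketch of that flow, assuming the CRD has been split out of the kustomization into its own file (the filename here is illustrative, not the actual repo layout):

```shell
# 1. Create the CRD directly; `kubectl create` does not record the
#    last-applied-configuration annotation, so the size limit is avoided.
kubectl create -f rayclusters-crd.yaml   # illustrative filename

# 2. Apply the remaining cluster-scoped resources as before.
kubectl apply -k manifests/cluster-scope-resources
```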

Reproduction script

See above.

Anything else

After running kubectl apply -k, I tried to kubectl delete -k so that I could subsequently kubectl create -k.
Unfortunately, my ray-system namespace is hanging in a terminating state!
edit: My ray-system namespace is hanging simply because the cluster is completely borked.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

In case this helps other people using ArgoCD to deploy KubeRay: we solved this issue with a Kustomization that patches the RayCluster CRD with the annotation argocd.argoproj.io/sync-options: Replace=true, which makes ArgoCD use kubectl replace instead of kubectl apply when syncing this particular resource:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - https://github.com/ray-project/kuberay/manifests/cluster-scope-resources/?ref=master
  - https://github.com/ray-project/kuberay/manifests/base/?ref=master

patchesStrategicMerge:
  # CRD rayclusters.ray.io manifest is too big to fit in the
  # annotation `kubectl.kubernetes.io/last-applied-configuration`
  # added by `kubectl apply` used by ArgoCD, and so it fails
  # https://github.com/ray-project/kuberay/issues/271
  # Annotate this CRD to make ArgoCD use `kubectl replace` and avoid the error when syncing it
  - |-
    apiVersion: apiextensions.k8s.io/v1
    kind: CustomResourceDefinition
    metadata:
      name: rayclusters.ray.io
      annotations:
        argocd.argoproj.io/sync-options: Replace=true

I have the same issue.

We'll start by replacing "apply" in the docs with "create". Then we'll look into shrinking the CRD.
It seems this bug comes up from time to time in various K8s projects...
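One common way to shrink a generated CRD, used by other operator projects that hit this limit, is to drop field descriptions at generation time via controller-gen's maxDescLen option. Whether KubeRay's build wires this in is an assumption, so treat the invocation as a sketch:

```shell
# Regenerate CRDs with field descriptions stripped (maxDescLen=0), which
# substantially reduces manifest size for pod-template-heavy CRDs.
controller-gen crd:maxDescLen=0 paths=./... \
  output:crd:artifacts:config=config/crd/bases
```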

Also, to extend this: should kubectl get show status and restart counts for running clusters?

$ kubectl get rayclusters 
NAME                  AGE
raycluster-complete   7m48s

it used to be

$ kubectl -n ray get rayclusters
NAME              STATUS    RESTARTS   AGE
example-cluster   Running   0          53s

Status could make sense -- it would simply indicate the status of the head pod.
Restarts are a flimsier notion because we don't have a coherent definition of what constitutes a restart -- I suppose it would be the number of head container restarts plus the number of head pod replacements.

We could potentially take a look at what the K8s deployment controller does.
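For reference, the extra kubectl get columns come from additionalPrinterColumns in the CRD spec. A minimal sketch, assuming a .status.state field (the jsonPath values here are assumptions about the RayCluster status schema, not its actual fields):

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: rayclusters.ray.io
spec:
  # group, names, and schema omitted for brevity
  versions:
    - name: v1alpha1
      served: true
      storage: true
      additionalPrinterColumns:
        # jsonPath values below are illustrative assumptions
        - name: Status
          type: string
          jsonPath: .status.state
        - name: Age
          type: date
          jsonPath: .metadata.creationTimestamp
```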

Try using kubectl apply --server-side
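Server-side apply tracks field ownership in managedFields on the server, so kubectl no longer writes the manifest into the last-applied-configuration annotation. A sketch of the invocation:

```shell
# Server-side apply: no client-side last-applied-configuration annotation,
# so the 262144-byte annotation limit is not hit.
kubectl apply --server-side -k manifests/cluster-scope-resources
```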

For the Argo CD users, maybe we can add some instructions into the document?
Just like I did for the Flink operator project
https://github.com/apache/flink-kubernetes-operator/blob/main/docs/content/docs/operations/helm.md#working-with-argo-cd

@haoxins
That sounds good.
If you have a working set-up with Argo CD / Helm / KubeRay, feel free to open a PR adding the relevant info to the README!
https://github.com/ray-project/kuberay/blob/master/helm-chart/kuberay-operator/README.md


#535

We could update the docs to mention that kubectl apply --server-side works.

I think for the moment, the only actionable item is the documentation item described in the last comment.
Going to remove the 0.4.0 milestone label from this issue because docs are not currently versioned.