kubeflow/manifests

Upgrade to use Kustomize v4+

zijianjoy opened this issue · 46 comments

According to the README of kubeflow/manifests (https://github.com/kubeflow/manifests#prerequisites), Kustomize v3.2.0 is a prerequisite for building resources. However, Kustomize v4+ is available. Can we upgrade the prerequisite to a newer version of Kustomize?

One thing that needs to change: the load_restrictor flag was renamed in Kustomize v4+, so we need to make the following updates in docs and build commands:

Change from --load_restrictor=none to --load-restrictor LoadRestrictionsNone. Note the difference between _ and -.
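
For example, a build command would change roughly like this (the example directory here is just a placeholder target):

  # Kustomize v3.2.0
  kustomize build --load_restrictor=none example
  # Kustomize v4+
  kustomize build --load-restrictor LoadRestrictionsNone example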

@zijianjoy thanks a lot for opening this issue to switch our target version to the latest Kustomize v4.
This was not in our plan for 1.3.0, the reason being that we didn't want to change too many things together.
But it can very well be in our plans for the point release or 1.4. Let's use this issue to track the effort.

I am using Kustomize v4.0.5 for https://github.com/kubeflow-onprem/ArgoFlow and everything works as expected. The --load_restrictor flag is also not necessary, except for the Katib manifests, which should be fixed; I already have a PR open for this. The Argo installation does not use load_restrictor at all (for Katib I am using my fixed version of the manifests), which shows that this flag is not necessary for deploying Kubeflow.

@zijianjoy One interesting anomaly I did encounter is that running kustomize build github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic-multi-user-pns works for kustomize 3.2.1 and 4.0.5, but not for 3.9.4 (none of the other manifests had any issue with 3.9.4, the default for Argo CD v2). There was a breaking change in how remote bases are fetched between kustomize v3 and v4 (s3 isn't supported anymore), so this is something to potentially keep in mind (not that I think we should update to an outdated version, though).
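
For anyone unfamiliar with remote bases, they are referenced from a kustomization.yaml roughly like this (a minimal sketch; the ref value is illustrative):

  apiVersion: kustomize.config.k8s.io/v1beta1
  kind: Kustomization
  resources:
  # remote base fetched over the network at build time
  - github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic-multi-user-pns?ref=1.5.0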

This issue burns me over and over. I am used to installing the latest version of tools and I ran into this again trying to work out the latest release configs for Azure.

As far as I can tell, 3.9.2+ is also broken with Knative:
kustomize build --load_restrictor=none common/knative/knative-eventing-install/base

errors with

... at path 'spec/template/metadata/labels': wrong Node Kind for spec.template.metadata.labels expected: MappingNode was AliasNode: value: {*ref_0}

The latest versions that work for me are 3.9.1 and 3.8.10. I installed and tested about 20 different versions.

FYI @davidspek, with 4.0.5 Knative is also broken as-is:

 ~/src/kubeflow/manifests-1.3-branch:  kustomize build --load-restrictor LoadRestrictionsNone common/knative/knative-eventing-install/base
wrong Node Kind for spec.template.metadata.labels expected: MappingNode was AliasNode: value: {*ref_0}
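
For context, the {*ref_0} in the error points to YAML anchors/aliases in the Knative manifests: newer kustomize versions refuse an alias node where they expect a plain mapping, which is why removing the "pointers" mentioned below fixes it. A minimal illustration of the pattern (not the actual Knative YAML):

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: example
    labels: &labels   # anchor defining the labels mapping
      app: example
  spec:
    selector:
      matchLabels: *labels
    template:
      metadata:
        labels: *labels   # an alias at spec.template.metadata.labels triggers the AliasNode error
      spec:
        containers:
        - name: example
          image: example:latest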

@yanniszark
I really think this is a release-blocking issue because neither the latest 3.X version nor any 4.X version of kustomize works. On the one hand we are telling people that kfdef is no longer the recommended tool and to use kustomize instead; on the other hand, we don't support any modern version of kustomize.

@yanniszark with the PR to remove pointers from the Knative YAML everything seems to be working fine.

I can generate everything with v4.0.5 by running the following from the example folder you provided:

kustomize build --load-restrictor LoadRestrictionsNone .

@berndverst You are correct. I had forgotten about this one but I had already created a PR to fix this in the KNative manifests 4 days ago.

#1795

Also, you don't need the --load-restrictor LoadRestrictionsNone flag so you shouldn't be using it as it disables a security check in kustomize.

@davidspek I noticed this morning, when they were marked as dupes, that I had apparently created 2 PRs that were duplicates of yours. Sigh. At least we are getting those issues addressed now!

@davidspek I found I needed the load restrictor flag for Katib when I tried this yesterday. But I saw that this is also being addressed somewhere.

@berndverst Indeed, Katib should be the only one. I am working with @andreyvelich to remove this requirement in kubeflow/katib#1498.

On my branch with these fixes you can run:
kustomize build github.com/davidspek/katib/manifests/v1beta1/installs/katib-with-kubeflow-cert-manager?ref=fix-manifests
And it will build the manifests directly, no need to download them locally.

Bobgy commented

I want to add another data point that sticking to kustomize 3.2.0 is not a good idea.

See kubernetes-sigs/kustomize#1500 (comment): starting from kubectl release 1.21, it ships with kustomize v3.9. In the not-so-distant future, people will be able to use a relatively new version of kustomize from kubectl. That can be an onboarding experience improvement if kubectl is the only tool people need.
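
For example, the kubectl-only workflow would look roughly like this (the example directory is a placeholder for whatever kustomization is being built):

  # render the manifests with the kustomize bundled in kubectl >= 1.21
  kubectl kustomize example
  # or render and apply in one step
  kubectl apply -k example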

I don't think this is very urgent, considering the speed at which people upgrade Kubernetes clusters; it could be enough to start supporting later kustomize versions from Kubeflow 1.4 onwards.

Bobgy commented

@berndverst You are correct. I had forgotten about this one but I had already created a PR to fix this in the KNative manifests 4 days ago.

#1795

Also, you don't need the --load-restrictor LoadRestrictionsNone flag so you shouldn't be using it as it disables a security check in kustomize.

@davidspek I want to remind you why we decided to use load restrictor none; read https://bit.ly/kf_kustomize_v3. It was introduced mainly because we want people to build reusable patches that can be composed, instead of relying on overlays (which are an inheritance model and cannot be composed).

So far, I think https://kubectl.docs.kubernetes.io/guides/config_management/components/ is a feature designed to solve this problem: it allows reusable composition of kustomization transformations without turning off the security check.

But I haven't used it much and I don't see much of the community using it; I wonder what people think about it.
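
For reference, a kustomize component is declared with its own kind and pulled into a kustomization via a components field; a minimal sketch (file names and the patch are illustrative):

  # components/my-patch/kustomization.yaml
  apiVersion: kustomize.config.k8s.io/v1alpha1
  kind: Component
  patchesStrategicMerge:
  - patch.yaml

  # kustomization.yaml that composes the component
  apiVersion: kustomize.config.k8s.io/v1beta1
  kind: Kustomization
  resources:
  - base
  components:
  - components/my-patch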

@Bobgy I understand why the load-restrictor was useful, but currently only 1 component still needs it (and those manifests are better suited to overlays if you ask me). Indeed, the components feature is something I ran into as well, but it cannot be used at this point because for some reason kustomize 3.2.1 is being used (the load restrictor can still be used with newer versions, and all the manifests build with newer versions).
The only reason the version of kustomize matters at this point is that the load-restrictor flag has changed in newer versions, and this flag is being used for all the manifests even though only the Katib manifests need it (for now). If the load-restrictor flag is removed from the instructions, the kustomize version requirement can also be removed.

Bobgy commented

Good point! Then I agree it's very useful to fix Katib and let everyone stop using the flag; that makes the manifests compatible between v3 and v4.

@Bobgy For reference, here is the PR for updating the Katib manifests: kubeflow/katib#1498.

@berndverst we have merged all linked PRs necessary in order to use kustomize 4.0.5.
However, I have bumped into what seems to be a blocking issue in the latest kustomize.
The latest kustomize version orders admission webhooks (Validating/MutatingWebhookConfiguration) last.
This means that the installation is now prone to the following race condition:

  1. Pods are created before the istio injection webhook, thus they don't get a sidecar.
  2. Apps that integrate with CentralDashboard like KFP UI fail because they don't have a sidecar.

I tested the current example kustomization and it actually failed. Did you also encounter this problem?
I will open an upstream kustomize issue on Monday.

Bobgy commented

@yanniszark I think we'd better break up the installation into several steps to avoid the race condition in the first place. There is no guarantee how long each webhook server takes to start either.

I've deployed the manifests manually with Kustomize 4.0.5 as well and have only run into the chicken & egg situation a few times; rerunning the command (instantly), as stated in the current instructions, solved it.

@yanniszark It looks like there is an existing discussion about the ordering, see kubernetes-sigs/kustomize#821. In particular this comment and the following one are about the ordering of MutatingWebhookConfigurations.

@yanniszark I think we'd better break up the installation into several steps to avoid the race condition in the first place. There is no guarantee how long each webhook server takes to start either.

True, but this is not a problem with the current single-command installation, right? Because if a webhook Pod is not up yet, kubectl exits with a non-zero code and is retried.

I've deployed the manifests manually with Kustomize 4.0.5 as well and have only run into the chicken & egg situation a few times; rerunning the command (instantly), as stated in the current instructions, solved it.

@davidspek you are referring to the scenario described by @Bobgy, right? That is, trying to create an object while its registered webhook Service does not have any ready Pods yet.

In the scenario I described, I believe retrying won't help, as the Pods need to be recreated in order to be injected with the Istio sidecar.

@yanniszark It looks like there is an existing discussion about the ordering, see kubernetes-sigs/kustomize#821. In particular this comment and the following one are about the ordering of MutatingWebhookConfigurations.

Yeap! I am currently going through the relevant kustomize PRs / issues and I will open an upstream issue to discuss.

@Bobgy @yanniszark The PR to make Katib's manifests not need the load-restrictor (and some other enhancements/optimizations) has just been merged: kubeflow/katib#1498. It does still need to be picked into the current release branch. After this is done, the load-restrictor shouldn't be needed for any of the Kubeflow manifests anymore and v3/v4 should both work without issue. I've been deploying everything with 4.0.5 and without the load-restrictor flag.

Bobgy commented

@yanniszark I think we are talking about the same problem: if Istio sidecars are not injected, retrying doesn't help.

So I suggest separating steps as necessary, e.g. Istio setup should finish before anything that may depend on it. It's OK not to separate steps for race conditions that can be resolved via retry.

Right now, it seems we need two steps:

  1. apply istio & its dependencies
  2. apply other stuff
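
A rough sketch of what those two steps could look like (a sketch only; the paths assume the Istio manifests live under common/istio-1-9 and the example kustomization covers the rest):

  # Step 1: apply Istio and its dependencies, then wait for the injection webhook to be ready
  kustomize build common/istio-1-9/istio-install/base | kubectl apply -f -
  kubectl wait --for=condition=Available deployment/istiod -n istio-system --timeout=300s
  # Step 2: apply everything else
  kustomize build example | kubectl apply -f -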

@Bobgy for this:

I think we'd better break up the installation into several steps to avoid the race condition in the first place

We already have the step-by-step instructions. From your later message, I see that you are talking about just 2 steps, instead of a full step-by-step installation. This is definitely a possible workaround, but there might still be some webhook configurations that are skipped because of kustomize's new ordering, so we need to look into that carefully if we decide to go down this path.

In addition, I have opened an upstream issue in kustomize and I'm testing a fix:
kubernetes-sigs/kustomize#3794

Just for reference, I've not seen this issue with the Istio sidecars not being injected and causing problems when deploying with Argo CD. I'm not sure if this is just luck, or if Argo CD is handling this in some way behind the scenes.

Are the manifests going to revert to referencing the 'upstream' manifests through URLs for the next release, or is this planned for further in the future? This would drastically reduce the overhead of pulling all the upstream manifests into this repository; it would simply be a link.

Are the manifests going to revert to referencing the 'upstream' manifests through URLs for the next release, or is this planned for further in the future? This would drastically reduce the overhead of pulling all the upstream manifests into this repository; it would simply be a link.

That is something that should be discussed after the 1.3 release. @connorlwilkes would you like to open an issue to propose this change and the benefits / drawbacks? cc @davidspek who has also used this method in Argoflow.
Since this issue is about using kustomize v4 and the issues blocking us from doing so, let's spin this discussion off into a separate issue.

@yanniszark @connorlwilkes After the release I was also planning on starting this discussion. I think this will be particularly useful once I have Renovate properly configured for everything, as the only upkeep would then be merging auto-created PRs. Distributions would then simply fork the repo and add their customizations to it, similarly to how people are using my Argo installation now. But indeed, this is off topic for this issue and should be discussed somewhere else.

@yanniszark I've just done a deployment with Argo CD on a fresh Equinix Metal cluster, and Istio was failing to deploy because I hadn't set the necessary api-server flags. I see the following event on all the components related to the sidecar-injection conversation above, and their deployments fail because of it, but after Istio comes up the deployments come up as well:

Error creating: Internal error occurred: failed calling webhook "sidecar-injector.istio.io": Post "https://istiod.istio-system.svc:443/inject?timeout=30s": dial tcp 10.98.192.83:443: connect: connection refused

@davidspek thanks for the report. I don't think it's relevant to the issue described here (#1797 (comment)) right?

@berndverst I wanted to ask, since you mentioned in your email that:

With that change the latest kustomize 4.0.5 seems to be working for me

Did you successfully deploy with kustomize 4.0.5 with the 1-command install and check that the web apps all work? I would have expected you to bump into kubernetes-sigs/kustomize#3794

@yanniszark The load_restrictor flag should probably be removed from the README now that the Katib manifests PR has been merged.

@yanniszark The load_restrictor flag should probably be removed from the README now that the Katib manifests PR has been merged.

Thanks! Want to create a quick PR? I can review/merge.

@yanniszark Sure, I'll have it done in a moment.

I am seeing an issue with https://github.com/kubeflow/pipelines/tree/1.5.0/manifests/kustomize/env/gcp/minio-gcs-gateway, which is pulled in as upstream from the pipelines repository, with kustomize 4.x: the "env" field is now deprecated in favor of "envs". You can see more detail here: kubernetes-sigs/kustomize#1069. I am sure this might be the case throughout this manifests repository.
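
For reference, the deprecation concerns configMapGenerator/secretGenerator entries: the singular env field (a single file path) is superseded by the envs list. A minimal sketch (names and file paths are illustrative):

  apiVersion: kustomize.config.k8s.io/v1beta1
  kind: Kustomization
  configMapGenerator:
  - name: example-config
    # deprecated form:
    # env: params.env
    # preferred form in kustomize 4.x:
    envs:
    - params.env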

@connorlwilkes Would you like to create a PR to fix that in the pipelines repo?

@connorlwilkes please file an issue in the kubeflow/pipelines repo. I think that pipelines does that to be compatible with kubectl.
The latest kustomize versions also support env, so it is not a blocking issue for moving to the latest kustomize.

Next Steps

After talking with the Kustomize team and presenting our problems, we have reached a conclusion.
We will implement a feature in the kustomization.yaml file that allows the order to be customized:
kubernetes-sigs/kustomize#3913

stale commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in one week if no further activity occurs. Thank you for your contributions.

@yanniszark AFAIU your contribution to kustomize ordering is in its final steps. When can we expect the Kubeflow manifests to support kustomize v4+?

stale commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in one week if no further activity occurs. Thank you for your contributions.

@yanniszark any updates?

When can we expect the Kubeflow manifests to support kustomize v4+?

Hey folks, just a quick update that the PR to introduce support for custom ordering in kustomize is in the final review stage: kubernetes-sigs/kustomize#4019
Thanks for your patience :)

Kustomize PR is merged: kubernetes-sigs/kustomize#4019
The new functionality should be coming in the next kustomize release!

@yanniszark is there an easy way to enable the ordering in kustomize 4 and who is going to change the install instructions?

@juliusvonkohout the way to customize the order is through a new sortOptions field in kustomization.yaml.
You can see an example here:
https://github.com/kubernetes-sigs/kustomize/blob/71eb865cea2646ccf5e4bdeeb555f85c6604b537/api/krusty/sortordertransformer_test.go#L113-L144

This would go in the example kustomization. I think @kimwnasptd will coordinate the effort; it should be a straightforward PR.
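
Based on the linked test, the example kustomization would customize ordering roughly like this (a sketch only; the resource entry and the kind lists are illustrative):

  apiVersion: kustomize.config.k8s.io/v1beta1
  kind: Kustomization
  resources:
  - base   # the existing resources list stays as it is
  sortOptions:
    order: legacy
    legacySortOptions:
      orderFirst:
      - Namespace
      - CustomResourceDefinition
      - MutatingWebhookConfiguration   # e.g. ensure the sidecar injector webhook exists before Pods
      orderLast:
      - ValidatingWebhookConfiguration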

We are now at kustomize 5

/close

There has been no activity for a long time. Please reopen if necessary.

@juliusvonkohout: Closing this issue.

In response to this:

We are now at kustomize 5

/close

There has been no activity for a long time. Please reopen if necessary.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.