kubernetes/kubernetes

containerized protobuf codegen does not handle .go-version / GOTOOLCHAIN properly

liggitt opened this issue · 19 comments

Seen in #124922

+++ command: bash "hack/make-rules/../../hack/verify-codegen.sh"
go version go1.22.3 linux/amd64
+++ [0517 06:22:42] Generating protobufs for 67 targets
+++ [0517 06:22:42] protoc 23.4 not found (can install with hack/install-protoc.sh); generating containerized...
+++ [0517 06:22:42] Verifying Prerequisites....
+++ [0517 06:22:43] Building Docker image kube-build:build-71fc6bd7d6-5-v1.28.0-go1.21.10-bullseye.0
+++ [0517 06:24:57] Creating data container kube-build-data-71fc6bd7d6-5-v1.28.0-go1.21.10-bullseye.0
+++ [0517 06:24:58] Syncing sources to container
+++ [0517 06:25:03] Output from this container will be rsynced out upon completion. Set KUBE_RUN_COPY_OUTPUT=n to disable.
+++ [0517 06:25:03] Running build command...
go: downloading go1.22.3 (linux/amd64)
go: download go1.22.3 for linux/amd64: toolchain not available
!!! [0517 06:25:04] Call tree:
!!! [0517 06:25:04]  1: build/../build/common.sh:489 kube::build::run_build_command_ex(...)
!!! [0517 06:25:04]  2: build/run.sh:39 kube::build::run_build_command(...)
!!! [0517 06:25:04] Call tree:
!!! [0517 06:25:04]  1: hack/update-codegen.sh:919 codegen::protobuf(...)
+++ exit code: 1
+++ error: 1
�[0;31mFAILED�[0m   verify-codegen.sh	148s

Something is not working properly between .go-version → GOTOOLCHAIN when the go version inside this specific container doesn't match

/assign
/cc @BenTheElder @MadhavJivrajani

/kind bug

/triage accepted
/sig architecture testing
/area code-organization

go: download go1.22.3 for linux/amd64: toolchain not available

This is an error I've seen when we use go 1.x and not go 1.x.y as the go directive, some instances: golang/go#62278, dependabot/dependabot-core#7895

Which is why what you said here makes sense: #124922 (comment). Once the actual bump happens, GOTOOLCHAIN env variable that will propagate into the container will match with the toolchain that exists already and go will not try and download it again leading to the error (similar to the issues linked).

This seems to also be the reason the verify job failed here: #122410
When I had started the testing, master was still not on go1.22 leading to the kube-build go toolchain mismatch.

I wonder if it would be a good idea to substitute the value of .go-version directly in the kube-build script

But I guess the long term solution would be to incorporate the toolchain directive into our go mods and the tooling.

But I guess the long term solution would be to incorporate the toolchain directive into our go mods and the tooling.

This also has the benefit of enabling IDEs to respect it without our makefiles (though your builds still won't match us if you just go build), but we gain additional complexity making these align and we need to continue to respect .go-version / envs properly for now because downstream users depend on this "API"

Even if we switch to toolchain in go.mod we should probably make the env / .go-version work properly for at least one minor while putting a release note that downstreams must switch. (I'm sure an argument could be made for longer).

It would probably be preferable to keep them working permanently because nobody wants to carry go.mod patches but it is common to need to force a different go version for security patching etc.

Do we know yet if this applies to all branches?

Do we know yet if this applies to all branches?

yes, it applies to all branches (we saw this when trying to just bump .go-version to 1.22.x on release branches)

/assign

So, interestingly if I do GOTOOLCHAIN=go1.21.9 make verify WHAT=codegen, we get the same failure.

go: downloading go1.21.9 (linux/amd64)
go: download go1.21.9 for linux/amd64: toolchain not available

Seems like something is broken with GOTOOLCHAIN support in this go install or go, rather than our scripts?

Maybe something with the kube-build image.

Normal go image:

$ docker run --rm -e GOTOOLCHAIN=go1.22.3 golang:1.21.10-bullseye go version
Unable to find image 'golang:1.21.10-bullseye' locally
1.21.10-bullseye: Pulling from library/golang
3d53ef4019fc: Pull complete 
08f0bf643eb6: Pull complete 
6b037c2b46ab: Pull complete 
c9c00c8cd734: Pull complete 
a4de642a3616: Pull complete 
785d32889ff9: Pull complete 
4f4fb700ef54: Pull complete 
Digest: sha256:6815d296f20c3cb3afa67bf5dc862a831b6080f5dd8b5b423bb5f862539299de
Status: Downloaded newer image for golang:1.21.10-bullseye
go: downloading go1.22.3 (linux/amd64)
go version go1.22.3 linux/amd64

kube-cross:

docker run --rm -e GOTOOLCHAIN=go1.22.3 registry.k8s.io/build-image/kube-cross:v1.28.0-go1.21.10-bullseye.0 go version
Unable to find image 'registry.k8s.io/build-image/kube-cross:v1.28.0-go1.21.10-bullseye.0' locally
v1.28.0-go1.21.10-bullseye.0: Pulling from build-image/kube-cross
646e886fa3cf: Pull complete 
c5a360c5f105: Pull complete 
5cbeb8ef1d90: Pull complete 
f3054eaa1068: Pull complete 
fd19ff5befab: Pull complete 
029d7cfe87db: Pull complete 
4f4fb700ef54: Pull complete 
ab038281f51a: Pull complete 
9bc87ef1bf76: Pull complete 
eb4d85f62c54: Pull complete 
00a4d6140bd2: Pull complete 
30d8f173cf20: Pull complete 
1903caad7a82: Pull complete 
e7f153876917: Pull complete 
Digest: sha256:1996e852695eb1d3b48bda278a4584317d5d42eaef226899a331ea37394b6952
Status: Downloaded newer image for registry.k8s.io/build-image/kube-cross:v1.28.0-go1.21.10-bullseye.0
go: downloading go1.22.3 (linux/amd64)
go version go1.22.3 linux/amd64

kube-build:

$ docker run --rm -e GOTOOLCHAIN=go1.22.3 kube-build:build-9a4d0b6c27-5-v1.28.0-go1.21.10-bullseye.0 go version
go: downloading go1.22.3 (linux/amd64)
go version go1.22.3 linux/amd64

but under verify-codegen with GOTOOLCHAIN=go1.22.3 make verify WHAT=codegen:

...
+++ [0520 21:58:12] Building Docker image kube-build:build-17572b7f79-5-v1.28.0-go1.21.10-bullseye.0
+++ [0520 21:58:22] Creating data container kube-build-data-17572b7f79-5-v1.28.0-go1.21.10-bullseye.0
+++ [0520 21:58:23] Syncing sources to container
+++ [0520 21:58:31] Output from this container will be rsynced out upon completion. Set KUBE_RUN_COPY_OUTPUT=n to disable.
+++ [0520 21:58:31] Running build command...
go: downloading go1.22.3 (linux/amd64)
go: download go1.22.3 for linux/amd64: toolchain not available
...

.... hmm

narrowed it down to hack/lib/init.sh:

$ GOTOOLCHAIN=go1.22.3 build/run.sh bash -c 'source hack/lib/init.sh && go version && echo $GOTOOLCHAIN'
+++ [0520 22:09:21] Verifying Prerequisites....
+++ [0520 22:09:22] Building Docker image kube-build:build-b9c550bd56-5-v1.28.0-go1.21.10-bullseye.0
+++ [0520 22:09:31] Syncing sources to container
+++ [0520 22:09:33] Output from this container will be rsynced out upon completion. Set KUBE_RUN_COPY_OUTPUT=n to disable.
+++ [0520 22:09:33] Running build command...
go version go1.21.10 linux/amd64
go1.22.3
+++ [0520 22:09:34] Syncing out of container
$ GOTOOLCHAIN=go1.22.3 build/run.sh bash -c 'go version && echo $GOTOOLCHAIN'
+++ [0520 22:09:54] Verifying Prerequisites....
+++ [0520 22:09:54] Building Docker image kube-build:build-b9c550bd56-5-v1.28.0-go1.21.10-bullseye.0
+++ [0520 22:10:04] Syncing sources to container
+++ [0520 22:10:06] Output from this container will be rsynced out upon completion. Set KUBE_RUN_COPY_OUTPUT=n to disable.
+++ [0520 22:10:06] Running build command...
go: downloading go1.22.3 (linux/amd64)
go version go1.22.3 linux/amd64
go1.22.3
+++ [0520 22:10:13] Syncing out of container

Talked to @liggitt about this, who found GO111MODULE=off suspect, setting this will seemingly cause go version to ignore GOTOOLCHAIN.

even @ 765e7ef (just merged to master an hour ago) we can see:

$ GOTOOLCHAIN=go1.22.1 build/run.sh bash -c 'GO111MODULE=off go version && echo "GOTOOLCHAIN=$GOTOOLCHAIN"'
+++ [0520 23:39:30] Verifying Prerequisites....
+++ [0520 23:39:31] Building Docker image kube-build:build-3ad27617b0-5-v1.31.0-go1.22.3-bullseye.0
+++ [0520 23:41:10] Deleting image kube-build:build-3ad27617b0-5-v1.30.0-go1.22.2-bullseye.0
+++ [0520 23:41:12] Creating data container kube-build-data-3ad27617b0-5-v1.31.0-go1.22.3-bullseye.0
+++ [0520 23:41:23] Syncing sources to container
+++ [0520 23:41:36] Output from this container will be rsynced out upon completion. Set KUBE_RUN_COPY_OUTPUT=n to disable.
+++ [0520 23:41:36] Running build command...
go version go1.22.3 linux/amd64
GOTOOLCHAIN=go1.22.1
+++ [0520 23:41:37] Syncing out of container
$ GOTOOLCHAIN=go1.22.1 build/run.sh bash -c 'GO111MODULE=on go version && echo "GOTOOLCHAIN=$GOTOOLCHAIN"'
+++ [0520 23:41:54] Verifying Prerequisites....
+++ [0520 23:41:55] Building Docker image kube-build:build-3ad27617b0-5-v1.31.0-go1.22.3-bullseye.0
+++ [0520 23:42:04] Syncing sources to container
+++ [0520 23:42:06] Output from this container will be rsynced out upon completion. Set KUBE_RUN_COPY_OUTPUT=n to disable.
+++ [0520 23:42:06] Running build command...
go: downloading go1.22.1 (linux/amd64)
go version go1.22.1 linux/amd64
GOTOOLCHAIN=go1.22.1
+++ [0520 23:42:14] Syncing out of container

So this is probably related.

On master GOTOOLCHAIN=go1.22.1 make verify WHAT=codege is not broken, presumably because we don't set GO111MODULE=off anymore after @thockin's WORKSPACES work.

I think we can say this is already fixed @ HEAD and is just poorly behaved on the older branches.

We should probably hack in a gimme fallback for this case on older branches while doing our best to respect GOTOOLCHAIN after introducing it (IIRC we put a release note about using GOTOOLCHAIN to control go versions). We already use gimme if go is not available or too old.

Might not be worth prioritizing a patch given this is only relevant if you're:

  • trying to override the go version
  • not overriding the build container image to match
  • running on an old release branch

This should be fine in 1.30+ *

*well, there's a smaller bug in #125010

Thanks for digging here @BenTheElder!

Hmm, this is slightly confusing, on darwin:

❯ go version
go version go1.22.2 darwin/amd64

and with GO111MODULE=off

❯ GOTOOLCHAIN=go1.21.9 GO111MODULE=off go version
go version go1.21.0 darwin/amd64

I'm not sure why its go1.21.0, I don't have that toolchain present locally either...

with GO111MODULE=on:

❯ GOTOOLCHAIN=go1.21.9 GO111MODULE=on go version
go version go1.21.9 darwin/amd64

Right, so for the newer branches where we have modules on and workspaces adopted, that just leaves #125010

We could try to mitigate this in the older branches ... but it's a fairly niche issue and the workspaces change is too large to backport so we'll have to write something custom for the old branches ... and you can already work around it by just overriding the build image to have the right go versoin.