Turndown fails on GKE due to empty zone string
Observed problem
Turndown fails to run on a GKE cluster with the following config info:
- cluster-turndown-2.0.1
- GKE v1.22.8-gke.202
Logs from user environment:
I0706 21:00:43.033004 1 main.go:118] Running Kubecost Turndown on: REDACTED
I0706 21:00:43.059698 1 validator.go:41] Validating Provider...
I0706 21:00:43.061743 1 gkemetadata.go:92] [Error] metadata: GCE metadata "instance/attributes/kube-env" not defined
I0706 21:00:43.063220 1 gkemetadata.go:92] [Error] metadata: GCE metadata "instance/attributes/kube-env" not defined
I0706 21:00:43.063445 1 namedlogger.go:24] [GKEClusterProvider] Loading node pools for: [ProjectID: REDACTED, Zone: , ClusterID: REDACTED]
I0706 21:00:43.192046 1 validator.go:27] [Error]: Failed to load node groups: rpc error: code = InvalidArgument desc = Location "" does not exist.
Source of the error in code
This "Loading node pools" message, followed by the error comes from here in the GKE provider.
cluster-turndown/pkg/cluster/provider/gkeclusterprovider.go (lines 169 to 183 at c74e3bb)
The request being executed uses a path generator that fills in the empty zone string, causing the error:
cluster-turndown/pkg/cluster/provider/gkeclusterprovider.go (lines 496 to 500 at c74e3bb)
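For illustration, a minimal sketch of how an empty zone propagates into the request (the function name and path shape here are hypothetical stand-ins for the actual generator linked above):

```go
package main

import "fmt"

// getClusterPath is a hypothetical stand-in for the provider's path
// generator: it interpolates project, location, and cluster into a
// GKE API resource name.
func getClusterPath(project, location, cluster string) string {
	return fmt.Sprintf("projects/%s/locations/%s/clusters/%s", project, location, cluster)
}

func main() {
	// When the zone lookup fails, location is "" and the generated
	// path contains an empty segment, which the API rejects with:
	// InvalidArgument: Location "" does not exist.
	fmt.Println(getClusterPath("my-project", "", "my-cluster"))
	// projects/my-project/locations//clusters/my-cluster
}
```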
We're using md.client.InstanceAttributeValue("kube-env") to get the GCP zone/location:
cluster-turndown/pkg/cluster/provider/gkemetadata.go (lines 84 to 94 at c74e3bb)
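Roughly, the lookup looks like this (a sketch using cloud.google.com/go/compute/metadata; the ZONE key scan is illustrative, the exact parsing lives in gkemetadata.go):

```go
package provider

import (
	"log"
	"strings"

	"cloud.google.com/go/compute/metadata"
)

// zoneFromKubeEnv sketches the current approach: fetch the kube-env
// instance attribute and scan it for a zone entry. With metadata
// concealment enabled the attribute is hidden, InstanceAttributeValue
// returns a NotDefinedError, and the caller ends up with an empty zone.
func zoneFromKubeEnv(client *metadata.Client) string {
	kubeEnv, err := client.InstanceAttributeValue("kube-env")
	if err != nil {
		// Matches the logged error: metadata: GCE metadata
		// "instance/attributes/kube-env" not defined
		log.Printf("[Error] %v", err)
		return ""
	}
	for _, line := range strings.Split(kubeEnv, "\n") {
		if strings.HasPrefix(line, "ZONE: ") {
			return strings.TrimPrefix(line, "ZONE: ")
		}
	}
	return ""
}
```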
Possible cause
This may not be caused by the absence of the kube-env metadata, but rather by a lack of access to it. GKE offers "metadata concealment", which specifically calls out kube-env as data to be hidden. kube-env is also mentioned in GKE's NodeMetadata config "SECURE" setting.
Possible solution
The reporting user has suggested a different attribute value to use: cluster-location
curl -L -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/cluster-location
europe-west2
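If we go that route, the Go equivalent of that curl, via the metadata package the provider already uses, would be something like:

```go
package provider

import "cloud.google.com/go/compute/metadata"

// clusterLocation reads the GKE-provided cluster-location instance
// attribute, e.g. "europe-west2" for a regional cluster or
// "europe-west2-a" for a zonal one.
func clusterLocation(client *metadata.Client) (string, error) {
	return client.InstanceAttributeValue("cluster-location")
}
```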
If this is a stable attribute provided by GKE-provisioned VMs, this probably works. We could also investigate using v1/instance/zone as an alternative; it seems to be officially guaranteed on all GCP VMs. Other stable sources of node(pool) location information may be preferable; I just haven't dug deep enough to find them yet.
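For comparison, v1/instance/zone is exposed directly by the same package; note (as discussed in the comments below) that it returns the zone of the VM the code runs on, which may not match the cluster's location:

```go
package provider

import "cloud.google.com/go/compute/metadata"

// localZone returns the zone of the node this process is running on,
// e.g. "us-central1-b". This is not necessarily the cluster's location.
func localZone() (string, error) {
	return metadata.Zone()
}
```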
Other considerations
It is currently unclear whether this affects all GKE environments or only those with a certain version, region, or configuration (e.g. metadata concealment). Any fix here should be tested on earlier GKE versions to ensure compatibility.
@michaelmdresser We need a priority status on this. Can it wait till v1.98 or does it need to go into v1.97?
@michaelmdresser One thing I wanted to point out here: it's important to get this right for multi-zone clusters (we need the zone of the "master"). I didn't know about cluster-location, but that seems like it could be adequate. I'm pretty sure that v1/instance/zone just gives you the local zone (relative to the node/pod).
I think the concealment of kube-env is because of this exploit, so we'll definitely want to find a way around this.
And yes, it looks like cluster-location is likely what we want: https://cloud.google.com/kubernetes-engine/docs/concepts/workload-identity
I think this is one of those things where, if concealment is enabled, we can expect the new version of Workload Identity. So maybe we can fall back to the "new" approach if the old approach doesn't work.
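A minimal sketch of that fallback ordering, reusing the hypothetical helpers sketched above (the real provider wiring will differ):

```go
package provider

import "cloud.google.com/go/compute/metadata"

// location tries the legacy kube-env lookup first and, if concealment
// or Workload Identity hides that attribute, falls back to the
// cluster-location instance attribute.
func location(client *metadata.Client) (string, error) {
	if zone := zoneFromKubeEnv(client); zone != "" {
		return zone, nil
	}
	return client.InstanceAttributeValue("cluster-location")
}
```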
Thanks for the extra digging, Bolt! Should be super helpful for a fix.
> We need a priority status on this. Can it wait till v1.98 or does it need to go into v1.97?
@Adam-Stack-PM I haven't had time to investigate the impact, so I can't give it a priority. This may affect most GKE clusters (high priority, probably 1.97) or very few (on the fence about 1.97 vs. 1.98).
@michaelmdresser, Thanks for the context here. I am labeling it P1 for now with a requirement to understand the impact before releasing v1.97.
I am facing the exact same issue. I noticed that the identified fix has been removed from 1.97. Was there a reason to remove this from the scope of 1.97? Any impact identified? Thanks a lot for your support on this.