kubecost/cluster-turndown

Turndown fails on GKE due to empty zone string


Observed problem

Turndown fails to run on a GKE cluster with the following config info:

  • cluster-turndown-2.0.1
  • GKE v1.22.8-gke.202

Logs from user environment:

I0706 21:00:43.033004       1 main.go:118] Running Kubecost Turndown on: REDACTED
I0706 21:00:43.059698       1 validator.go:41] Validating Provider...
I0706 21:00:43.061743       1 gkemetadata.go:92] [Error] metadata: GCE metadata "instance/attributes/kube-env" not defined
I0706 21:00:43.063220       1 gkemetadata.go:92] [Error] metadata: GCE metadata "instance/attributes/kube-env" not defined
I0706 21:00:43.063445       1 namedlogger.go:24] [GKEClusterProvider] Loading node pools for: [ProjectID: REDACTED, Zone: , ClusterID: REDACTED]
I0706 21:00:43.192046       1 validator.go:27] [Error]: Failed to load node groups: rpc error: code = InvalidArgument desc = Location "" does not exist.

Source of the error in code

This "Loading node pools" message, followed by the error comes from here in the GKE provider.

// GetNodePools loads all of the provider NodePools in a cluster and returns them.
func (p *GKEClusterProvider) GetNodePools() ([]NodePool, error) {
    ctx := context.TODO()
    projectID := p.metadata.GetProjectID()
    zone := p.metadata.GetMasterZone()
    cluster := p.metadata.GetClusterID()

    req := &container.ListNodePoolsRequest{Parent: p.getClusterResourcePath()}
    p.log.Log("Loading node pools for: [ProjectID: %s, Zone: %s, ClusterID: %s]", projectID, zone, cluster)

    resp, err := p.clusterManager.ListNodePools(ctx, req)
    if err != nil {
        return nil, err
    }

The request being executed uses a path generator, which fills in the empty zone string and causes the error:

// gets the fully qualified resource path for the cluster
func (p *GKEClusterProvider) getClusterResourcePath() string {
    return fmt.Sprintf("projects/%s/locations/%s/clusters/%s",
        p.metadata.GetProjectID(), p.metadata.GetMasterZone(), p.metadata.GetClusterID())
}
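
With the zone empty, the generated parent has an empty locations segment, e.g. (project and cluster names are placeholders):

projects/my-project/locations//clusters/my-cluster

The ListNodePools call then rejects this with Location "" does not exist, which matches the log above.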

We're using md.client.InstanceAttributeValue("kube-env") to get the GCP zone/location:

func (md *GKEMetaData) GetMasterZone() string {
    z, ok := md.cache[GKEMetaDataMasterZoneKey]
    if ok {
        return z
    }

    results, err := md.client.InstanceAttributeValue("kube-env")
    if err != nil {
        klog.V(1).Infof("[Error] %s", err.Error())
        return ""
    }

Possible cause

This may not be caused by the absence of kube-env metadata, but rather by a lack of access to it. GKE offers "metadata concealment", which specifically calls out kube-env as data to be hidden. kube-env is also mentioned under the "SECURE" setting of GKE's NodeMetadata config.
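
One way to confirm whether this is concealment rather than a genuinely missing attribute is to query both attributes from a pod on an affected cluster. A minimal, illustrative sketch using the cloud.google.com/go/compute/metadata package (not part of turndown; attribute names as discussed in this issue):

package main

import (
    "fmt"

    "cloud.google.com/go/compute/metadata"
)

func main() {
    // Query each attribute from the node's metadata server. An error for
    // kube-env but not for cluster-location would point at metadata
    // concealment rather than a missing attribute.
    for _, attr := range []string{"kube-env", "cluster-location"} {
        val, err := metadata.InstanceAttributeValue(attr)
        if err != nil {
            fmt.Printf("instance/attributes/%s: error: %v\n", attr, err)
            continue
        }
        fmt.Printf("instance/attributes/%s: %d bytes\n", attr, len(val))
    }
}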

Possible solution

The reporting user has suggested a different attribute to use: cluster-location

curl -L -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/cluster-location
europe-west2

If this is a stable attribute provided by GKE-provisioned VMs, this probably works. We could also investigate using v1/instance/zone as an alternative; it appears to be officially guaranteed on all GCP VMs. Other stable sources of node(pool) location information may be preferable; I just haven't dug deep enough to find them yet.
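
For illustration, a GetMasterZone that reads cluster-location instead of kube-env might look roughly like this (sketch only, reusing the cache key, client, and logging from the existing implementation; not tested):

// Sketch: resolve the cluster location from the cluster-location attribute,
// which holds the zone for zonal clusters and the region for regional ones.
func (md *GKEMetaData) GetMasterZone() string {
    if z, ok := md.cache[GKEMetaDataMasterZoneKey]; ok {
        return z
    }

    loc, err := md.client.InstanceAttributeValue("cluster-location")
    if err != nil {
        klog.V(1).Infof("[Error] %s", err.Error())
        return ""
    }

    md.cache[GKEMetaDataMasterZoneKey] = loc
    return loc
}

Note that for regional clusters this attribute returns a region rather than a zone (the user's curl above shows europe-west2), so callers that expect a zone may need to account for that.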

Other considerations

It is currently unclear whether this affects all GKE environments or only those with a certain version, region, or configuration (e.g. metadata concealment enabled). Any fix here should be tested on earlier GKE versions to ensure compatibility.

@michaelmdresser We need a priority status on this. Can it wait till v1.98 or does it need to go into v1.97?

@michaelmdresser One thing I wanted to point out here: it is important to get this right for multi-zone clusters (we need the zone of the "master"). I didn't know about cluster-location, but that seems like it could be adequate. I'm pretty sure that v1/instance/zone just gives you the local zone (relative to the node/pod).

I think the concealment of kube-env is because of this exploit, so we'll definitely want to find a way around it.

And yes, it looks like cluster-location is likely what we want: https://cloud.google.com/kubernetes-engine/docs/concepts/workload-identity

I think this is one of those things where, if concealment is enabled, we can expect the new version of workload-identity. So maybe we can fall back to the "new" approach if the old approach doesn't work.
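
As a sketch of that fallback ordering (the helper and parser names here are placeholders, not existing code): keep kube-env as the primary source and only consult cluster-location when it is unavailable.

// masterLocation is a hypothetical helper illustrating the fallback:
// prefer the existing kube-env path, and fall back to cluster-location
// when kube-env is missing or concealed.
func (md *GKEMetaData) masterLocation() (string, error) {
    if env, err := md.client.InstanceAttributeValue("kube-env"); err == nil {
        // parseZoneFromKubeEnv stands in for the existing kube-env parsing logic.
        if z := parseZoneFromKubeEnv(env); z != "" {
            return z, nil
        }
    }
    return md.client.InstanceAttributeValue("cluster-location")
}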

Thanks for the extra digging, Bolt! Should be super helpful for a fix.

We need a priority status on this. Can it wait till v1.98 or does it need to go into v1.97?

@Adam-Stack-PM I haven't had time to investigate the impact, so I can't give it a priority. This may be most GKE clusters (high priority, probably 1.97) or very few GKE clusters (on the fence about 1.97 vs. 1.98).

@michaelmdresser, thanks for the context here. I am labeling it P1 for now, with a requirement to understand the impact before releasing v1.97.

I am facing the exact same issue. I noticed that the identified fix has been removed from 1.97.
Was there a reason to remove this from the scope of 1.97? Any impact identified?

Thanks a lot for your support on this.