Turndown fails on GKE due to empty zone string
Observed problem
Turndown fails to run on a GKE cluster with the following config info:
- cluster-turndown-2.0.1
- GKE v1.22.8-gke.202
Logs from user environment:
I0706 21:00:43.033004 1 main.go:118] Running Kubecost Turndown on: REDACTED
I0706 21:00:43.059698 1 validator.go:41] Validating Provider...
I0706 21:00:43.061743 1 gkemetadata.go:92] [Error] metadata: GCE metadata "instance/attributes/kube-env" not defined
I0706 21:00:43.063220 1 gkemetadata.go:92] [Error] metadata: GCE metadata "instance/attributes/kube-env" not defined
I0706 21:00:43.063445 1 namedlogger.go:24] [GKEClusterProvider] Loading node pools for: [ProjectID: REDACTED, Zone: , ClusterID: REDACTED]
I0706 21:00:43.192046 1 validator.go:27] [Error]: Failed to load node groups: rpc error: code = InvalidArgument desc = Location "" does not exist.
Source of the error in code
This "Loading node pools" message, followed by the error comes from here in the GKE provider.
cluster-turndown/pkg/cluster/provider/gkeclusterprovider.go (lines 169 to 183 at c74e3bb)
The request being executed uses a path generator that fills in the empty zone string, causing the error:
cluster-turndown/pkg/cluster/provider/gkeclusterprovider.go (lines 496 to 500 at c74e3bb)
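For illustration, a minimal sketch of how an empty zone propagates into the request (the function name and path shape here are hypothetical stand-ins for the actual generator linked above):

```go
package main

import "fmt"

// getClusterPath is a hypothetical stand-in for the provider's path
// generator: it interpolates project, location, and cluster into a
// GKE API resource name.
func getClusterPath(project, location, cluster string) string {
	return fmt.Sprintf("projects/%s/locations/%s/clusters/%s", project, location, cluster)
}

func main() {
	// When the zone lookup fails, location is "" and the generated
	// path contains an empty segment, which the API rejects with:
	// InvalidArgument: Location "" does not exist.
	fmt.Println(getClusterPath("my-project", "", "my-cluster"))
	// projects/my-project/locations//clusters/my-cluster
}
```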
We're using md.client.InstanceAttributeValue("kube-env") to get the GCP zone/location:
cluster-turndown/pkg/cluster/provider/gkemetadata.go (lines 84 to 94 at c74e3bb)
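Roughly, the lookup looks like this (a sketch using cloud.google.com/go/compute/metadata; the ZONE key scan is illustrative, the exact parsing lives in gkemetadata.go):

```go
package provider

import (
	"log"
	"strings"

	"cloud.google.com/go/compute/metadata"
)

// zoneFromKubeEnv sketches the current approach: fetch the kube-env
// instance attribute and scan it for a zone entry. With metadata
// concealment enabled the attribute is hidden, InstanceAttributeValue
// returns a NotDefinedError, and the caller ends up with an empty zone.
func zoneFromKubeEnv(client *metadata.Client) string {
	kubeEnv, err := client.InstanceAttributeValue("kube-env")
	if err != nil {
		// Matches the logged error: metadata: GCE metadata
		// "instance/attributes/kube-env" not defined
		log.Printf("[Error] %v", err)
		return ""
	}
	for _, line := range strings.Split(kubeEnv, "\n") {
		if strings.HasPrefix(line, "ZONE: ") {
			return strings.TrimPrefix(line, "ZONE: ")
		}
	}
	return ""
}
```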
Possible cause
This may not be caused by the absence of the kube-env metadata, but rather by a lack of access to it. GKE offers "metadata concealment", which specifically calls out kube-env as data to be hidden. kube-env is also mentioned in GKE's NodeMetadata config "SECURE" setting.
Possible solution
The reporting user has suggested a different attribute value to use: cluster-location
curl -L -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/cluster-location
europe-west2
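If we go that route, the Go equivalent of that curl, via the metadata package the provider already uses, would be something like:

```go
package provider

import "cloud.google.com/go/compute/metadata"

// clusterLocation reads the GKE-provided cluster-location instance
// attribute, e.g. "europe-west2" for a regional cluster or
// "europe-west2-a" for a zonal one.
func clusterLocation(client *metadata.Client) (string, error) {
	return client.InstanceAttributeValue("cluster-location")
}
```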
If this is a stable attribute provided by GKE-provisioned VMs, this probably works. We could also investigate using v1/instance/zone as an alternative; it seems to be officially guaranteed on all GCP VMs. Other stable sources of node(pool) location information may be preferable; I just haven't dug deep enough to find them yet.
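For comparison, v1/instance/zone is exposed directly by the same package; note (as discussed in the comments below) that it returns the zone of the VM the code runs on, which may not match the cluster's location:

```go
package provider

import "cloud.google.com/go/compute/metadata"

// localZone returns the zone of the node this process is running on,
// e.g. "us-central1-b". This is not necessarily the cluster's location.
func localZone() (string, error) {
	return metadata.Zone()
}
```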
Other considerations
It is currently unclear whether this affects all GKE environments or only those with a certain version, region, or configuration (e.g. metadata concealment). Any fix here should be tested on earlier GKE versions to ensure compatibility.
@michaelmdresser We need a priority status on this. Can it wait till v1.98 or does it need to go into v1.97?
@michaelmdresser One thing I wanted to point out here: it's important to get this right for multi-zone clusters (we need the zone of the "master"). I didn't know about cluster-location, but that seems like it could be adequate. I'm pretty sure that v1/instance/zone just gives you the local zone (relative to the node/pod).
I think the concealment of kube-env is because of this exploit, so we'll definitely want to find a way around this.
And yes, it looks like cluster-location is likely what we want: https://cloud.google.com/kubernetes-engine/docs/concepts/workload-identity
I think this is one of those things where, if concealment is enabled, we can expect the new version of Workload Identity. So maybe we can fall back to the "new" approach if the old approach doesn't work.
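A minimal sketch of that fallback ordering, reusing the hypothetical helpers sketched above (the real provider wiring will differ):

```go
package provider

import "cloud.google.com/go/compute/metadata"

// location tries the legacy kube-env lookup first and, if concealment
// or Workload Identity hides that attribute, falls back to the
// cluster-location instance attribute.
func location(client *metadata.Client) (string, error) {
	if zone := zoneFromKubeEnv(client); zone != "" {
		return zone, nil
	}
	return client.InstanceAttributeValue("cluster-location")
}
```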
Thanks for the extra digging, Bolt! Should be super helpful for a fix.
> We need a priority status on this. Can it wait till v1.98 or does it need to go into v1.97?
@Adam-Stack-PM I haven't had time to investigate the impact, so I can't give it a priority. This may affect most GKE clusters (high priority, probably 1.97) or very few (on the fence about 1.97 vs. 1.98).
@michaelmdresser, Thanks for the context here. I am labeling it P1 for now with a requirement to understand the impact before releasing v1.97.
I am facing the exact same issue. I noticed that the identified fix has been removed from 1.97. Was there a reason to remove this from the scope of 1.97? Any impact identified? Thanks a lot for your support on this.