GoogleCloudPlatform/gke-autoneg-controller

Observing ACCESS_TOKEN_SCOPE_INSUFFICIENT when creating service

Closed this issue · 11 comments

I have a service defined with

apiVersion: v1
kind: Service
metadata:
  name: frontend-svc
  annotations:
    cloud.google.com/neg: '{"exposed_ports": {"443":{}}}'
    controller.autoneg.dev/neg: '{"backend_services":{"443":[{"name":"https-be","max_connections_per_endpoint":1000}]}}'
spec:
  selector:
    app: frontend-app
  type: NodePort
  ports:
    - protocol: TCP
      port: 443
      targetPort: 3000

When I run the kubectl command to create it, I observe the following events:

Events:
  Type     Reason        Age                 From                Message
  ----     ------        ----                ----                -------
  Normal   Sync          32s                 autoneg-controller  Synced NEGs for "default/frontend-svc" as backends to backend service "https-be" (port 443)
  Normal   Create        23s                 neg-controller      Created NEG "k8s1-1f4ed5c4-default-frontend-svc-443-9757dbe8" for default/frontend-svc-k8s1-1f4ed5c4-default-frontend-svc-443-9757dbe8--/443-3000-GCE_VM_IP_PORT-L7 in "us-central1-f".
  Warning  BackendError  11s (x13 over 32s)  autoneg-controller  googleapi: Error 403: Request had insufficient authentication scopes.
Details:
[
  {
    "@type": "type.googleapis.com/google.rpc.ErrorInfo",
    "domain": "googleapis.com",
    "metadatas": {
      "method": "compute.v1.BackendServicesService.Get",
      "service": "compute.googleapis.com"
    },
    "reason": "ACCESS_TOKEN_SCOPE_INSUFFICIENT"
  }
]

However, I can see that the autoneg IAM role includes the permission needed for this operation:

$ gcloud iam roles describe autoneg --project=$PROJECT_ID
etag: REDACTED
includedPermissions:
- compute.backendServices.get
- compute.backendServices.update
- compute.healthChecks.useReadOnly
- compute.networkEndpointGroups.use
- compute.regionBackendServices.get
- compute.regionBackendServices.update
- compute.regionHealthChecks.useReadOnly
name: projects/${PROJECT_ID}/roles/autoneg
stage: ALPHA
title: autoneg
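
In case it matters, here's roughly how I plan to double-check that this custom role is actually bound to the Autoneg service account (a sketch; I haven't confirmed the binding yet):

gcloud projects get-iam-policy ${PROJECT_ID} \
    --flatten="bindings[].members" \
    --filter="bindings.members:autoneg-system@${PROJECT_ID}.iam.gserviceaccount.com" \
    --format="table(bindings.role)"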

Any suggestions on how to debug and resolve this? What makes it acutely frustrating is that there is no mention of these IAM issues in any of the GCP, GKE, or Autoneg docs, or in community forums.

Pinging a few top contributors to get some 👀 on this @rosmo @fdfzcq

rosmo commented

A few questions:

  1. How did you deploy Autoneg?
  2. Can you check the autoneg-controller-manager logs for context around the error?
  3. Do you have Workload Identity enabled on your cluster (and a proper mapping between the KSA and the GCP service account, e.g. the annotation on the KSA and the IAM binding for the SA)?
  4. Is there a backend service called https-be, etc.? (Does all the config make sense?)
  1. I deployed Autoneg with
PROJECT_ID=${PROJECT_ID} deploy/workload_identity.sh  # runs a few gcloud commands

kubectl apply -f deploy/autoneg.yaml

kubectl annotate sa -n autoneg-system autoneg-controller-manager \
  iam.gke.io/gcp-service-account=autoneg-system@${PROJECT_ID}.iam.gserviceaccount.com
  2. Here's what I could gather from the manager container logs:
nathan:hello-cluster$ kubectl -n=autoneg-system logs autoneg-controller-manager-f5ddc69b8-vtpw5 -c manager

1.6807986132289813e+09	INFO	controller-runtime.metrics	Metrics server is starting to listen	{"addr": "127.0.0.1:8080"}
1.6807986132294307e+09	INFO	setup	starting manager
1.6807986132303178e+09	INFO	Starting server	{"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:8080"}
1.6807986132304182e+09	INFO	Starting server	{"kind": "health probe", "addr": "[::]:8081"}
I0406 16:30:13.230520       1 leaderelection.go:248] attempting to acquire leader lease autoneg-system/9fe89c94.controller.autoneg.dev...
I0406 16:30:13.240961       1 leaderelection.go:258] successfully acquired lease autoneg-system/9fe89c94.controller.autoneg.dev
1.6807986132411487e+09	DEBUG	events	Normal	{"object": {"kind":"Lease","namespace":"autoneg-system","name":"9fe89c94.controller.autoneg.dev","uid":"ba03cd7e-28e5-40e8-ac14-69c65f3b5341","apiVersion":"coordination.k8s.io/v1","resourceVersion":"7573723"}, "reason": "LeaderElection", "message": "autoneg-controller-manager-f5ddc69b8-vtpw5_3798322b-67e3-4fde-bb42-a672bc56ecca became leader"}
1.6807986132413092e+09	INFO	Starting EventSource	{"controller": "service", "controllerGroup": "", "controllerKind": "Service", "source": "kind source: *v1.Service"}
1.680798613241329e+09	INFO	Starting Controller	{"controller": "service", "controllerGroup": "", "controllerKind": "Service"}
1.6807986134227598e+09	INFO	Starting workers	{"controller": "service", "controllerGroup": "", "controllerKind": "Service", "worker count": 1}
1.680799113217557e+09	INFO	Applying intended status	{"controller": "service", "controllerGroup": "", "controllerKind": "Service", "service": {"name":"frontend-svc","namespace":"default"}, "namespace": "default", "name": "frontend-svc", "reconcileID": "58153b0f-7640-4a7c-995c-37a708c11a9a", "service": "default/frontend-svc", "status": {"backend_services":{"443":{"https-be":{"name":"https-be","max_connections_per_endpoint":1000}},"80":{"http-be":{"name":"http-be","max_rate_per_endpoint":100}}},"network_endpoint_groups":{"443":"k8s1-1f4ed5c4-default-frontend-svc-443-9757dbe8"},"zones":["us-central1-f"]}}
1.6807991133377185e+09	ERROR	Reconciler error	{"controller": "service", "controllerGroup": "", "controllerKind": "Service", "service": {"name":"frontend-svc","namespace":"default"}, "namespace": "default", "name": "frontend-svc", "reconcileID": "58153b0f-7640-4a7c-995c-37a708c11a9a", "error": "googleapi: Error 403: Request had insufficient authentication scopes.\nDetails:\n[\n  {\n    \"@type\": \"type.googleapis.com/google.rpc.ErrorInfo\",\n    \"domain\": \"googleapis.com\",\n    \"metadatas\": {\n      \"method\": \"compute.v1.BackendServicesService.Get\",\n      \"service\": \"compute.googleapis.com\"\n    },\n    \"reason\": \"ACCESS_TOKEN_SCOPE_INSUFFICIENT\"\n  }\n]\n\nMore details:\nReason: insufficientPermissions, Message: Insufficient Permission\n"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.1/pkg/internal/controller/controller.go:273
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.1/pkg/internal/controller/controller.go:234
  3. Yes. To enable Workload Identity I ran the command below. How can I check whether the annotation on the KSA and the GSA are properly bound/mapped? (I've put a sketch of what I'm checking right after this list.)
gcloud container clusters update hello-cluster \
    --zone=$DEFAULT_ZONE \
    --workload-pool=$PROJECT_ID.svc.id.goog
  4. I don't see https-be when I run gcloud compute backend-services list.
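
For reference, here's roughly how I'm trying to verify the KSA/GSA mapping myself, based on the names from my deploy commands above (a sketch, so corrections welcome):

# Does the KSA annotation point at the GCP service account?
kubectl get serviceaccount autoneg-controller-manager -n autoneg-system \
  -o jsonpath='{.metadata.annotations.iam\.gke\.io/gcp-service-account}'

# Does the GSA allow the KSA to impersonate it via Workload Identity?
# (I'd expect a roles/iam.workloadIdentityUser binding for
#  serviceAccount:${PROJECT_ID}.svc.id.goog[autoneg-system/autoneg-controller-manager])
gcloud iam service-accounts get-iam-policy \
    autoneg-system@${PROJECT_ID}.iam.gserviceaccount.com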
rosmo commented

The backend service needs to be created beforehand and outside of Autoneg (e.g. manually, via gcloud, or by Terraform).

I see. Do you have some template gcloud compute backend-services create commands I could try? Can you also add some notes to the Autoneg README about this and the order of operations?

It's a bit confusing because the GCP docs don't reference Autoneg anywhere, and there are countably infinite ways of configuring load balancers + backends + ingresses + NEGs + instance groups + IAM rules + ... + etc. What I usually do is:

  1. Create a deployment with kubectl
  2. Create a service with kubectl
  3. Create a managed cert with kubectl (if a new one is needed, usually accompanied by a new Cloud DNS A record beforehand)
  4. Create or update an ingress with a path to the k8s service using kubectl
    • This step automatically creates a load balancer associated with any DNS + cert
    • It also creates a backend service with a network endpoint group
    • But it does not reliably or automatically create the network endpoint(s) for the backend service

My guess is that I would need to create a new backend service before I perform step (4) and then do another manual step of configuring the load balancer created in that step to use the backend service I created manually (unless Autoneg does this for me).
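
If it helps to be concrete, here's the kind of thing I imagine that manual step would look like (a rough sketch with placeholder names such as frontend-hc; I haven't verified it against my setup):

# health check for the pods serving on targetPort 3000
gcloud compute health-checks create https frontend-hc \
    --port=3000 --request-path=/

# the backend service that Autoneg would then attach NEGs to
# (protocol depends on what the pods actually serve on port 3000)
gcloud compute backend-services create https-be \
    --global \
    --protocol=HTTPS \
    --health-checks=frontend-hc \
    --load-balancing-scheme=EXTERNAL_MANAGED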

rosmo commented

If you can leverage the GKE Ingress notation, I suggest you use that. Autoneg was mainly created for two situations: one where a different team manages the load balancer components, and another where people want to use features that aren't available in the GKE Ingress controller.

For greater context, my use case is that I'm using gcloud container clusters create (GKE Standard mode) as opposed to gcloud container clusters create-auto (GKE Autopilot mode). My current issue with Standard mode is that when I create a GKE Ingress, the network endpoints (node, pod IP, and port) are not automatically created in the load balancer's (LB) network endpoint group.

So, for example, consider the frontend pod you see below:

nathan:hello-cluster$ k get po -o wide
NAME                             READY   STATUS    RESTARTS   AGE     IP          NODE                                             NOMINATED NODE   READINESS GATES
frontend-app-7fc967db4b-7m5tf    1/1     Running   0          15d     10.80.2.4   gke-hello-cluster-default-pool-a7743c1e-p8q4     <none>           <none>
hello-app-b5cd5796b-dn9ml        1/1     Running   0          15d     10.80.0.7   gke-hello-cluster-default-pool-a7743c1e-9bql     <none>           <none>
streamlit-app-565f54d89b-z92cw   1/1     Running   0          5d18h   10.80.3.2   gke-hello-cluster-analytics-pool-53ce8fb0-x2g7   <none>           <none>

Today I have to manually create the network endpoint that lets clients connecting through the LB reach this pod, like so:

[screenshot: manually adding a network endpoint to the NEG in the Cloud Console]

which is obviously not best practice. Without a network endpoint properly mapping pod, port, and node for any given NEG, clients hitting my load balancer observe HTTP 502: failed_to_pick_backend.
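
(For reference, the gcloud equivalent of that console step, using the NEG, node, and pod IP from above, is roughly the following; I normally do it in the console, so treat this as a sketch:)

gcloud compute network-endpoint-groups update k8s1-1f4ed5c4-default-frontend-svc-443-9757dbe8 \
    --zone=us-central1-f \
    --add-endpoint="instance=gke-hello-cluster-default-pool-a7743c1e-p8q4,ip=10.80.2.4,port=3000"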

You mention the GKE Ingress notation and the GKE Ingress controller. Here's what my Ingress definition currently looks like:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: frontend-app
  annotations:
    kubernetes.io/ingress.global-static-ip-name: "frontend-static-ip"
    networking.gke.io/managed-certificates: frontend-managed-cert
    kubernetes.io/ingress.class: "gce"
spec:
  defaultBackend:
    service:
      name: frontend-svc
      port:
        number: 443

Do you see anything missing from my YAML that is available off-the-shelf with GKE and would address the network endpoint issue I'm experiencing, but that I'm not using?

It could be that I'm not using Autoneg for its intended purposes, since my situation isn't in your list. But I haven't found a working solution to this with GKE alone (unless a solution exists that isn't documented).

rosmo commented

Do you have the cloud.google.com/neg: '{"ingress": true}' annotation on your frontend-svc service? Although I think this should not be required on GKE 1.17+ as per the documentation here: https://cloud.google.com/kubernetes-engine/docs/concepts/ingress#container-native_load_balancing

You might also be required to use a NodePort service. It also takes a while for all the necessary components to be created (5 minutes+).
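
One way to see whether the NEGs have been created yet is the neg-status annotation that GKE adds to the service once they exist, e.g. something like:

kubectl get svc frontend-svc \
  -o jsonpath='{.metadata.annotations.cloud\.google\.com/neg-status}'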

rosmo commented

Also you might consider the Gateway resource as well: https://cloud.google.com/kubernetes-engine/docs/how-to/deploying-gateways

Thanks! I added that annotation to the YAML spec for the service (configured as NodePort) but am still seeing the 403: Request had insufficient authentication scopes error. Here's what my frontend-svc now looks like:

apiVersion: v1
kind: Service
metadata:
  name: frontend-svc
  annotations:
    cloud.google.com/neg: '{"ingress": true, "exposed_ports": {"443":{}}}'
    controller.autoneg.dev/neg: '{"backend_services":{"443":[{"name":"frontend-https-be","max_connections_per_endpoint":1000}]}}'
spec:
  selector:
    app: frontend-app
  type: NodePort
  ports:
    - protocol: TCP
      port: 443
      targetPort: 3000

and here are the events when I kubectl describe svc frontend-svc:

  Type     Reason        Age                 From                Message
  ----     ------        ----                ----                -------
  Normal   Sync          38s                 autoneg-controller  Synced NEGs for "default/frontend-svc" as backends to backend service "frontend-https-be" (port 443)
  Warning  BackendError  13s (x13 over 38s)  autoneg-controller  googleapi: Error 403: Request had insufficient authentication scopes.
Details:
[
  {
    "@type": "type.googleapis.com/google.rpc.ErrorInfo",
    "domain": "googleapis.com",
    "metadatas": {
      "method": "compute.v1.BackendServicesService.Get",
      "service": "compute.googleapis.com"
    },
    "reason": "ACCESS_TOKEN_SCOPE_INSUFFICIENT"
  }
]

More details:
Reason: insufficientPermissions, Message: Insufficient Permission

I understand that Gateway in GCP is an evolution of Ingress, although it appears to be in v1beta1. I have the most experience working with ingresses (though on other platforms such as AWS and a few on-prem k8s clusters).

Turns out I didn't need Autoneg after all. Resolved with: https://stackoverflow.com/a/76040721/1773216