Compute default service account does not have access to the global-game-images Artifact Registry repo
After following the demo steps I noticed that initially many workloads fail to initialize, because the compute default service account (project_number-compute@developer.gserviceaccount.com) cannot pull the images: it does not have permission to read from this registry. I fixed it manually, but the IAM binding might be worth including in the Terraform configuration.
I have created a PR that gives the Compute service account the Artifact Registry Reader role.
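For reference, a minimal sketch of what such a binding could look like in Terraform (the data source wiring and `var.region` are illustrative, not necessarily what the PR does):

```hcl
# Illustrative sketch: grant the Compute Engine default service account
# read access to the Artifact Registry repository. Names and variables
# are illustrative, not necessarily what the PR uses.
data "google_project" "project" {}

resource "google_artifact_registry_repository_iam_member" "compute_sa_reader" {
  project    = data.google_project.project.project_id
  location   = var.region # illustrative variable
  repository = "global-game-images"
  role       = "roles/artifactregistry.reader"
  member     = "serviceAccount:${data.google_project.project.number}-compute@developer.gserviceaccount.com"
}
```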
Curious about something - is this a role that the compute instance would get by default when you enable GKE? I've never had to manually grant this on any project 🤔 so why did it happen here?
I'm wondering if #162 is actually just hiding a race condition on the GKE cluster, or am I off base?
Actually, let me rephrase -- should the GKE cluster have a depends_on on the K8s API being fully enabled?
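Roughly what I have in mind - a hedged sketch, assuming the project services are managed in the same Terraform configuration (resource and cluster names here are illustrative):

```hcl
# Illustrative sketch: make cluster creation wait until the Kubernetes
# Engine API is fully enabled, rather than racing it.
resource "google_project_service" "container" {
  service            = "container.googleapis.com"
  disable_on_destroy = false
}

resource "google_container_cluster" "game_cluster" {
  name               = "global-game-us-central1-01" # illustrative name
  location           = "us-central1"
  initial_node_count = 1

  depends_on = [google_project_service.container]
}
```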
@AlexBulankou can you share the exact input and output you were getting, please? Was it an error in the Terraform, a specific image, or something else?
I did not get any deployment errors, but the container could not pull the image before I added access explicitly. Not an expert, but intuitively I would be surprised if a newly created registry granted the compute default service account access by default, because it would mean that any cluster in the project has this access by default; I'm not sure that is desired behavior for many organizations (vs. granting a dedicated service account access to a given registry).
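To illustrate the alternative I mean, something like a dedicated service account scoped to just this repository - a sketch only; the names and `var.region` are made up, and the cluster's nodes would then need to run as this account via the node config:

```hcl
# Illustrative sketch: a dedicated reader service account for the
# global-game-images repository instead of the project-wide default.
resource "google_service_account" "game_images_reader" {
  account_id   = "game-images-reader" # made-up name
  display_name = "Reader for global-game-images"
}

resource "google_artifact_registry_repository_iam_member" "reader" {
  location   = var.region # illustrative variable
  repository = "global-game-images"
  role       = "roles/artifactregistry.reader"
  member     = "serviceAccount:${google_service_account.game_images_reader.email}"
}
```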
I think this is fixed now, but to confirm:
I did not get any deployment errors, but the container could not pull the image
Sorry, not sure I'm following - containers don't pull images. Were you seeing Image Pull Backoffs in your GKE clusters? If so, which clusters? All of them? Some of them?
Which workloads, which Deployments, which clusters? Did some work and others not? Screenshots and details here would be very useful.
Were you seeing Image Pull Backoffs in your GKE clusters? If so, which clusters? All of them? Some of them?
Yes. I was seeing it on game server workloads; I did not check whether it was all of them or only some. Here's an example:
{
  "insertId": "wlovxtp97bip59w8",
  "jsonPayload": {
    "_GID": "0",
    "PRIORITY": "6",
    "_PID": "1790",
    "SYSLOG_IDENTIFIER": "kubelet",
    "_SYSTEMD_UNIT": "kubelet.service",
    "_MACHINE_ID": "c6aa1e71abcbcf4326b3fdcbf82684e1",
    "_SYSTEMD_INVOCATION_ID": "12fccd8e939940818873f98ba85e7ae0",
    "_CAP_EFFECTIVE": "1ffffffffff",
    "_BOOT_ID": "0a3608b3b8544bf7b2f9fb860e66d631",
    "_UID": "0",
    "_SYSTEMD_CGROUP": "/system.slice/kubelet.service",
    "_SYSTEMD_SLICE": "system.slice",
    "_TRANSPORT": "stdout",
    "_COMM": "kubelet",
    "MESSAGE": "E0325 18:59:44.857798 1790 pod_workers.go:951] \"Error syncing pod, skipping\" err=\"failed to \\\"StartContainer\\\" for \\\"droidshooter\\\" with ImagePullBackOff: \\\"Back-off pulling image \\\\\\\"us-docker.pkg.dev/alexbu-gke-dev/global-game-images/droidshooter-server:b40b146a-8390-4569-abd7-abd5c509b1ec\\\\\\\"\\\"\" pod=\"default/droidshooter-bzlbw-qpjqv\" podUID=8d5da6d4-68d1-4c84-85d2-8407a9581739",
    "_HOSTNAME": "gk3-global-game-us-centr-nap-10413t6d-18671094-nxpq",
    "_CMDLINE": "/home/kubernetes/bin/kubelet --v=2 --cloud-provider=gce --experimental-mounter-path=/home/kubernetes/containerized_mounter/mounter --cert-dir=/var/lib/kubelet/pki/ --kubeconfig=/var/lib/kubelet/kubeconfig --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256 --max-pods=32 --volume-plugin-dir=/home/kubernetes/flexvolume --node-status-max-images=25 --container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock --runtime-cgroups=/system.slice/containerd.service --registry-qps=10 --registry-burst=20 --config /home/kubernetes/kubelet-config.yaml \"--pod-sysctls=net.core.somaxconn=1024,net.ipv4.conf.all.accept_redirects=0,net.ipv4.conf.all.forwarding=1,net.ipv4.conf.all.route_localnet=1,net.ipv4.conf.default.forwarding=1,net.ipv4.ip_forward=1,net.ipv4.tcp_fin_timeout=60,net.ipv4.tcp_keepalive_intvl=60,net.ipv4.tcp_keepalive_probes=5,net.ipv4.tcp_keepalive_time=300,net.ipv4.tcp_rmem=4096 87380 6291456,net.ipv4.tcp_syn_retries=6,net.ipv4.tcp_tw_reuse=0,net.ipv4.tcp_wmem=4096 16384 4194304,net.ipv4.udp_rmem_min=4096,net.ipv4.udp_wmem_min=4096,net.ipv6.conf.all.disable_ipv6=1,net.ipv6.conf.default.accept_ra=0,net.ipv6.conf.default.disable_ipv6=1,net.netfilter.nf_conntrack_generic_timeout=600,net.netfilter.nf_conntrack_tcp_be_liberal=1,net.netfilter.nf_conntrack_tcp_timeout_close_wait=3600,net.netfilter.nf_conntrack_tcp_timeout_established=86400\" --pod-infra-container-image=gke.gcr.io/pause:3.6@sha256:10008c36b4611b44db1229451675d5d7d86c7ddf4ef00f883d806a01547203f6",
    "_STREAM_ID": "1423d9289b624b53b7196a781694f575",
    "_EXE": "/home/kubernetes/bin/kubelet",
    "SYSLOG_FACILITY": "3"
  },
  "resource": {
    "type": "k8s_node",
    "labels": {
      "node_name": "gk3-global-game-us-centr-nap-10413t6d-18671094-nxpq",
      "cluster_name": "global-game-us-central1-02",
      "location": "us-central1",
      "project_id": "alexbu-gke-dev"
    }
  },
  "timestamp": "2023-03-25T18:59:44.857881Z",
  "logName": "projects/alexbu-gke-dev/logs/kubelet",
  "receiveTimestamp": "2023-03-25T18:59:49.792357873Z"
}
{
  "insertId": "ezoa0uf99z2sz",
  "jsonPayload": {
    "kind": "Event",
    "apiVersion": "v1",
    "reportingInstance": "",
    "eventTime": null,
    "message": "Error: ImagePullBackOff",
    "reason": "Failed",
    "type": "Warning",
    "source": {
      "host": "gke-global-game-eu-west1-01-default-edbb1dd5-bdf8",
      "component": "kubelet"
    },
    "involvedObject": {
      "fieldPath": "spec.containers{droidshooter}",
      "uid": "8c645eaf-4f8f-4a9c-a467-e60a152aeb69",
      "name": "droidshooter-nmlfb-j9xwn",
      "kind": "Pod",
      "resourceVersion": "1774080",
      "apiVersion": "v1",
      "namespace": "default"
    },
    "lastTimestamp": "2023-03-25T18:59:45Z",
    "metadata": {
      "name": "droidshooter-nmlfb-j9xwn.174fbea1220c6344",
      "creationTimestamp": "2023-03-25T18:59:45Z",
      "namespace": "default",
      "resourceVersion": "38876",
      "managedFields": [
        {
          "fieldsV1": {
            "f:involvedObject": {},
            "f:type": {},
            "f:source": {
              "f:component": {},
              "f:host": {}
            },
            "f:lastTimestamp": {},
            "f:count": {},
            "f:reason": {},
            "f:firstTimestamp": {},
            "f:message": {}
          },
          "manager": "kubelet",
          "fieldsType": "FieldsV1",
          "operation": "Update",
          "apiVersion": "v1",
          "time": "2023-03-25T18:59:45Z"
        }
      ],
      "uid": "5d3d7ffd-d898-47da-b451-a57097419750"
    },
    "reportingComponent": ""
  },
  "resource": {
    "type": "k8s_pod",
    "labels": {
      "project_id": "alexbu-gke-dev",
      "location": "europe-west1",
      "namespace_name": "default",
      "cluster_name": "global-game-eu-west1-01",
      "pod_name": "droidshooter-nmlfb-j9xwn"
    }
  },
  "timestamp": "2023-03-25T18:59:45Z",
  "severity": "WARNING",
  "logName": "projects/alexbu-gke-dev/logs/events",
  "receiveTimestamp": "2023-03-25T18:59:45.747550125Z"
}
So if I look at my global-game-images registry, I see this permission on it: [screenshot of the registry's IAM permissions]
Looking at your project, I see the same permissions set on that registry - so the compute service account should be able to read from the registry.
Looking at the permissions on the compute service account in my project, I see: [screenshot of the compute service account's permissions]
Weirdly, when I look at your compute service account... it doesn't match this; it's missing the one highlighted here.
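If it helps to compare the two projects side by side, the repository's policy can also be read back out of Terraform - a sketch only, assuming the provider's IAM policy data source for Artifact Registry repositories, with illustrative variables:

```hcl
# Illustrative sketch: surface the repository's IAM policy as an output
# so the two projects can be diffed.
data "google_artifact_registry_repository_iam_policy" "images" {
  project    = var.project_id # illustrative variable
  location   = var.region     # illustrative variable
  repository = "global-game-images"
}

output "global_game_images_iam" {
  value = data.google_artifact_registry_repository_iam_policy.images.policy_data
}
```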
Since we've merged #162, is that fixed now?
I'm also wondering what extra org policies you may have in effect that are different from a "standard" GCP project.