dask-kubernetes-operator-role-cluster clusterrole does not have the needed ACL against pods/portforward resource
Describe the issue:
The dask-kubernetes-operator pod shows a 403 Forbidden error when trying to access the Kubernetes API. It does not appear to have the necessary ClusterRole permissions:
[2024-10-08 21:48:24,704] httpx [INFO ] HTTP Request: GET https://10.233.0.1/api/v1/namespaces/MYNAMESPACE/pods/MYPOD/portforward?name=MYPOD&namespace=MYNAMESPACE&ports=80&_preload_content=false " HTTP/1.1 403 Forbidden"
Exec'ing into the pod and making the same call against the API reproduces the error:
kubectl exec -it -n dask-system dask-kubernetes-operator-78d4b784cf-4r455 -- sh
$ SERVICEACCOUNT=/var/run/secrets/kubernetes.io/serviceaccount
$ NAMESPACE=$(cat ${SERVICEACCOUNT}/namespace)
$ TOKEN=$(cat ${SERVICEACCOUNT}/token)
$ CACERT=${SERVICEACCOUNT}/ca.crt
$ curl --cacert ${CACERT} --header "Authorization: Bearer ${TOKEN}" -X GET 'https://10.233.0.1/api/v1/namespaces/MYNAMESPACE/pods/MYPOD/portforward?name=MYPOD&namespace=MYNAMESPACE&ports=80&_preload_content=false'
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {},
"status": "Failure",
"message": "pods \"MYPOD\" is forbidden: User \"system:serviceaccount:dask-system:dask-kubernetes-operator\" cannot get resource \"pods/portforward\" in API group \"\" in the namespace \"MYNAMESPACE\"",
"reason": "Forbidden",
"details": {
"name": "MYPOD",
"kind": "pods"
},
"code": 403
}
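The same permission gap can also be confirmed from outside the pod with `kubectl auth can-i`, impersonating the operator's service account (the namespace placeholder mirrors the logs above):

```shell
# Ask the API server whether the operator's service account may GET the
# pods/portforward subresource; prints "yes" or "no".
kubectl auth can-i get pods/portforward \
  --as=system:serviceaccount:dask-system:dask-kubernetes-operator \
  -n MYNAMESPACE
```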
Editing the ClusterRole,
$ kubectl edit clusterrole -n dask-system dask-kubernetes-operator-role-cluster
adding pods/portforward to the resources of the rule that already grants access to pods, and then restarting the application pod corrected the problem.
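For reference, the edited rule would look roughly like the sketch below. Adding pods/portforward is the fix described above; the exact apiGroups, sibling resources, and verbs in the operator's shipped ClusterRole are assumptions here and may differ in your chart version:

```yaml
# Hedged sketch of the relevant ClusterRole rule after the edit.
- apiGroups: [""]
  resources: ["pods", "pods/status", "pods/portforward"]  # pods/portforward added
  verbs: ["get", "list", "watch"]
```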
Environment:
- Dask version: dask-kubernetes-operator-2024.5.0
- Python version:
- Operating System: Rocky 8
- Install method (conda, pip, source): helm chart
Thanks for raising this. I wouldn't necessarily expect the controller Pod to be opening port forwards to the scheduler Pods, so there may be a deeper issue going on. Generally the controller will attempt to connect directly to the scheduler Pod, and that may be failing for some reason and so it is falling back to a port forward.
Could you check your logs for other failing connection messages?
Thanks @jacobtomlinson.
The following also appears in the operator pod log:
[2024-10-08 21:46:04,848] kopf.objects [ERROR ] [MYNAMESPACE/MYPOD_SHORTNAME_autoscaler] Timer 'daskautoscaler_adapt' failed with an exception. Will retry.
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 276, in execute_handler_once
result = await invoke_handler(
File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 371, in invoke_handler
result = await invocation.invoke(
File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/invocation.py", line 116, in invoke
result = await fn(**kwargs) # type: ignore
File "/usr/local/lib/python3.10/site-packages/dask_kubernetes/operator/controller/controller.py", line 812, in daskautoscaler_adapt
desired_workers = await get_desired_workers(
File "/usr/local/lib/python3.10/site-packages/dask_kubernetes/operator/controller/controller.py", line 520, in get_desired_workers
async with session.get(url) as resp:
File "/usr/local/lib/python3.10/site-packages/aiohttp/client.py", line 1197, in __aenter__
self._resp = await self._coro
File "/usr/local/lib/python3.10/site-packages/aiohttp/client.py", line 608, in _request
await resp.start(conn)
File "/usr/local/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 976, in start
message, payload = await protocol.read() # type: ignore[union-attr]
File "/usr/local/lib/python3.10/site-packages/aiohttp/streams.py", line 640, in read
await self._waiter
aiohttp.client_exceptions.ServerDisconnectedError: Server disconnected
Yeah I'm not surprised by that one. We have three levels of fallback when communicating with the scheduler:
- HTTP request to the scheduler dashboard (this is tried first, but the dashboard is often disabled by default, which results in the aiohttp error above)
- Open an RPC to the scheduler Pod directly
- Open a port-forward and connect the RPC over that connection
Your initial message is failing on that last step. But I'm curious why the middle step is failing at all.
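The three-level fallback described above can be sketched as follows. This is a simplified, hypothetical illustration, not the actual dask_kubernetes controller code; the strategy functions are stubs, with the first two raising to simulate the failures seen in the logs:

```python
import asyncio

# Hypothetical stubs for the three connection strategies, in fallback order.
async def try_dashboard_http(scheduler):
    # 1. HTTP request to the scheduler dashboard (often disabled by default).
    raise ConnectionError("dashboard disabled")

async def try_direct_rpc(scheduler):
    # 2. RPC directly to the scheduler Pod.
    raise ConnectionError("direct RPC refused")

async def try_portforward_rpc(scheduler):
    # 3. RPC over a port-forward (this is where the 403 in this issue appears
    #    when the ClusterRole lacks pods/portforward).
    return 4

async def get_desired_workers(scheduler):
    """Try each strategy in order, falling through on connection errors."""
    last_error = None
    for attempt in (try_dashboard_http, try_direct_rpc, try_portforward_rpc):
        try:
            return await attempt(scheduler)
        except ConnectionError as exc:
            last_error = exc
    raise last_error

print(asyncio.run(get_desired_workers("my-scheduler")))  # 4
```

In this sketch, a working direct RPC would mean the port-forward path (and its extra RBAC requirement) is never exercised, which is why the maintainer is asking why the middle step fails.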