dask-kubernetes-operator-role-cluster clusterrole does not have the needed ACL against pods/portforward resource
Describe the issue:
The dask-kubernetes-operator pod shows a 403 Forbidden error when trying to access the Kubernetes API. It does not appear to have the necessary ClusterRole permissions:
[2024-10-08 21:48:24,704] httpx [INFO ] HTTP Request: GET https://10.233.0.1/api/v1/namespaces/MYNAMESPACE/pods/MYPOD/portforward?name=MYPOD&namespace=MYNAMESPACE&ports=80&_preload_content=false " HTTP/1.1 403 Forbidden"
Exec'ing into the pod and making the same call against the API reproduces the error:
kubectl exec -it -n dask-system dask-kubernetes-operator-78d4b784cf-4r455 -- sh
$ SERVICEACCOUNT=/var/run/secrets/kubernetes.io/serviceaccount
$ NAMESPACE=$(cat ${SERVICEACCOUNT}/namespace)
$ TOKEN=$(cat ${SERVICEACCOUNT}/token)
$ CACERT=${SERVICEACCOUNT}/ca.crt
$ curl --cacert ${CACERT} --header "Authorization: Bearer ${TOKEN}" -X GET 'https://10.233.0.1/api/v1/namespaces/MYNAMESPACE/pods/MYPOD/portforward?name=MYPOD&namespace=MYNAMESPACE&ports=80&_preload_content=false'
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {},
"status": "Failure",
"message": "pods \"MYPOD\" is forbidden: User \"system:serviceaccount:dask-system:dask-kubernetes-operator\" cannot get resource \"pods/portforward\" in API group \"\" in the namespace \"MYNAMESPACE\"",
"reason": "Forbidden",
"details": {
"name": "MYPOD",
"kind": "pods"
},
"code": 403
}
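The same permission gap can also be confirmed from outside the pod with `kubectl auth can-i`, impersonating the operator's service account (the namespace placeholder mirrors the logs above):

```shell
# Ask the API server whether the operator's service account may GET the
# pods/portforward subresource; prints "yes" or "no".
kubectl auth can-i get pods/portforward \
  --as=system:serviceaccount:dask-system:dask-kubernetes-operator \
  -n MYNAMESPACE
```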
Editing the ClusterRole,
$ kubectl edit clusterrole -n dask-system dask-kubernetes-operator-role-cluster
adding pods/portforward to the resources of the rule that already grants access to pods, and then restarting the application pod corrected the problem.
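For reference, the edited rule would look roughly like the sketch below. Adding pods/portforward is the fix described above; the exact apiGroups, sibling resources, and verbs in the operator's shipped ClusterRole are assumptions here and may differ in your chart version:

```yaml
# Hedged sketch of the relevant ClusterRole rule after the edit.
- apiGroups: [""]
  resources: ["pods", "pods/status", "pods/portforward"]  # pods/portforward added
  verbs: ["get", "list", "watch"]
```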
Environment:
- Dask version: dask-kubernetes-operator-2024.5.0
- Python version:
- Operating System: Rocky 8
- Install method (conda, pip, source): helm chart
Thanks for raising this. I wouldn't necessarily expect the controller Pod to be opening port forwards to the scheduler Pods, so there may be a deeper issue going on. Generally the controller will attempt to connect directly to the scheduler Pod, and that may be failing for some reason and so it is falling back to a port forward.
Could you check your logs for other failing connection messages?
Thanks @jacobtomlinson.
The following also appears in the operator pod log:
[2024-10-08 21:46:04,848] kopf.objects [ERROR ] [MYNAMESPACE/MYPOD_SHORTNAME_autoscaler] Timer 'daskautoscaler_adapt' failed with an exception. Will retry.
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 276, in execute_handler_once
result = await invoke_handler(
File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 371, in invoke_handler
result = await invocation.invoke(
File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/invocation.py", line 116, in invoke
result = await fn(**kwargs) # type: ignore
File "/usr/local/lib/python3.10/site-packages/dask_kubernetes/operator/controller/controller.py", line 812, in daskautoscaler_adapt
desired_workers = await get_desired_workers(
File "/usr/local/lib/python3.10/site-packages/dask_kubernetes/operator/controller/controller.py", line 520, in get_desired_workers
async with session.get(url) as resp:
File "/usr/local/lib/python3.10/site-packages/aiohttp/client.py", line 1197, in __aenter__
self._resp = await self._coro
File "/usr/local/lib/python3.10/site-packages/aiohttp/client.py", line 608, in _request
await resp.start(conn)
File "/usr/local/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 976, in start
message, payload = await protocol.read() # type: ignore[union-attr]
File "/usr/local/lib/python3.10/site-packages/aiohttp/streams.py", line 640, in read
await self._waiter
aiohttp.client_exceptions.ServerDisconnectedError: Server disconnected
Yeah I'm not surprised by that one. We have three levels of fallback when communicating with the scheduler:
- HTTP request to the scheduler dashboard (this is tried first, but the dashboard is often disabled by default, which results in the aiohttp error above)
- Open an RPC to the scheduler Pod directly
- Open a port-forward and connect the RPC over that connection
Your initial message is failing on that last step. But I'm curious why the middle step is failing at all.
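The three-level fallback described above can be sketched as follows. This is a simplified, hypothetical illustration, not the actual dask_kubernetes controller code; the strategy functions are stubs, with the first two raising to simulate the failures seen in the logs:

```python
import asyncio

# Hypothetical stubs for the three connection strategies, in fallback order.
async def try_dashboard_http(scheduler):
    # 1. HTTP request to the scheduler dashboard (often disabled by default).
    raise ConnectionError("dashboard disabled")

async def try_direct_rpc(scheduler):
    # 2. RPC directly to the scheduler Pod.
    raise ConnectionError("direct RPC refused")

async def try_portforward_rpc(scheduler):
    # 3. RPC over a port-forward (this is where the 403 in this issue appears
    #    when the ClusterRole lacks pods/portforward).
    return 4

async def get_desired_workers(scheduler):
    """Try each strategy in order, falling through on connection errors."""
    last_error = None
    for attempt in (try_dashboard_http, try_direct_rpc, try_portforward_rpc):
        try:
            return await attempt(scheduler)
        except ConnectionError as exc:
            last_error = exc
    raise last_error

print(asyncio.run(get_desired_workers("my-scheduler")))  # 4
```

In this sketch, a working direct RPC would mean the port-forward path (and its extra RBAC requirement) is never exercised, which is why the maintainer is asking why the middle step fails.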