Nodes - Pods Tolerations
uriafranko opened this issue · 5 comments
Fixed after manually installing nvidia-device-plugin in the cluster with these settings:
tolerations:
  - key: 'nvidia.com/gpu'
    effect: 'NoSchedule'
    value: 'present'
  - key: 'flyte.org/node-role'
    operator: 'Equal'
    value: 'worker'
    effect: 'NoSchedule'
Now the nvidia.com/gpu resource is exposed, but for some reason the nvidia pods try to init on non-GPU nodes as well...
@uriafranko Fixed. I changed the taints map on the eks module. The labels are still there, since they may be useful for filtering, and the GPU taints are still there too, but there is no longer a flyte-node=worker taint.
Thanks
For people facing the same issue: installing the nvidia-device-plugin is an additional helm install in the k8s cluster (in addition to the flyte installation with helm). Here is how to install it with helm.
To avoid the nvidia-device-plugin being deployed on non-GPU nodes (where its pod shows up in CrashLoopBackOff), you can add a nodeSelector to the nvidia-device-plugin configuration like this:
nodeSelector:
  k8s.amazonaws.com/accelerator: nvidia-tesla-t4 # pick a label which is specific to your GPU nodes, to select them
In the end, my config file (called nvidia-device-plugin-values.yaml) looks like this:
nodeSelector:
  k8s.amazonaws.com/accelerator: nvidia-tesla-t4
tolerations:
  - key: 'nvidia.com/gpu'
    effect: 'NoSchedule'
    value: 'present'
  - key: 'flyte.org/node-role'
    operator: 'Equal'
    value: 'worker'
    effect: 'NoSchedule'
And I triggered the plugin install with that file:
helm install nvdp nvdp/nvidia-device-plugin --version=0.14.1 --namespace nvidia-device-plugin --create-namespace --values=nvidia-device-plugin-values.yaml
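For anyone who wants to check that the GPU is actually schedulable after the install, a throwaway test pod along these lines may help. This is only a sketch: the pod name and file name are hypothetical, the image tag is an assumption (a CUDA vector-add sample image from NVIDIA's registry, so check that it is still published), and the nodeSelector and tolerations simply mirror the values file above, so adjust them to your own labels and taints.

```yaml
# gpu-smoke-test.yaml (hypothetical file name)
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test            # hypothetical name
spec:
  restartPolicy: Never            # run once, do not restart on completion
  nodeSelector:
    k8s.amazonaws.com/accelerator: nvidia-tesla-t4   # same GPU node label as above
  tolerations:                    # same taints the device plugin tolerates
    - key: 'nvidia.com/gpu'
      operator: 'Equal'
      value: 'present'
      effect: 'NoSchedule'
    - key: 'flyte.org/node-role'
      operator: 'Equal'
      value: 'worker'
      effect: 'NoSchedule'
  containers:
    - name: cuda-vectoradd
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2   # assumed image tag
      resources:
        limits:
          nvidia.com/gpu: 1       # only allocatable once the device plugin is running
```

Apply it with kubectl apply -f gpu-smoke-test.yaml. If the pod stays Pending with an "Insufficient nvidia.com/gpu" event, the device plugin is probably not running on the GPU nodes yet.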
Thank you @qchenevier!
I was wondering if you'd like to add these instructions to the tutorial?