AliyunContainerService/gpushare-device-plugin

How to guarantee the pod to be running after Allocate?

wjliu opened this issue · 5 comments

wjliu commented

Hi @cheyang
I have a question about the logic of picking a pod in the Allocate function.
As I understand it, the Allocate parameters only pass the device IDs to the device plugin and contain nothing about the container or pod. Why does the Allocate function pick a pod and set its ALIYUN_COM_GPU_MEM_ASSIGNED to true? Can it guarantee that this pod will be running immediately after Allocate? If so, how is that achieved?

Maybe this part can help you. https://github.com/AliyunContainerService/gpushare-scheduler-extender/blob/master/docs/designs/designs.md#3-run-the-deployment-on-the-node

We can't guarantee that the pod will be running after Allocate. But we can tell which pod has been processed by the Allocate method, because its ALIYUN_COM_GPU_MEM_ASSIGNED is turned from False to True.
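
As a rough illustration of the flag described above, the check and the flip could be sketched as below. This is only a sketch, not the plugin's actual code: the helper names `isAssigned` and `markAssigned` are hypothetical, the annotations are modeled as a plain `map[string]string`, and the "true"/"false" string values are an assumption based on this thread.

```go
package main

import "strings"

// assignedKey is the annotation named in this thread. Storing it as the
// strings "true"/"false" is an assumption made for this sketch.
const assignedKey = "ALIYUN_COM_GPU_MEM_ASSIGNED"

// isAssigned (hypothetical helper) reports whether the pod's annotations
// say that Allocate has already processed it. A missing key counts as false.
func isAssigned(annotations map[string]string) bool {
	return strings.EqualFold(annotations[assignedKey], "true")
}

// markAssigned (hypothetical helper) returns a copy of the annotations with
// the flag set to "true". The input map is left untouched, so the caller
// decides when to send the update to the API server.
func markAssigned(annotations map[string]string) map[string]string {
	out := make(map[string]string, len(annotations)+1)
	for k, v := range annotations {
		out[k] = v
	}
	out[assignedKey] = "true"
	return out
}
```

Note that the flag only records that Allocate handled the pod. It says nothing about whether the container has actually started.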

wjliu commented

thx

wjliu commented

Hi @cheyang
I read the kubelet source code and found that the Pod has already been determined before the device plugin's Allocate function is called, but only the device ID list is passed to Allocate, with no Pod information. So the Pod selected in the Allocate function of gpushare-device-plugin is not necessarily the Pod selected by the kubelet. In other words, the Pod selected in Allocate is not necessarily the Pod that the kubelet will start next, so the device ID list may be set for the wrong Pod. I still can't understand why this works.
Can you explain it in detail please?

wjliu commented

Is it because the kubelet processes Pods in order, so that the Allocate function can choose the Pod according to its assume time?
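
To make the question concrete, here is a minimal sketch of what "choose the Pod according to assume time" could look like. Everything in it is an assumption for illustration: the `candidate` type and its fields, the `pickPod` function, and the idea of matching the pod's requested GPU memory units against the number of device IDs received by Allocate. It is not taken from the plugin's source.

```go
package main

import "sort"

// candidate is a hypothetical, trimmed-down view of a pending pod on this
// node. None of these field names come from the plugin itself.
type candidate struct {
	Name           string
	Assigned       bool  // parsed ALIYUN_COM_GPU_MEM_ASSIGNED
	AssumeTime     int64 // assumed: timestamp recorded when the pod was assumed
	RequestedUnits int   // assumed: GPU memory units requested by the pod
}

// pickPod returns the unassigned pod whose request size equals the number of
// device IDs passed to Allocate, preferring the earliest assume time.
// The boolean result is false when no pod matches.
func pickPod(pods []candidate, deviceIDs []string) (candidate, bool) {
	var matches []candidate
	for _, p := range pods {
		if !p.Assigned && p.RequestedUnits == len(deviceIDs) {
			matches = append(matches, p)
		}
	}
	if len(matches) == 0 {
		return candidate{}, false
	}
	// Oldest assume time first; a stable sort keeps input order on ties.
	sort.SliceStable(matches, func(i, j int) bool {
		return matches[i].AssumeTime < matches[j].AssumeTime
	})
	return matches[0], true
}
```

Even with this heuristic, the pairing relies on the kubelet handling pods in the same order as their assume times. If two pending pods request the same amount and are processed in a different order, the sketch would pick the wrong one, which is exactly the doubt raised above.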

Hi @cheyang @wjliu

I am having the same doubts that were raised by @wjliu. Can you please shed some light on these?