AliyunContainerService/gpushare-device-plugin

Concurrently creating sharegpu instances can cause creation to fail

guunergooner opened this issue · 0 comments

What happened:

  • Several sharegpu instances using a large image are created concurrently, and some of them are deleted while the image is still being pulled; the remaining sharegpu instances then fail to be created.

What you expected to happen:

  • Deleting some sharegpu instances while the image is still being pulled should not affect the others; the remaining sharegpu instances should be created successfully.

How to reproduce it (as minimally and precisely as possible):

  • Create several sharegpu instances of a large image concurrently.
  • Delete some of the sharegpu instances while the image is still being pulled.
  • Wait for the image pull to finish; the remaining sharegpu instances fail to start (a rough client-go sketch of these steps follows this list).
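A rough client-go sketch of the three steps above (my own illustration, not the actual workload): the namespace matches the one in the logs below, while the image name, kubeconfig path, pod count, and the 30-second delay are placeholders; the API calls use the pre-1.17 client-go signatures to match the cluster version listed under Environment.

package main

import (
	"fmt"
	"sync"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// gpuSharePod builds a pod that requests 8 units of shared GPU memory
// (the same amount as the failing pod below) from a large image.
func gpuSharePod(name string) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: "k8s-common-ns"},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:  "cuda",
				Image: "registry.example.com/big-cuda-image:latest", // placeholder large image
				Resources: corev1.ResourceRequirements{
					Limits: corev1.ResourceList{
						corev1.ResourceName("aliyun.com/gpu-mem"): resource.MustParse("8"),
					},
				},
			}},
		},
	}
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config") // placeholder kubeconfig path
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	pods := client.CoreV1().Pods("k8s-common-ns")

	// 1. Create several sharegpu pods concurrently; the image is not cached yet,
	//    so all of them sit in image pull for a while.
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			if _, err := pods.Create(gpuSharePod(fmt.Sprintf("sharegpu-repro-%d", i))); err != nil {
				fmt.Println("create failed:", err)
			}
		}(i)
	}
	wg.Wait()

	// 2. Delete a few of them while the image pull is still in progress.
	time.Sleep(30 * time.Second)
	for i := 0; i < 2; i++ {
		if err := pods.Delete(fmt.Sprintf("sharegpu-repro-%d", i), &metav1.DeleteOptions{}); err != nil {
			fmt.Println("delete failed:", err)
		}
	}

	// 3. Once the pull finishes, the surviving pods fail with the
	//    "unknown device id: no-gpu-has-8MiB-to-run" error shown below.
}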

Anything else we need to know?:

  • Error events from kubectl describe pod:
  Warning  Failed     12m                  kubelet, ser-330 Error: failed to start container "k8s-deploy-ubhqko-1592387682017": Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=no-gpu-has-8MiB-to-run --compute --compat32 --graphics --utility --video --display --require=cuda>=9.0 --pid=16101 /data/docker_rt/overlay2/b647088d3759dc873fe4f60ba3b9d9de7eb85578fe17c2b2af177bb49d048450/merged]\\\\nnvidia-container-cli: device error: unknown device id: no-gpu-has-8MiB-to-run\\\\n\\\"\"": unknown

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.5", GitCommit:"20c265fef0741dd71a66480e35bd69f18351daea", GitTreeState:"clean", BuildDate:"2019-10-15T19:16:51Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14-20200217", GitCommit:"883cfa7a769459affa307774b12c9b3e99f4130b", GitTreeState:"clean", BuildDate:"2020-02-17T14:06:28Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
BareMetal User Provided Infrastructure
  • OS (e.g: cat /etc/os-release):
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
  • Kernel (e.g. uname -a):
Linux ser-330 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:
  • Pod metadata annotations:
 $ kubectl -n k8s-common-ns get pods k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl -o json | jq '.metadata.annotations'
{
  "ALIYUN_COM_GPU_MEM_ASSIGNED": "true",
  "ALIYUN_COM_GPU_MEM_ASSUME_TIME": "1592388290278113475",
  "ALIYUN_COM_GPU_MEM_DEV": "24",
  "ALIYUN_COM_GPU_MEM_IDX": "1",
  "ALIYUN_COM_GPU_MEM_POD": "8"
}
  • Last state of the pod's container statuses:
 $ kubectl -n k8s-common-ns get pods k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl -o json | jq '.status.containerStatuses[].lastState'
{
  "terminated": {
    "containerID": "docker://307060463dcf85c135d89abeb50edaa493b5042f47a4d5d74eccc30b71edf245",
    "exitCode": 128,
    "finishedAt": "2020-06-17T10:20:49Z",
    "message": "OCI runtime create failed: container_linux.go:344: starting container process caused \"process_linux.go:424: container init caused \\\"process_linux.go:407: running prestart hook 0 caused \\\\\\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=no-gpu-has-8MiB-to-run --compute --compat32 --graphics --utility --video --display --require=cuda>=9.0 --pid=5008 /data/docker_rt/overlay2/02cda4031418bb8cdf08e94213adb066981257069e48d8369cb3b9ab3e37f274/merged]\\\\\\\\nnvidia-container-cli: device error: unknown device id: no-gpu-has-8MiB-to-run\\\\\\\\n\\\\\\\"\\\"\": unknown",
    "reason": "ContainerCannotRun",
    "startedAt": "2020-06-17T10:20:49Z"
  }
}
  • gpushare scheduler extender log:
[ debug ] 2020/06/17 09:54:43 gpushare-predicate.go:17: check if the pod name k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl can be scheduled on node ser-330
[ debug ] 2020/06/17 09:54:43 gpushare-predicate.go:31: The pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in the namespace k8s-common-ns can be scheduled on ser-330
[ debug ] 2020/06/17 09:54:43 routes.go:121: gpusharingBind ExtenderArgs ={k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl k8s-common-ns 90fddd7e-b080-11ea-9b44-0cc47ab32cea ser-330}
[ debug ] 2020/06/17 09:54:43 nodeinfo.go:143: Allocate() ----Begin to allocate GPU for gpu mem for pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns----
[ debug ] 2020/06/17 09:54:43 nodeinfo.go:220: reqGPU for pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns: 8
[ debug ] 2020/06/17 09:54:43 nodeinfo.go:239: Find candidate dev id 1 for pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns successfully.
[ debug ] 2020/06/17 09:54:43 nodeinfo.go:147: Allocate() 1. Allocate GPU ID 1 to pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns.----
[  info ] 2020/06/17 09:54:43 controller.go:286: Need to update pod name k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns and old status is Pending, new status is Pending; its old annotation map[] and new annotation map[ALIYUN_COM_GPU_MEM_IDX:1 ALIYUN_COM_GPU_MEM_POD:8 ALIYUN_COM_GPU_MEM_ASSIGNED:false ALIYUN_COM_GPU_MEM_ASSUME_TIME:1592387683318737367 ALIYUN_COM_GPU_MEM_DEV:24]
[ debug ] 2020/06/17 09:54:43 nodeinfo.go:179: Allocate() 2. Try to bind pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in k8s-common-ns namespace to node  with &Binding{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl,GenerateName:,Namespace:,SelfLink:,UID:90fddd7e-b080-11ea-9b44-0cc47ab32cea,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,},Target:ObjectReference{Kind:Node,Namespace:,Name:ser-330,UID:,APIVersion:,ResourceVersion:,FieldPath:,},}
[ debug ] 2020/06/17 09:54:43 nodeinfo.go:193: Allocate() 3. Try to add pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns to dev 1
[ debug ] 2020/06/17 09:54:43 deviceinfo.go:57: dev.addPod() Pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns with the GPU ID 1 will be added to device map
[ debug ] 2020/06/17 09:54:43 nodeinfo.go:204: Allocate() ----End to allocate GPU for gpu mem for pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns----
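For context, a simplified illustration of the Allocate() steps traced above: pick a device with enough free GPU memory, produce the ALIYUN_COM_GPU_MEM_* annotations with ASSIGNED still set to "false", and record the reservation on that device. The types and numbers below are my own sketch under that reading of the log, not the extender's actual code.

package main

import (
	"fmt"
	"strconv"
	"time"
)

// device is illustrative bookkeeping; the real extender keeps this state in its node/device cache.
type device struct {
	id       int
	totalMem int // GiB
	usedMem  int // GiB
}

type nodeInfo struct {
	devices []*device
}

// allocate mirrors the log above: find a candidate device, reserve the requested
// memory on it, and return the annotations the extender patches onto the pod.
// ASSIGNED stays "false" until the device plugin later confirms the assignment.
func (n *nodeInfo) allocate(reqMem int) (map[string]string, error) {
	for _, d := range n.devices {
		if d.totalMem-d.usedMem >= reqMem {
			d.usedMem += reqMem
			return map[string]string{
				"ALIYUN_COM_GPU_MEM_IDX":         strconv.Itoa(d.id),
				"ALIYUN_COM_GPU_MEM_POD":         strconv.Itoa(reqMem),
				"ALIYUN_COM_GPU_MEM_DEV":         strconv.Itoa(d.totalMem),
				"ALIYUN_COM_GPU_MEM_ASSIGNED":    "false",
				"ALIYUN_COM_GPU_MEM_ASSUME_TIME": strconv.FormatInt(time.Now().UnixNano(), 10),
			}, nil
		}
	}
	return nil, fmt.Errorf("no GPU has %d GiB free", reqMem)
}

func main() {
	// Two 24 GiB GPUs, as suggested by the DEV:24 annotation above; an 8 GiB request lands on one of them.
	node := &nodeInfo{devices: []*device{{id: 0, totalMem: 24}, {id: 1, totalMem: 24}}}
	fmt.Println(node.allocate(8))
}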
  • gpushare device plugin log:
I0617 10:04:50.278017       1 podmanager.go:123] list pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns in node ser-330 and status is Pending
I0617 10:04:50.278039       1 podutils.go:91] Found GPUSharedAssumed assumed pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in namespace k8s-common-ns.
I0617 10:04:50.278046       1 podmanager.go:157] candidate pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns with timestamp 1592387683318737367 is found.
I0617 10:04:50.278056       1 allocate.go:70] Pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns request GPU Memory 8 with timestamp 1592387683318737367
I0617 10:04:50.278064       1 allocate.go:80] Found Assumed GPU shared Pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns with GPU Memory 8
I0617 10:04:50.354408       1 podmanager.go:123] list pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns in node ser-330 and status is Pending
I0617 10:04:50.354423       1 podutils.go:96] GPU assigned Flag for pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl exists in namespace k8s-common-ns and its assigned status is true, so it's not GPUSharedAssumed assumed pod.
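One possible reading of the error together with this log: the Allocate call that actually started the failing container did not get matched to this pod's reservation (the plugin picks a candidate assumed pod by request size and timestamp, as in the lines above, and some of those candidates had been deleted during the image pull), so the container was handed a placeholder device id instead of a real GPU, which nvidia-container-cli then rejects. A minimal sketch of what such a fallback looks like; the function and parameter names are mine, only the placeholder string comes from the error events above.

package main

import "fmt"

// visibleDevices sketches the device id handed to the container at Allocate time.
// With a matched assumed pod the container gets the id of the GPU the extender
// chose; without one there is no real index to expose, and a placeholder id is
// returned, which nvidia-container-cli rejects as "unknown device id".
func visibleDevices(assumedPodFound bool, devID string, reqMemGiB int) string {
	if assumedPodFound {
		return devID // e.g. the GPU 1 picked by the scheduler extender
	}
	// Matches the "no-gpu-has-8MiB-to-run" string in the error events, where the request was 8.
	return fmt.Sprintf("no-gpu-has-%dMiB-to-run", reqMemGiB)
}

func main() {
	fmt.Println(visibleDevices(false, "GPU-1", 8)) // prints no-gpu-has-8MiB-to-run
}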