AliyunContainerService/gpushare-device-plugin

Concurrently creating sharegpu instances can cause creation to fail

guunergooner opened this issue · 0 comments

What happened:

  • Several sharegpu instances using a large image are created concurrently, and some of them are deleted while the image is still being pulled; the remaining sharegpu instances then fail to be created.

What you expected to happen:

  • Deleting some sharegpu instances while the image is still being pulled should not affect the others; the remaining sharegpu instances should be created successfully.

How to reproduce it (as minimally and precisely as possible):

  • Create several sharegpu instances of a large image concurrently.
  • Delete some of the sharegpu instances while the image is still being pulled.
  • Wait for the image pull to finish; the remaining sharegpu instances fail to start (a rough client-go sketch of these steps follows this list).
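A rough client-go sketch of the three steps above (my own illustration, not the actual workload): the namespace matches the one in the logs below, while the image name, kubeconfig path, pod count, and the 30-second delay are placeholders; the API calls use the pre-1.17 client-go signatures to match the cluster version listed under Environment.

package main

import (
	"fmt"
	"sync"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// gpuSharePod builds a pod that requests 8 units of shared GPU memory
// (the same amount as the failing pod below) from a large image.
func gpuSharePod(name string) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: "k8s-common-ns"},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:  "cuda",
				Image: "registry.example.com/big-cuda-image:latest", // placeholder large image
				Resources: corev1.ResourceRequirements{
					Limits: corev1.ResourceList{
						corev1.ResourceName("aliyun.com/gpu-mem"): resource.MustParse("8"),
					},
				},
			}},
		},
	}
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config") // placeholder kubeconfig path
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	pods := client.CoreV1().Pods("k8s-common-ns")

	// 1. Create several sharegpu pods concurrently; the image is not cached yet,
	//    so all of them sit in image pull for a while.
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			if _, err := pods.Create(gpuSharePod(fmt.Sprintf("sharegpu-repro-%d", i))); err != nil {
				fmt.Println("create failed:", err)
			}
		}(i)
	}
	wg.Wait()

	// 2. Delete a few of them while the image pull is still in progress.
	time.Sleep(30 * time.Second)
	for i := 0; i < 2; i++ {
		if err := pods.Delete(fmt.Sprintf("sharegpu-repro-%d", i), &metav1.DeleteOptions{}); err != nil {
			fmt.Println("delete failed:", err)
		}
	}

	// 3. Once the pull finishes, the surviving pods fail with the
	//    "unknown device id: no-gpu-has-8MiB-to-run" error shown below.
}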

Anything else we need to know?:

  • Error events from kubectl describe pod:
  Warning  Failed     12m                  kubelet, ser-330 Error: failed to start container "k8s-deploy-ubhqko-1592387682017": Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=no-gpu-has-8MiB-to-run --compute --compat32 --graphics --utility --video --display --require=cuda>=9.0 --pid=16101 /data/docker_rt/overlay2/b647088d3759dc873fe4f60ba3b9d9de7eb85578fe17c2b2af177bb49d048450/merged]\\\\nnvidia-container-cli: device error: unknown device id: no-gpu-has-8MiB-to-run\\\\n\\\"\"": unknown

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.5", GitCommit:"20c265fef0741dd71a66480e35bd69f18351daea", GitTreeState:"clean", BuildDate:"2019-10-15T19:16:51Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14-20200217", GitCommit:"883cfa7a769459affa307774b12c9b3e99f4130b", GitTreeState:"clean", BuildDate:"2020-02-17T14:06:28Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
BareMetal User Provided Infrastructure
  • OS (e.g: cat /etc/os-release):
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
  • Kernel (e.g. uname -a):
Linux ser-330 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:
  • Pod metadata annotations:
 $ kubectl -n k8s-common-ns get pods k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl -o json | jq '.metadata.annotations'
{
  "ALIYUN_COM_GPU_MEM_ASSIGNED": "true",
  "ALIYUN_COM_GPU_MEM_ASSUME_TIME": "1592388290278113475",
  "ALIYUN_COM_GPU_MEM_DEV": "24",
  "ALIYUN_COM_GPU_MEM_IDX": "1",
  "ALIYUN_COM_GPU_MEM_POD": "8"
}
  • Last state of the pod's container statuses:
 $ kubectl -n k8s-common-ns get pods k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl -o json | jq '.status.containerStatuses[].lastState'
{
  "terminated": {
    "containerID": "docker://307060463dcf85c135d89abeb50edaa493b5042f47a4d5d74eccc30b71edf245",
    "exitCode": 128,
    "finishedAt": "2020-06-17T10:20:49Z",
    "message": "OCI runtime create failed: container_linux.go:344: starting container process caused \"process_linux.go:424: container init caused \\\"process_linux.go:407: running prestart hook 0 caused \\\\\\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=no-gpu-has-8MiB-to-run --compute --compat32 --graphics --utility --video --display --require=cuda>=9.0 --pid=5008 /data/docker_rt/overlay2/02cda4031418bb8cdf08e94213adb066981257069e48d8369cb3b9ab3e37f274/merged]\\\\\\\\nnvidia-container-cli: device error: unknown device id: no-gpu-has-8MiB-to-run\\\\\\\\n\\\\\\\"\\\"\": unknown",
    "reason": "ContainerCannotRun",
    "startedAt": "2020-06-17T10:20:49Z"
  }
}
  • gpushare scheduler extender log:
[ debug ] 2020/06/17 09:54:43 gpushare-predicate.go:17: check if the pod name k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl can be scheduled on node ser-330
[ debug ] 2020/06/17 09:54:43 gpushare-predicate.go:31: The pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in the namespace k8s-common-ns can be scheduled on ser-330
[ debug ] 2020/06/17 09:54:43 routes.go:121: gpusharingBind ExtenderArgs ={k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl k8s-common-ns 90fddd7e-b080-11ea-9b44-0cc47ab32cea ser-330}
[ debug ] 2020/06/17 09:54:43 nodeinfo.go:143: Allocate() ----Begin to allocate GPU for gpu mem for pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns----
[ debug ] 2020/06/17 09:54:43 nodeinfo.go:220: reqGPU for pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns: 8
[ debug ] 2020/06/17 09:54:43 nodeinfo.go:239: Find candidate dev id 1 for pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns successfully.
[ debug ] 2020/06/17 09:54:43 nodeinfo.go:147: Allocate() 1. Allocate GPU ID 1 to pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns.----
[  info ] 2020/06/17 09:54:43 controller.go:286: Need to update pod name k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns and old status is Pending, new status is Pending; its old annotation map[] and new annotation map[ALIYUN_COM_GPU_MEM_IDX:1 ALIYUN_COM_GPU_MEM_POD:8 ALIYUN_COM_GPU_MEM_ASSIGNED:false ALIYUN_COM_GPU_MEM_ASSUME_TIME:1592387683318737367 ALIYUN_COM_GPU_MEM_DEV:24]
[ debug ] 2020/06/17 09:54:43 nodeinfo.go:179: Allocate() 2. Try to bind pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in k8s-common-ns namespace to node  with &Binding{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl,GenerateName:,Namespace:,SelfLink:,UID:90fddd7e-b080-11ea-9b44-0cc47ab32cea,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,},Target:ObjectReference{Kind:Node,Namespace:,Name:ser-330,UID:,APIVersion:,ResourceVersion:,FieldPath:,},}
[ debug ] 2020/06/17 09:54:43 nodeinfo.go:193: Allocate() 3. Try to add pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns to dev 1
[ debug ] 2020/06/17 09:54:43 deviceinfo.go:57: dev.addPod() Pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns with the GPU ID 1 will be added to device map
[ debug ] 2020/06/17 09:54:43 nodeinfo.go:204: Allocate() ----End to allocate GPU for gpu mem for pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns----
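For context, a simplified illustration of the Allocate() steps traced above: pick a device with enough free GPU memory, produce the ALIYUN_COM_GPU_MEM_* annotations with ASSIGNED still set to "false", and record the reservation on that device. The types and numbers below are my own sketch under that reading of the log, not the extender's actual code.

package main

import (
	"fmt"
	"strconv"
	"time"
)

// device is illustrative bookkeeping; the real extender keeps this state in its node/device cache.
type device struct {
	id       int
	totalMem int // GiB
	usedMem  int // GiB
}

type nodeInfo struct {
	devices []*device
}

// allocate mirrors the log above: find a candidate device, reserve the requested
// memory on it, and return the annotations the extender patches onto the pod.
// ASSIGNED stays "false" until the device plugin later confirms the assignment.
func (n *nodeInfo) allocate(reqMem int) (map[string]string, error) {
	for _, d := range n.devices {
		if d.totalMem-d.usedMem >= reqMem {
			d.usedMem += reqMem
			return map[string]string{
				"ALIYUN_COM_GPU_MEM_IDX":         strconv.Itoa(d.id),
				"ALIYUN_COM_GPU_MEM_POD":         strconv.Itoa(reqMem),
				"ALIYUN_COM_GPU_MEM_DEV":         strconv.Itoa(d.totalMem),
				"ALIYUN_COM_GPU_MEM_ASSIGNED":    "false",
				"ALIYUN_COM_GPU_MEM_ASSUME_TIME": strconv.FormatInt(time.Now().UnixNano(), 10),
			}, nil
		}
	}
	return nil, fmt.Errorf("no GPU has %d GiB free", reqMem)
}

func main() {
	// Two 24 GiB GPUs, as suggested by the DEV:24 annotation above; an 8 GiB request lands on one of them.
	node := &nodeInfo{devices: []*device{{id: 0, totalMem: 24}, {id: 1, totalMem: 24}}}
	fmt.Println(node.allocate(8))
}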
  • gpushare device plugin log:
I0617 10:04:50.278017       1 podmanager.go:123] list pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns in node ser-330 and status is Pending
I0617 10:04:50.278039       1 podutils.go:91] Found GPUSharedAssumed assumed pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in namespace k8s-common-ns.
I0617 10:04:50.278046       1 podmanager.go:157] candidate pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns with timestamp 1592387683318737367 is found.
I0617 10:04:50.278056       1 allocate.go:70] Pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns request GPU Memory 8 with timestamp 1592387683318737367
I0617 10:04:50.278064       1 allocate.go:80] Found Assumed GPU shared Pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns with GPU Memory 8
I0617 10:04:50.354408       1 podmanager.go:123] list pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns in node ser-330 and status is Pending
I0617 10:04:50.354423       1 podutils.go:96] GPU assigned Flag for pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl exists in namespace k8s-common-ns and its assigned status is true, so it's not GPUSharedAssumed assumed pod.
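One possible reading of the error together with this log: the Allocate call that actually started the failing container did not get matched to this pod's reservation (the plugin picks a candidate assumed pod by request size and timestamp, as in the lines above, and some of those candidates had been deleted during the image pull), so the container was handed a placeholder device id instead of a real GPU, which nvidia-container-cli then rejects. A minimal sketch of what such a fallback looks like; the function and parameter names are mine, only the placeholder string comes from the error events above.

package main

import "fmt"

// visibleDevices sketches the device id handed to the container at Allocate time.
// With a matched assumed pod the container gets the id of the GPU the extender
// chose; without one there is no real index to expose, and a placeholder id is
// returned, which nvidia-container-cli rejects as "unknown device id".
func visibleDevices(assumedPodFound bool, devID string, reqMemGiB int) string {
	if assumedPodFound {
		return devID // e.g. the GPU 1 picked by the scheduler extender
	}
	// Matches the "no-gpu-has-8MiB-to-run" string in the error events, where the request was 8.
	return fmt.Sprintf("no-gpu-has-%dMiB-to-run", reqMemGiB)
}

func main() {
	fmt.Println(visibleDevices(false, "GPU-1", 8)) // prints no-gpu-has-8MiB-to-run
}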