Too many pods in a cluster can cause high load and a cluster avalanche; separately, using MiB as the unit can cause the kubelet gRPC call to fail

Question

qmloong opened this issue 4 years ago · 0 comments

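podmanager contains an operation that lists every pod in the cluster. If the cluster has a very large number of pods (more than 20,000) and many GPU-consuming pods are scaled up at once (tested by scaling from 0 to 1000), the list requests against the apiserver exceed 10 QPS and trigger a cluster avalanche.
When the unit is MiB and the device's gpumem is 124 GB, there are 12400 fake device IDs. In testing, kubelet's ListAndWatch gRPC call then returns the error below. Changing how the device name string is concatenated mitigates the problem.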
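
To make the load pattern concrete, here is a minimal sketch of one possible mitigation: putting a short-lived cache in front of the full pod list so that a burst of allocations shares one apiserver call. This is not something the report proposes, and it is written in Python only for illustration (the plugin itself is Go). `CachedPodLister`, `list_fn`, and `ttl_seconds` are hypothetical names; a real fix would more likely use a watch-based informer or a list scoped to the node.

```python
import time


class CachedPodLister:
    """Serves pod lists from memory and refreshes at most once per TTL.

    list_fn is a hypothetical stand-in for the expensive "list all pods"
    call; it is not a real client API.
    """

    def __init__(self, list_fn, ttl_seconds, clock=time.monotonic):
        self._list_fn = list_fn
        self._ttl = ttl_seconds
        self._clock = clock
        self._pods = None
        self._fetched_at = None

    def list(self):
        now = self._clock()
        expired = (
            self._fetched_at is None
            or now - self._fetched_at >= self._ttl
        )
        if expired:
            # Only this branch reaches the apiserver stand-in.
            self._pods = self._list_fn()
            self._fetched_at = now
        return self._pods
```

With this shape, 1000 allocations arriving within one TTL window cost a single full list instead of 1000, at the price of reading data that may be up to `ttl_seconds` stale.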

Jun 24 18:55:09 10-12-3-162 kubelet[350652]: E0624 18:55:09.869624  350652 endpoint.go:106] listAndWatch ended unexpectedly for device plugin aliyun.com/gpu-mem with error rpc error: code = ResourceExhausted desc = grpc: received message larger than max (7880680 vs. 4194304)
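
The log shows the ListAndWatch response exceeding the 4194304-byte gRPC receive limit. The sketch below is a rough size estimate, assuming each fake device is encoded as a protobuf message with a string ID in field 1 and a string health value in field 2, repeated in field 1 of the response. The ID format `<uuid>-_-<index>`, the 40-character UUID, and the count of 124 × 1024 devices are assumptions made for illustration and are not taken from the report; it is in Python only for brevity.

```python
GRPC_DEFAULT_MAX = 4 * 1024 * 1024  # 4194304, the limit shown in the kubelet log


def varint_len(n):
    """Bytes protobuf needs to encode the non-negative integer n as a varint."""
    size = 1
    while n >= 0x80:
        n >>= 7
        size += 1
    return size


def string_field_len(value):
    """Encoded size of a string field whose tag fits in one byte (field < 16)."""
    body = len(value.encode("utf-8"))
    return 1 + varint_len(body) + body


def device_len(device_id, health="Healthy"):
    """Encoded size of one device entry inside the repeated response field."""
    body = string_field_len(device_id) + string_field_len(health)
    return 1 + varint_len(body) + body


def response_len(device_ids):
    """Approximate encoded size of a response carrying all given device IDs."""
    return sum(device_len(d) for d in device_ids)


def fake_ids(prefix, units, sep="-_-"):
    """Hypothetical fake-ID scheme: one ID per memory unit, prefix + index."""
    return [f"{prefix}{sep}{i}" for i in range(units)]


if __name__ == "__main__":
    units = 124 * 1024  # assumption: 124 GiB advertised in MiB
    long_ids = fake_ids("GPU-" + "0" * 36, units)  # UUID-length prefix
    short_ids = fake_ids("0", units)               # short index prefix
    print(response_len(long_ids) > GRPC_DEFAULT_MAX)
    print(response_len(short_ids) > GRPC_DEFAULT_MAX)
```

Under these assumptions the UUID-prefixed IDs push the response past the limit while a short prefix keeps it under, which is consistent with the observation that changing the name concatenation eases the problem. Advertising in a coarser unit such as GiB would shrink the device count itself.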