kubernetes/kubernetes

app Container can't reuse its init Container cpuset in a specific condition

lianghao208 opened this issue · 11 comments

What happened?

We can't guarantee that an app container always reuses its init container's cpuset. This can waste CPUs (the init container has already exited, but its cpuset cannot be reused by other containers) and lead to "not enough cpus available to satisfy request" errors.


What did you expect to happen?

app Container always reuses init Container cpuset after init Container exits.

How can we reproduce it (as minimally and precisely as possible)?

This is one of the specific conditions that can cause the issue:

  • Pod A is ready to allocate a cpuset; its init container and its app container both request 92 CPUs.
  • Pod B is already running on the node and is about to be deleted.
  1. Pod A's init container is allocated cpuset 4-24,48-60,73-84,100-120,144-156,169-180:
I0510 16:40:21.232949   20266 state_mem.go:80] "Updated desired CPUSet" podUID="2f9922ce-df66-4b58-abd8-01187b813318" containerName="init-container" cpuSet="4-24,48-60,73-84,100-120,144-156,169-180"
  2. Pod A's init container exits.
  3. Before Pod A's app container is allocated a cpuset, Pod B gets deleted and releases its cpuset (the default CPUSet becomes "0-3,25-47,61-72,85-99,121-143,157-168,181-191"):
I0510 16:40:27.759335   20266 state_mem.go:107] "Deleted CPUSet assignment" podUID="74510e24-48ba-4fd7-ab85-80dd99c6df5d" containerName="deleted-container"
I0510 16:40:27.759714   20266 state_mem.go:88] "Updated default CPUSet" cpuSet="0-3,25-47,61-72,85-99,121-143,157-168,181-191"
  4. Pod A's app container is allocated a cpuset.
    We expect Pod A's app container to reuse its init container's cpuset,
    but due to Pod B's deletion it is allocated a different cpuset (4-49,100-145):
I0510 16:40:27.989453   20266 state_mem.go:80] "Updated desired CPUSet" podUID="2f9922ce-df66-4b58-abd8-01187b813318" containerName="app-container" cpuSet="4-49,100-145"

Now we have Pod A's init container holding cpuset 4-24,48-60,73-84,100-120,144-156,169-180,
and Pod A's app container holding cpuset 4-49,100-145.
The init container's cpuset is not reused as expected.

When a new Pod C later starts to allocate a cpuset, it may hit a "not enough cpus available to satisfy request" error.
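
To quantify the overlap, here is a small standalone check that redoes the arithmetic from the logs above. It uses the k8s.io/utils/cpuset package (assumed to be importable here); this is illustrative code, not kubelet code:

```go
package main

import (
	"fmt"

	"k8s.io/utils/cpuset"
)

func main() {
	// cpusets copied from the kubelet logs above
	initSet, err := cpuset.Parse("4-24,48-60,73-84,100-120,144-156,169-180")
	if err != nil {
		panic(err)
	}
	appSet, err := cpuset.Parse("4-49,100-145")
	if err != nil {
		panic(err)
	}

	reused := initSet.Intersection(appSet) // CPUs the app container took over from the init container
	stranded := initSet.Difference(appSet) // CPUs still assigned to the exited init container, never reused

	fmt.Printf("init container cpuset: %d CPUs\n", initSet.Size())
	fmt.Printf("app container cpuset:  %d CPUs\n", appSet.Size())
	fmt.Printf("reused:                %d CPUs (%s)\n", reused.Size(), reused)
	fmt.Printf("not reused:            %d CPUs (%s)\n", stranded.Size(), stranded)
}
```

Both containers request 92 CPUs, but only part of the init container's cpuset ends up inside the app container's cpuset; the remainder stays assigned to an exited container, which is exactly the capacity Pod C can no longer get.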

Anything else we need to know?

No response

Kubernetes version

1.30

Cloud provider

NONE

OS version


Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

/sig node

/cc @klueska Hi Klues, I noticed you have solved some similar issues, such as #102014. I wonder if you have encountered this issue before.

The point is: the cpuset allocations for init containers and app containers differ due to changes in the available cpusets in the time interval between the start of the init container and the start of the app container.

related: #94220

#124282 similar issue?

related: #94220

@ffromani Thanks for the mention, the issue I describe here is a little different from #94220.
In #94220, the bug is caused by different cpu requests between the init container and the app container.
This bug might be caused by changes in the available cpusets in the time interval between the start of the init container and the start of the app container.

#124282 similar issue?

@chengjoey Not exactly the same issue. In #124282, the init container and app container request different amounts of cpu (init > app), and the init container's cpuset can't be released even though it has exited (similar to #94220).

But in this case, the init container and app container request the same amount of cpu (init == app), so this is a kubelet issue rather than a kube-scheduler issue.

related: #94220

@ffromani Thanks for the mention, the issue I describe here is a little different from #94220. In #94220, the bug is caused by different cpu requests between the init container and the app container. This bug might be caused by changes in the available cpusets in the time interval between the start of the init container and the start of the app container.

Yes, I realized after re-reading the description of this issue. I'd need to check whether the system guarantees maximum reuse of the init container's cpu cores when allocating the app container's cpu cores. Nevertheless, it's a very desirable property the system should strive to ensure. My gut feeling is that there is just a bug in this area; I remember various conversations about it over time.

the core issue is here: https://github.com/kubernetes/kubernetes/blob/v1.30.0/pkg/kubelet/cm/cpumanager/policy_static.go#L394

with this line of code, all the available CPUs are put in a single pool. IOW, nothing guarantees that the reusable CPUs from the terminated init container will be consumed first, or at all, if the system has enough CPUs to fulfill the app container's request.

I vaguely remember some past conversations in this area about guaranteeing optimal allocation in the context of the topology manager-enforced constraints. Also, I wonder if and how we should extend this guarantee. IOW, should the reuse be best-effort (and so, arguably, there's no bug?)

Perhaps the best way to fix this would be to add a new cpu manager policy option.
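
For illustration only, a best-effort fix could draw from the reusable set before touching the rest of the pool. This is a rough sketch on k8s.io/utils/cpuset values, not a patch to staticPolicy.allocateCPUs; the take callback stands in for the topology-aware selection (takeByTopology in the real policy), and the function name is hypothetical:

```go
package cpureuse

import "k8s.io/utils/cpuset"

// allocatePreferringReuse picks numCPUs CPUs for the app container, drawing
// from the exited init container's cpuset (reusable) before falling back to
// the rest of the free pool (available). Purely illustrative.
func allocatePreferringReuse(
	available, reusable cpuset.CPUSet,
	numCPUs int,
	take func(from cpuset.CPUSet, n int) (cpuset.CPUSet, error),
) (cpuset.CPUSet, error) {
	if reusable.Size() >= numCPUs {
		// The whole request fits inside the reusable cpuset.
		return take(reusable, numCPUs)
	}

	// Take everything reusable, then top up from the remaining free CPUs.
	rest, err := take(available.Difference(reusable), numCPUs-reusable.Size())
	if err != nil {
		return cpuset.New(), err
	}
	return reusable.Union(rest), nil
}
```

Whether such behaviour should be the default or gated behind a new policy option, as suggested above, is the open question.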

/triage accepted
/priority backlog

@ffromani

with this line of code, all the available CPUs are put in a single pool. IOW, nothing guarantees that the reusable CPUs from the terminated init container will be consumed first, or at all, if the system has enough CPUs to fulfill the app container's request.

In this case, should we release the init container's cpuset as soon as it exits? If an init container exits successfully and won't restart anymore, its cpuset should either be reused by its own pod's app container or by other pods' containers; otherwise this "available" cpuset will not be used at all.
However, from the scheduler's perspective, these CPUs are already considered available.

I vaguely remember some past conversations in this area about guaranteeing optimal allocation in the context of the topology manager-enforced constraints. Also, I wonder if and how we should extend this guarantee. IOW, should the reuse be best-effort (and so, arguably, there's no bug?)

If we release the init container's cpuset as soon as it exits, the reuse will be guaranteed.
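
A rough sketch of that idea, again on k8s.io/utils/cpuset values rather than the real cpumanager state interface (the type and function below are hypothetical): once an init container has exited and will not restart, fold its cpuset back into the default (shared) set so that any container can be allocated from it.

```go
package cpureuse

import "k8s.io/utils/cpuset"

// assignments maps podUID -> containerName -> exclusively assigned cpuset,
// mirroring the shape of the cpumanager checkpoint (illustrative only).
type assignments map[string]map[string]cpuset.CPUSet

// releaseExitedInitContainer drops the init container's exclusive assignment
// and returns the updated default (shared) cpuset.
func releaseExitedInitContainer(
	assign assignments,
	defaultSet cpuset.CPUSet,
	podUID, initContainerName string,
) cpuset.CPUSet {
	containers, ok := assign[podUID]
	if !ok {
		return defaultSet
	}
	released, ok := containers[initContainerName]
	if !ok {
		return defaultSet
	}
	delete(containers, initContainerName)
	// The released CPUs rejoin the shared pool, so the kubelet's view of
	// available CPUs matches the scheduler's again.
	return defaultSet.Union(released)
}
```

One caveat: this would only apply to init containers that have terminated for good; restartable (sidecar-style) init containers must keep their cpuset.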