containers/nri-plugins

topology-aware: internal error from changing containers' NUMA nodes by adjusting AvailableResources


Assume that a container runs on CPUs of NUMA node 0.

An admin wants to reorganize server resources so that containers no longer use CPUs on NUMA node/die/socket 0, by removing those CPUs from AvailableResources.
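As a rough sketch of such a change (hedged: the kind/apiVersion follow my reading of the nri-plugins custom resources, and the cpuset values are made-up assumptions about this server's layout, not taken from the report):

```yaml
# Hypothetical example: shrink the pool the topology-aware policy may
# allocate from, so CPUs on NUMA node 0 (assumed here to be 0-3) are
# no longer handed out to containers.
apiVersion: config.nri/v1alpha1
kind: TopologyAwarePolicy
metadata:
  name: default
  namespace: kube-system
spec:
  availableResources:
    cpu: cpuset:4-15   # CPUs 0-3 removed from AvailableResources
```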

When this is done, restarting the topology-aware NRI plugin with the new configuration fails with an internal error:

E0710 07:30:57.289447       1 nri.go:784] <= Synchronize FAILED: failed to start policy topology-aware: topology-aware: failed to start:
topology-aware: failed to restore allocations from cache:
topology-aware: failed to allocate <CPU request pod0/pod0c0: exclusive: 3><Memory request: limit:95.37M, req:95.37M> from <NUMA node #1 allocatable: MemLimit: DRAM 1.85G>:
topology-aware: internal error: NUMA node #1: can't slice 3 exclusive CPUs from , 0m available

Let's discuss whether this is a bug, expected behavior, or whether we should provide a configuration option for forcing new CPU/memory pinning, even if that would lead to costly memory accesses/moves.

The current workaround for this error is deleting the cache, thereby forcing reassignment of resources from scratch. Both this workaround and draining the node before changing AvailableResources are heavier operations than forcing new pinning would be.
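For completeness, the workaround could look roughly like the sketch below. The default cache path is an assumption about the nri-resource-policy state directory on the node, not something confirmed in this report; pass an explicit path if your deployment differs.

```shell
# clear_policy_cache removes the policy's cached allocations so that the
# plugin reallocates all containers from scratch on its next start.
clear_policy_cache() {
    # ASSUMED default location of the nri-resource-policy cache file.
    cache="${1:-/var/lib/nri-resource-policy/cache}"
    if [ -f "$cache" ]; then
        rm -- "$cache" && echo "removed $cache"
    else
        echo "no cache file at $cache"
    fi
}
```

After removing the cache, restart the plugin (e.g. delete its pod) so resources are reassigned from scratch.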