kata-containers/agent

fail to hotplug device if host memory size large enough

jongwu opened this issue · 7 comments

In the current implementation, maxMem in the kata guest is set to the host memory size. Host memory is usually not that large, so nothing special happens. But if the host memory is large enough, maxMem grows along with it. qemu then reserves a large region for normal memory, which can affect the PCIE address area, in particular its base address. Because the device path under /sys depends on the PCIE base address, the prefix of the device path, which is named "rootBusPath" in kata-agent, is not static. The current kata-agent implementation assumes that it is.

How to reproduce this bug:
Make sure the memory size of your host is large enough. I'm not sure of the lower limit, but on arm64 it is around 250G. Alternatively, you can set a large maxMem in kata-runtime and recompile.
Hot plug a device using docker:
docker run --rm --runtime kata-runtime --device /dev/loop0 ubuntu stat /dev/loop0

The following error occurs:
docker: Error response from daemon: OCI runtime create failed: rpc error: code = DeadlineExceeded desc = Timeout reached after 3s waiting for device 0:0:0:0/block: unknown.
ERRO[0004] error waiting for container: context canceled

I have only tested it on arm64. I guess this bug is cross-arch, but I'm not sure. So @jodh-intel @alicefr - can you try this on x86, ppc and s390?

Fwiw, x86 should not be affected; the sysfs path for the root bus is pretty much static there. ppc64 shouldn't be affected by this specific bug - the root bus path shouldn't be affected by max memory. However, the root bus sysfs path can vary, at least in theory, for a bunch of other reasons on ppc, so I suspect your change will be valuable there. I don't know about s390.

thanks @dgibson !

@jongwu s390x is a very special case. It does not use PCI devices, but CCW. So my guess is that this doesn't affect s390x at all.
However, I recently changed jobs, so I don't have access to an s390x system to verify it. Ccing @jschintag

@alicefr - on arm64, this kata bug is related to qemu. The PCIE memory map overlaps with the normal memory (I don't think that is a qemu bug), so the base address of PCIE may vary with the normal memory size. So I think the rust agent also has this bug on arm64, even though I have not tested it.

@jongwu yes, I see. I think your PR is fine for s390x :)

Actually I think I was mistaken about ppc64. The firmware device tree path for the (first) root bus does change, but the sysfs path does not, AFAICT. However we will need to support multiple root buses for ppc64 at some point, and this change is a useful preliminary step for that.

I tested the PR on s390x and there are no problems from my side.
As already mentioned, this shouldn't affect s390x. But I don't have a machine to test a setup with 250GB+ of memory either.