CentaurusInfra/alnair

Containerized vGPU server makes cgroup.procs content invisible (process utilization queries always return 0, compute control fails)


  1. Investigate with objdump and the compiler/Makefile settings to make sure all users can build the same .so file.
  2. May need to add debug messages in the intercept lib to check the following (see the sketch after this list):
    GPU utilization by process
    time interval
    fill rate
    CUDA call token consumption
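
For the "GPU utilization by process" item, one quick way to see what the driver actually reports, independent of the intercept-lib debug messages, is to dump the per-process utilization samples and compare their PIDs against the ones read from cgroup.procs. A minimal sketch, assuming the github.com/NVIDIA/go-nvml bindings (which may not be what the intercept lib itself uses):

```go
package main

import (
	"fmt"
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("nvml.Init failed: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	dev, ret := nvml.DeviceGetHandleByIndex(0)
	if ret != nvml.SUCCESS {
		log.Fatalf("DeviceGetHandleByIndex failed: %v", nvml.ErrorString(ret))
	}

	// lastSeenTimestamp = 0 asks the driver for all recent samples.
	samples, ret := dev.GetProcessUtilization(0)
	if ret != nvml.SUCCESS {
		log.Fatalf("GetProcessUtilization failed: %v", nvml.ErrorString(ret))
	}
	for _, s := range samples {
		// If the PIDs printed here never match the ones read from cgroup.procs,
		// the per-process utilization lookup will always come back as 0.
		fmt.Printf("pid=%d sm=%d%% mem=%d%% ts=%d\n", s.Pid, s.SmUtil, s.MemUtil, s.TimeStamp)
	}
}
```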

Mounting /sys/fs/cgroup/ from the host into the container causes the difference; it may overwrite the container's own cgroup info. As a result, the required process IDs from cgroup.procs are not retrieved correctly, so the per-process GPU utilization query does not return the real utilization.

The issue must be solved; manual installation requires too much work on the user side: 1) install the NVIDIA toolkit, 2) install Go, 3) copy the .so file, and 4) launch the vGPU server and device plugin process on each GPU node...

Current plan:

  1. Add debug messages to check the process IDs obtained from "/var/lib/alnair/workspace/cgroup.procs" in the two different setups, and verify that the process IDs are wrong in the containerized vGPU server and the user container (a reader sketch follows this list).
  2. Change the /sys/fs/cgroup/ mount point in the vGPU server.
  3. Verify that the vGPU server gets the container process IDs correctly and that the user container loads them correctly from the file.
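
For step 1, a minimal debug sketch that reads the shared cgroup.procs file and logs every PID found; the path is the one mentioned above, the helper name is just illustrative:

```go
package main

import (
	"bufio"
	"log"
	"os"
	"strconv"
)

// readCgroupProcs reads a cgroup.procs file and returns the PIDs it contains.
func readCgroupProcs(path string) ([]int, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var pids []int
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		if line == "" {
			continue
		}
		pid, err := strconv.Atoi(line)
		if err != nil {
			log.Printf("skipping non-numeric line %q", line)
			continue
		}
		pids = append(pids, pid)
	}
	return pids, sc.Err()
}

func main() {
	pids, err := readCgroupProcs("/var/lib/alnair/workspace/cgroup.procs")
	if err != nil {
		log.Fatalf("read cgroup.procs: %v", err)
	}
	// An empty list here in the containerized setup (but not on the host)
	// reproduces the symptom described in this issue.
	log.Printf("pids=%v", pids)
}
```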

Confirmed that with /sys/fs/cgroup mounted, the cgroup.procs file is there but empty: no process IDs are visible in the container, unlike on the host. Mounting to a different location won't solve this problem.
This kind of mounting creates something like a nested Docker hierarchy, which may not make sense; docker-in-docker could be researched further.
However, the current solution has switched to mounting the Docker socket and asking for process IDs through docker top <containerID> to obtain all the PIDs in the container (see the sketch below).
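
A minimal sketch of the docker-socket approach, using the Docker Go SDK (github.com/docker/docker/client). It assumes /var/run/docker.sock is mounted into the vGPU server container; the actual implementation may differ, e.g. shelling out to `docker top` instead:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/docker/docker/client"
)

// containerPIDs is the API equivalent of `docker top <containerID>`: it returns
// the PID column for every process running inside the container.
func containerPIDs(ctx context.Context, cli *client.Client, containerID string) ([]string, error) {
	top, err := cli.ContainerTop(ctx, containerID, nil)
	if err != nil {
		return nil, err
	}
	pidCol := -1
	for i, title := range top.Titles {
		if title == "PID" {
			pidCol = i
			break
		}
	}
	if pidCol < 0 {
		return nil, fmt.Errorf("no PID column in docker top output: %v", top.Titles)
	}
	var pids []string
	for _, proc := range top.Processes {
		pids = append(pids, proc[pidCol])
	}
	return pids, nil
}

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		log.Fatalf("docker client: %v", err)
	}
	defer cli.Close()

	// "<containerID>" is a placeholder for the user container's ID.
	pids, err := containerPIDs(context.Background(), cli, "<containerID>")
	if err != nil {
		log.Fatalf("docker top: %v", err)
	}
	log.Printf("pids in container: %v", pids)
}
```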

Closed by #130.
#130 also adds container ID parsing support for cgroupfs, so both the cgroupfs and systemd cgroup drivers are now supported (illustrated below).
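
A hedged sketch of what container ID parsing for both cgroup drivers can look like when scanning /proc/<pid>/cgroup lines; the regexes are illustrative, not copied from #130 (cgroupfs paths look like .../docker/<64-hex-id>, systemd paths like .../docker-<64-hex-id>.scope):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	cgroupfsRe = regexp.MustCompile(`/docker/([0-9a-f]{64})`)
	systemdRe  = regexp.MustCompile(`docker-([0-9a-f]{64})\.scope`)
)

// parseContainerID extracts the container ID from one line of /proc/<pid>/cgroup,
// accepting both the cgroupfs and the systemd cgroup driver layouts.
func parseContainerID(cgroupLine string) (string, bool) {
	if m := systemdRe.FindStringSubmatch(cgroupLine); m != nil {
		return m[1], true
	}
	if m := cgroupfsRe.FindStringSubmatch(cgroupLine); m != nil {
		return m[1], true
	}
	return "", false
}

func main() {
	id := strings.Repeat("0123456789abcdef", 4) // fake 64-hex container ID
	examples := []string{
		"12:memory:/docker/" + id,                         // cgroupfs driver
		"12:memory:/system.slice/docker-" + id + ".scope", // systemd driver
	}
	for _, line := range examples {
		cid, ok := parseContainerID(line)
		fmt.Println(ok, cid)
	}
}
```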