CentaurusInfra/alnair

Containerized vGPU server makes cgroup.procs content invisible (process utilization queries always return 0, compute control fails)


  1. Investigate with objdump and the compiler/Makefile settings to make sure all users can build the same .so file.
  2. May need to add debug messages in the intercept lib to check the following (see the sketch after this list):
    GPU utilization by process
    time interval
    fill rate
    CUDA call token consumption
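
For the "GPU utilization by process" item, one quick way to see what the driver actually reports, independent of the intercept-lib debug messages, is to dump the per-process utilization samples and compare their PIDs against the ones read from cgroup.procs. A minimal sketch, assuming the github.com/NVIDIA/go-nvml bindings (which may not be what the intercept lib itself uses):

```go
package main

import (
	"fmt"
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("nvml.Init failed: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	dev, ret := nvml.DeviceGetHandleByIndex(0)
	if ret != nvml.SUCCESS {
		log.Fatalf("DeviceGetHandleByIndex failed: %v", nvml.ErrorString(ret))
	}

	// lastSeenTimestamp = 0 asks the driver for all recent samples.
	samples, ret := dev.GetProcessUtilization(0)
	if ret != nvml.SUCCESS {
		log.Fatalf("GetProcessUtilization failed: %v", nvml.ErrorString(ret))
	}
	for _, s := range samples {
		// If the PIDs printed here never match the ones read from cgroup.procs,
		// the per-process utilization lookup will always come back as 0.
		fmt.Printf("pid=%d sm=%d%% mem=%d%% ts=%d\n", s.Pid, s.SmUtil, s.MemUtil, s.TimeStamp)
	}
}
```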

Mounting /sys/fs/cgroup/ from the host into the container causes the difference; it may overwrite the container's own cgroup info. As a result, the required process IDs from cgroup.procs are not retrieved correctly, so the per-process GPU utilization query does not return the real utilization.

The issue must be solved; manual installation requires too much work on the user side: 1) install the NVIDIA toolkit, 2) install Go, 3) copy the .so file, and 4) launch the vGPU server and device plugin process on each GPU node...

Current plan:

  1. Add debug messages to check the process IDs obtained from "/var/lib/alnair/workspace/cgroup.procs" in the two different setups, and verify that the process IDs are wrong in the containerized vGPU server and the user container (a reader sketch follows this list).
  2. Change the /sys/fs/cgroup/ mount point in the vGPU server.
  3. Verify that the vGPU server gets the container process IDs correctly and that the user container loads them correctly from the file.
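
For step 1, a minimal debug sketch that reads the shared cgroup.procs file and logs every PID found; the path is the one mentioned above, the helper name is just illustrative:

```go
package main

import (
	"bufio"
	"log"
	"os"
	"strconv"
)

// readCgroupProcs reads a cgroup.procs file and returns the PIDs it contains.
func readCgroupProcs(path string) ([]int, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var pids []int
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		if line == "" {
			continue
		}
		pid, err := strconv.Atoi(line)
		if err != nil {
			log.Printf("skipping non-numeric line %q", line)
			continue
		}
		pids = append(pids, pid)
	}
	return pids, sc.Err()
}

func main() {
	pids, err := readCgroupProcs("/var/lib/alnair/workspace/cgroup.procs")
	if err != nil {
		log.Fatalf("read cgroup.procs: %v", err)
	}
	// An empty list here in the containerized setup (but not on the host)
	// reproduces the symptom described in this issue.
	log.Printf("pids=%v", pids)
}
```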

Confirmed that with /sys/fs/cgroup mounted, the cgroup.procs file is there but empty: no process IDs are visible in the container, unlike on the host. Mounting to a different location won't solve this problem.
This kind of mounting creates something like a nested Docker hierarchy, which may not make sense; docker-in-docker could be researched further.
However, the current solution has switched to mounting the Docker socket and asking for process IDs through docker top <containerID> to obtain all the PIDs in the container (see the sketch below).
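
A minimal sketch of the docker-socket approach, using the Docker Go SDK (github.com/docker/docker/client). It assumes /var/run/docker.sock is mounted into the vGPU server container; the actual implementation may differ, e.g. shelling out to `docker top` instead:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/docker/docker/client"
)

// containerPIDs is the API equivalent of `docker top <containerID>`: it returns
// the PID column for every process running inside the container.
func containerPIDs(ctx context.Context, cli *client.Client, containerID string) ([]string, error) {
	top, err := cli.ContainerTop(ctx, containerID, nil)
	if err != nil {
		return nil, err
	}
	pidCol := -1
	for i, title := range top.Titles {
		if title == "PID" {
			pidCol = i
			break
		}
	}
	if pidCol < 0 {
		return nil, fmt.Errorf("no PID column in docker top output: %v", top.Titles)
	}
	var pids []string
	for _, proc := range top.Processes {
		pids = append(pids, proc[pidCol])
	}
	return pids, nil
}

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		log.Fatalf("docker client: %v", err)
	}
	defer cli.Close()

	// "<containerID>" is a placeholder for the user container's ID.
	pids, err := containerPIDs(context.Background(), cli, "<containerID>")
	if err != nil {
		log.Fatalf("docker top: %v", err)
	}
	log.Printf("pids in container: %v", pids)
}
```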

Closed by #130.
#130 also adds container ID parsing support for cgroupfs, so both the cgroupfs and systemd cgroup drivers are now supported (illustrated below).
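
A hedged sketch of what container ID parsing for both cgroup drivers can look like when scanning /proc/<pid>/cgroup lines; the regexes are illustrative, not copied from #130 (cgroupfs paths look like .../docker/<64-hex-id>, systemd paths like .../docker-<64-hex-id>.scope):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	cgroupfsRe = regexp.MustCompile(`/docker/([0-9a-f]{64})`)
	systemdRe  = regexp.MustCompile(`docker-([0-9a-f]{64})\.scope`)
)

// parseContainerID extracts the container ID from one line of /proc/<pid>/cgroup,
// accepting both the cgroupfs and the systemd cgroup driver layouts.
func parseContainerID(cgroupLine string) (string, bool) {
	if m := systemdRe.FindStringSubmatch(cgroupLine); m != nil {
		return m[1], true
	}
	if m := cgroupfsRe.FindStringSubmatch(cgroupLine); m != nil {
		return m[1], true
	}
	return "", false
}

func main() {
	id := strings.Repeat("0123456789abcdef", 4) // fake 64-hex container ID
	examples := []string{
		"12:memory:/docker/" + id,                         // cgroupfs driver
		"12:memory:/system.slice/docker-" + id + ".scope", // systemd driver
	}
	for _, line := range examples {
		cid, ok := parseContainerID(line)
		fmt.Println(ok, cid)
	}
}
```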