ROCm/ROCm-docker

ROCm docker on kvm guest

dannysemi opened this issue · 10 comments

Is it possible to run ROCm-docker from a kvm guest? I haven't had any success so far. ROCm works fine on the kvm guest itself, but I get the following errors in the container:

rocm-user@409e97fd5bce:/opt/rocm/hsa/sample$ ./vector_copy
Initializing the hsa runtime failed.

rocm-user@409e97fd5bce:~$ ./HelloWorld
Failed to find any OpenCL platforms.
Failed to create OpenCL context.

I'm running this on an Intel i7 5960x and Vega 64. I'm using Ubuntu 16.04 with kernel upgraded to 4.13.0-36 from the hwe-16.04-edge packages on the kvm guest. I modified the Dockerfile for my rocm-terminal container to add the rocm-user to the 'video' group (I also had to reinstall make in the container).

All drivers and docker software on the kvm guest installed according to the install guide. Tried using both --device="/dev/kfd" and --privileged flags, no success.

I could be wrong, but I think what you're experiencing is similar to #33. Were you ever able to run the rocm stack from within rocm-docker? If not, then it may not be an issue of the source of docker.io or docker-ce (me) or kvm (you) since we're both seeing similar behavior. Curious if you saw similar missing dependencies like python3 and libnuma-dev if you run rocm-smi or rocminfo?

I can run the examples from the guide successfully on my kvm guest but not from within a docker container on the kvm guest. python3 and libnuma-dev were missing from the container as well, but I edited the Dockerfile to include them. rocm-smi provides identical output to my kvm guest. rocminfo results in an error.

Are /dev/kfd and /dev/dri/renderD* visible inside docker with the right permissions?

With the --privileged flag enabled they are visible in the docker container:

rocm-user@1ea5a25a0df1:$ ls /dev/kfd
/dev/kfd
rocm-user@1ea5a25a0df1:$ ls /dev/dri
card0 renderD128

This is the same output I get on the kvm guest from which I ran the container.

What should the permissions look like in "ls -l" output, or what chmod value would you use if needed, e.g. 755 or 777?

@fxkamd Looks like it is a permissions thing. If I run the container as root then I get the expected result. Not as secure as it could be, but it will work for my dev environment.

@dannysemi On bare metal, udevd may have a rule to add the local console user to the access control lists of /dev/kfd and /dev/dri/renderD*, so you don't need to mess with permissions yourself. That probably doesn't work in the container. You may need to change the permissions of /dev/kfd and /dev/dri/* in the container. Usually the permissions are set to 0660, group=video. Make sure your user account is in the video group, and you should be fine.
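The check described above can be sketched as a small shell helper to run inside the container. This is only a sketch, assuming GNU coreutils `stat` is available; the function name `check_gpu_node` is hypothetical and not part of ROCm or Docker.

```shell
# Hypothetical helper: report the octal mode and owning group of a device
# node, and whether the current user can open it for reading and writing.
check_gpu_node() {
    node="$1"
    if [ ! -e "$node" ]; then
        echo "$node: missing"
        return 1
    fi
    # GNU stat: %a = octal mode, %G = owning group name
    mode_group=$(stat -c '%a %G' "$node")
    if [ -r "$node" ] && [ -w "$node" ]; then
        access="read/write OK"
    else
        access="no read/write access"
    fi
    echo "$node: $mode_group ($access)"
}
```

Calling `check_gpu_node /dev/kfd` and `check_gpu_node /dev/dri/renderD128` as the container user should report mode 660 and group video if the setup matches the description above; `id -nG` lists the groups the user actually belongs to.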

@fxkamd I tried that before with no success. Maybe the group mappings don't translate properly between host and container? If I create a user on host with all the proper permissions and then directly pass that user to the container with the -u flag then it works. Otherwise I have to run as root.
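One possible explanation, which this thread does not confirm, is that the numeric GID of the video group differs between the kvm guest and the image: device permissions are checked against numeric IDs, not group names. The sketch below builds a `docker run` command that adds the GID owning a device node as a supplementary group via docker's `--group-add` flag. The function name `docker_cmd_for_node` is hypothetical, the image name `rocm/rocm-terminal` is an assumption based on the container mentioned earlier, and GNU `stat` is assumed.

```shell
# Hypothetical helper: print a docker run command that adds the numeric
# GID owning the given device node as a supplementary group.
docker_cmd_for_node() {
    node="$1"
    image="${2:-rocm/rocm-terminal}"          # assumed image name
    gid=$(stat -c '%g' "$node") || return 1   # GNU stat: numeric group ID
    echo "docker run -it --device=/dev/kfd --device=/dev/dri --group-add $gid $image"
}
```

Running `docker_cmd_for_node /dev/kfd` on the kvm guest prints the command to try. If the non-root user can then open /dev/kfd inside the container, a GID mismatch was likely the cause.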

@dannysemi I managed to get rocm-enabled docker (from Ubuntu's repo, not the Docker CE repo) working on kvm guests by following the instructions in #33. If you need the KVM + PCIe passthrough instructions I could provide those as well (my host is Fedora 27, guest is Ubuntu 16.04.3), but since you already have rocm running on your guest I assume that's not the root cause of this issue.

Could you try upgrading to docker-ce 18.04, as instructed in quick-start.md?