ROCm/ROCm-docker

/dev/kfd no longer exists

PhilipDeegan opened this issue · 11 comments

Not sure where it went.

Is this expected?

kknox commented

This is on your host machine, or in your container?

Host - rocm-dkms became an optional package and was seemingly removed.

Possibly something in the 1.6 to 1.7 upgrade?

kknox commented

Something went wrong with the packages; rocm-dkms should not have been removed. It's a meta package that references a bunch of subpackages. I assume that if you run lsmod now, you will no longer see the amdgpu or amdkfd kernel modules loaded. The amdkfd module is what provides the /dev/kfd device.
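For anyone hitting the same symptom, the check described above can be sketched roughly as follows. The function names `module_loaded` and `check_kfd` are made up for illustration; the only tools assumed are `lsmod` and `awk`.

```shell
#!/bin/sh
# Succeeds if module "$1" appears in lsmod-style output read from stdin.
# lsmod prints the module name in the first column.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Hypothetical helper: report the state of the two modules named in this
# thread and of the /dev/kfd device node.
check_kfd() {
    for mod in amdgpu amdkfd; do
        if lsmod 2>/dev/null | module_loaded "$mod"; then
            echo "$mod: loaded"
        else
            echo "$mod: not loaded"
        fi
    done
    if [ -e /dev/kfd ]; then
        echo "/dev/kfd: present"
    else
        echo "/dev/kfd: missing"
    fi
}

check_kfd
```

If both modules show as not loaded and /dev/kfd is missing, that matches the broken-package situation described in this comment.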

I tried doing a manual modprobe on amdkfd, but I got an exec error.

Doing a full remove/reinstall now. Will let you know, thanks.

That did the trick. Can you confirm, just for my own curiosity: is the ROCm 4.11 kernel still supposed to be there? I have it, but I don't see a "rocm-kernel" package anymore, so I'm not sure if it's obsolete.

Thanks

kknox commented

Great. Yes, I think the rocm-kernel package is obsolete. The equivalent package in our new dkms world is rock-dkms. With dkms, you no longer have a custom monolithic ROCm kernel; you should have the stock Ubuntu kernel, and the ROCm components load like a kernel driver through dkms. With a stock Ubuntu 16.04 install, uname -r should report a 4.4 kernel version.
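A quick way to tell which kind of kernel is running might look like this. It is only a sketch: the `-kfd-compute-rocm-` pattern is an assumption based on the single kernel name that appears in this thread, and `is_custom_rocm_kernel` is a hypothetical helper name.

```shell
#!/bin/sh
# Succeeds if the kernel release string "$1" looks like one of the old
# custom ROCm kernels (pattern assumed from the name seen in this thread).
is_custom_rocm_kernel() {
    case "$1" in
        *-kfd-compute-rocm-*) return 0 ;;
        *) return 1 ;;
    esac
}

release="$(uname -r)"
if is_custom_rocm_kernel "$release"; then
    echo "$release: looks like a custom ROCm kernel"
else
    echo "$release: does not look like a custom ROCm kernel"
fi
```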

Yeah I still have 4.11.0-kfd-compute-rocm-rel-1.6-180 somehow

So the normal host kernel should work now? I thought it would only work with 4.16.

kknox commented

That kernel is from ROCm 1.6 (dkms does away with custom kernels). You should be able to revert to the stock Ubuntu kernel with:
sudo dpkg --purge linux-headers-4.11.0-kfd-compute-rocm-rel-1.6-180 linux-image-4.11.0-kfd-compute-rocm-rel-1.6-180
It sounds like you may have a mixture of 1.6 and 1.7 packages now; it would be best (i.e. the tested configuration) to double-check that all the 1.6 packages are removed.
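One possible way to look for leftovers is to filter the `dpkg -l` listing. This is a sketch under an assumption: it only matches package names containing `rocm-rel-1.6`, the suffix seen in the two names above, so 1.6 packages named differently would not show up. `leftover_16_packages` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the names of packages in dpkg -l style output (read from stdin)
# whose name contains "rocm-rel-1.6". The package name is the second column.
leftover_16_packages() {
    awk '$2 ~ /rocm-rel-1\.6/ { print $2 }'
}

# List candidates on this host; prints nothing if none match.
dpkg -l 2>/dev/null | leftover_16_packages
```

Anything printed here is a candidate for the same `dpkg --purge` treatment, after checking that it is not needed by the 1.7 install.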

Congrats to all involved, that was a bit of a pain point.

kknox commented

I found the uninstall directions for 1.6 here. It wouldn't hurt to try them. You may unintentionally uninstall a 1.7 package (not sure), but that would be trivial to reinstall with apt install.