Failing to clean up kfd_process information in sysfs
morrone opened this issue · 4 comments
We have noticed that the process directories under /sys/devices/virtual/kfd/kfd/proc are never being cleaned up. For instance, after a run of "rocm-bandwidth-test", the related process's directory under /sys/devices/virtual/kfd/kfd/proc stays around forever.
We are using the rocm 4.2.0 driver against a 4.18.0 kernel.
The code is using the mmu_notifier_put() strategy.
In debugging with systemtap, it would appear that kfd_process_notifier_release() is being called, but there appears to be no call to kfd_process_free_notifier().
I am also detecting no call to kfd_process_wq_release() using systemtap, and that appears to be where sysfs_remove_file() would be called.
Are we expecting kfd_process_notifier_release() to be called before kfd_process_free_notifier()?
Is it our expectation that the mmu_notifier_put() in kfd_process_notifier_release() should allow kfd_process_free_notifier() to later be triggered, allowing the final kfd_unref_process()?
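To make the sequence in question concrete, here is a simplified user-space model of it. This is only a sketch: every `toy_` name is hypothetical, the single reference count stands in for the kref on kfd_process, and the `core_works` flag is an assumed stand-in for whether the MMU notifier core ever invokes the free callback. It is not the driver's actual code.

```c
#include <stdbool.h>

/* Toy stand-in for struct kfd_process (hypothetical, heavily simplified). */
struct toy_process {
    int refcount;        /* models the reference held on behalf of the notifier */
    bool sysfs_present;  /* models the per-process directory under .../kfd/proc */
    bool free_pending;   /* set once the toy mmu_notifier_put() has been called */
};

/* Models kfd_process_wq_release(): final teardown, including sysfs removal. */
static void toy_wq_release(struct toy_process *p)
{
    p->sysfs_present = false;
}

/* Models kfd_unref_process(): drop one reference, tear down on the last one. */
static void toy_unref(struct toy_process *p)
{
    if (--p->refcount == 0)
        toy_wq_release(p);
}

/* Models kfd_process_free_notifier(): the deferred callback that drops the ref. */
static void toy_free_notifier(struct toy_process *p)
{
    p->free_pending = false;
    toy_unref(p);
}

/* Models kfd_process_notifier_release(): it only requests the deferred free. */
static void toy_notifier_release(struct toy_process *p)
{
    p->free_pending = true; /* stands in for mmu_notifier_put() */
}

/* Models the notifier core; core_works == false models the callback never firing. */
static void toy_run_deferred(struct toy_process *p, bool core_works)
{
    if (core_works && p->free_pending)
        toy_free_notifier(p);
}

/* Returns true if the sysfs entry is still present after the process exits. */
bool toy_exit_leaves_sysfs(bool core_works)
{
    struct toy_process p = { .refcount = 1, .sysfs_present = true, .free_pending = false };

    toy_notifier_release(&p);
    toy_run_deferred(&p, core_works);
    return p.sysfs_present;
}
```

In this model, `toy_exit_leaves_sysfs(false)` leaves the entry behind, which matches the symptom described above: the release callback runs, but the final unref and sysfs removal never happen.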
4.18 sounds like you're using RHEL 8 or CentOS 8. There is a workaround for a bug in the RHEL/CentOS 8.3 kernel in ROCm 4.2. Is that somehow not working for you? Or are you using a different RHEL/CentOS version that is not covered by this workaround?
commit 51c9f2d
Author: Felix Kuehling <Felix.Kuehling@amd.com>
Date: Wed Jan 20 14:29:34 2021 +0800
drm/amdkcl: Work around mmu_notifier_put issue on RHEL 8.3
The DRM backport from kernel 5.6 includes some MMU notifier changes
that cause problems with the mmu_notifier_put function. The
free_notifier never gets called. This leads to a leak of kfd_process
structures and their doorbells.
Work around this by falling back to the old method of releasing the
MMU notifier and destroying the process structure.
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Flora Cui <flora.cui@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
This is RHEL based, but for the TOSS OS. We have a patched kernel, and it looks like we are taking the RHEL rock dkms rpm and reworking it into a statically built kmod rpm.
I suspect that something in our stack doesn't let that patch undef HAVE_MMU_NOTIFIER_PUT, because it certainly looks like our end product was compiled to use mmu_notifier_put() rather than the alternate method.
Thanks, this helps a lot! Now I can stop trying to fix MMU notification and just focus on making it build to use the alternate method.
Here are the versions on our system:
[ 130.804315] [drm] amdgpu version: 5.9.25
[ 130.808273] [drm] OS DRM version: 5.9.0
Commit 51c9f2d checks for DRM_PATCH == 6, so that is almost certainly why it doesn't undef HAVE_MMU_NOTIFIER_PUT for us at DRM patch level 9. That is easy enough to patch and test on our side.
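For anyone following along, a minimal sketch of why an exact-match guard misses our build. The `DRM_PATCH == 6` comparison and the `HAVE_MMU_NOTIFIER_PUT` macro come from the discussion above. The `DRM_VER` test, the hard-coded values, and the helper function are assumptions made purely for illustration and are not copied from the commit.

```c
/* Assumed values, mirroring the "OS DRM version: 5.9.0" line quoted above. */
#define DRM_VER   5
#define DRM_PATCH 9

/* Pretend the build's configure step detected mmu_notifier_put(). */
#define HAVE_MMU_NOTIFIER_PUT 1

/* Exact-match guard in the style described for commit 51c9f2d (assumed shape). */
#if DRM_VER == 5 && DRM_PATCH == 6
#undef HAVE_MMU_NOTIFIER_PUT
#endif

/* Hypothetical helper: reports which release path this build would compile in. */
int uses_mmu_notifier_put(void)
{
#ifdef HAVE_MMU_NOTIFIER_PUT
    return 1; /* mmu_notifier_put() path */
#else
    return 0; /* older fallback path */
#endif
}
```

With DRM_PATCH at 9 the guard is skipped, so `uses_mmu_notifier_put()` returns 1 and the build keeps the mmu_notifier_put() path. Whether the right local fix is to widen the comparison or to undef the macro unconditionally would need testing against the kernel in question.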