cea-hpc/shine

Race may make module unloading fail in umount

Closed this issue · 2 comments

[This is for information of anyone else who hits the problem. I
already discussed it with Aurelien, and it's presumably a "wontfix".]

I'm using shine 1.4 and Lustre 2.7 on RHEL6, but maybe it's more
widely relevant.

I found that shine intermittently failed to unload the modules
when using "-L umount". When that happens (in an rc script in my
case), a subsequent shutdown/reboot hangs, as in
https://jira.hpdd.intel.com/browse/LU-6132, at least with
InfiniBand.

It turns out that /proc/fs/lustre/devices isn't updated synchronously
with the unmount of the filesystem(s), so the check that shine then
makes on that file in UnloadModules._already_done races with the
/proc update. As a result, the check sometimes spuriously detects
devices in use, which prevents the rmmod.

I initially worked around this by adding a short sleep and retrying
whenever _device_count returned > 0, which is obviously unclean.
Instead, I now stop lnet at shutdown with the rc script distributed
with Lustre, run after shine, to work round the issue.
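
For anyone who wants to try the sleep-and-retry approach anyway, here is a
minimal sketch of the idea. It is not shine's code: device_count,
wait_for_no_devices, and the retry/delay values are hypothetical names and
numbers chosen for illustration. It assumes that /proc/fs/lustre/devices
lists one device per line and that a missing file means no devices.

```python
import time

DEVICES_PATH = "/proc/fs/lustre/devices"  # the file named in this report


def device_count(path=DEVICES_PATH):
    """Count non-empty lines in the devices file.

    Assumption: one device per line; a missing or unreadable file
    is treated as zero devices.
    """
    try:
        with open(path) as f:
            return sum(1 for line in f if line.strip())
    except (IOError, OSError):
        return 0


def wait_for_no_devices(path=DEVICES_PATH, retries=5, delay=1.0,
                        sleep=time.sleep):
    """Poll until the devices file is empty.

    Checks up to retries + 1 times, sleeping `delay` seconds between
    checks. Returns True as soon as no devices are listed, False if
    devices are still listed after the last check.
    """
    for attempt in range(retries + 1):
        if device_count(path) == 0:
            return True
        if attempt < retries:
            sleep(delay)
    return False
```

The idea would be to call wait_for_no_devices() just before the module
unload step and to skip the rmmod only if it returns False. It still only
narrows the race window rather than closing it, which is why I consider
it unclean.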

Hope that helps someone.

Reported by: davelove

  • status: new --> closed
  • assigned_to: Aurélien Degrémont
  • Resolution: --> fixed
  • Priority: blocker --> minor

Original comment by: degremont

Fixed in [093c2f]

Original comment by: degremont