mej/nhc

NHC returns false "OK" when checking for mounted GPFS filesystems

novosirj opened this issue · 2 comments

To be honest, I'm not exactly sure whether this is because GPFS is doing something non-standard, or whether this would happen with any stale remote filesystem type.

[root@node001 ~]# nhc -a

[root@node001 ~]# mount | grep projectsn
projectsn on /projectsn type gpfs (rw,relatime)

[root@node001 ~]# df -h /projectsn
df: '/projectsn': Stale file handle

It makes the filesystem check pretty unreliable, as a stale file handle is one of the more likely things to go wrong. Any advice? This is with NHC 1.4.2, but I suspect this is not version-dependent.

mej commented

Hey Ryan!

Based on what I see here, NHC is reporting -- correctly -- that the filesystem is mounted. :-)

As you know, NHC very intentionally does not call df on each individual filesystem; in fact, check_fs_mount() doesn't use the df command at all, but instead looks at the current mount namespace directly via /proc/self/mounts. One of the key problems NHC takes great pains to avoid is getting hung up on mounted network filesystems that have gone AWOL (e.g., NFS hard-mounts with a down or lagging server).
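To illustrate why a mount-table lookup reports "OK" here, the following is a minimal Python sketch of that kind of check. It is not NHC's code (NHC is written in bash); the function names `is_mounted` and `read_mounts` and the sample text are made up for illustration. The point is that the lookup only consults the kernel's mount table and never touches the filesystem itself, so it cannot hang, but it also cannot notice a stale handle.

```python
def read_mounts(path="/proc/self/mounts"):
    """Return the raw mount table text (Linux only)."""
    with open(path) as f:
        return f.read()


def is_mounted(mounts_text, mount_point, fs_type=None):
    """Check whether mount_point appears in mount-table text.

    Each line has the form: source target fstype options dump pass.
    Hypothetical helper for illustration; octal escapes such as \\040
    (space) in paths are not decoded in this sketch.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _source, target, fstype = fields[0], fields[1], fields[2]
        if target == mount_point and (fs_type is None or fstype == fs_type):
            return True
    return False


# Sample text modeled on the mount output quoted in this issue.
SAMPLE = "projectsn /projectsn gpfs rw,relatime 0 0\n"

# The entry is present, so the check passes even if every access to
# /projectsn would fail with "Stale file handle".
mounted = is_mounted(SAMPLE, "/projectsn", "gpfs")
```

On a real node one would pass `read_mounts()` instead of `SAMPLE`; the result would still say nothing about whether the filesystem is usable.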

I haven't touched GPFS in years, and we no longer use it at LANL...but I'm open to suggestions! 😀

By any chance have you looked at @treydock's GPFS check in #71? Would something like that help your use case?

I actually don't know that this is specific to GPFS; if anyone has a tip for how to create a stale file handle (I don't know how to do it on purpose for NFS or GPFS), I could probably experiment some. Personally, I'd rather have NHC hang and report the hang than have it report a filesystem as mounted when it's only "technically" mounted and the node is unusable. These failures are very bad because they will drain the entire job queue if every job that runs fails because of a stale file handle on the user filesystem.

Would stat be safe? I went hunting around a little on the web when you asked this question, and I see someone else is using stat -t to detect a stale file handle. Again, without a way to test, it's hard to see what the behaviors would be. I completely understand not wanting to use df or something that is likely to hang under more circumstances.
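One way to experiment with the stat idea while limiting the risk of a hang is to run the stat call in a child process and give up after a timeout. The Python sketch below is only an illustration of that pattern, not an NHC check and not tested against a real stale GPFS or NFS mount; `probe_path`, its return strings, and the 5-second default are assumptions made up for this example. It uses `os.stat` in a child Python process rather than the `stat -t` command.

```python
import errno
import subprocess
import sys

# Child snippet: stat the path and exit with the errno (0 on success).
_CHILD = (
    "import os, sys\n"
    "try:\n"
    "    os.stat(sys.argv[1])\n"
    "except OSError as e:\n"
    "    sys.exit(e.errno or 1)\n"
)


def probe_path(path, timeout_s=5.0):
    """Return 'ok', 'stale', 'hung', or 'error:<ERRNO NAME>' for path."""
    proc = subprocess.Popen(
        [sys.executable, "-c", _CHILD, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        rc = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best effort only: a child blocked in uninterruptible I/O may
        # ignore the kill, so we deliberately do not wait for it again.
        proc.kill()
        return "hung"
    if rc == 0:
        return "ok"
    if rc == errno.ESTALE:
        return "stale"
    return "error:" + errno.errorcode.get(rc, str(rc))
```

Calling `probe_path("/projectsn")` on the node above would be expected to return `"stale"` if the kernel answers the stat with ESTALE, as df did, but that is exactly the behavior that still needs testing. A child stuck on a dead hard-mounted server may linger after the timeout, so repeated probes could pile up processes.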