bsdpot/pot

[BUG] Pot/Nomad orchestrated FreeBSD 13.0-RELEASE node panics every couple of days

grembo opened this issue · 2 comments

Describe the bug
A fleet of compute nodes, each running 10-20 pots orchestrated by nomad, crashes every few days; the nodes usually panic within a couple of hours of each other.

This is a problem with FreeBSD 13.0-RELEASE's ZFS implementation, but I'm recording it here anyway, as it affects pot users and there is probably a way to work around the issue.

To Reproduce
Steps to reproduce the behavior:

  1. Install pot/nomad on a host running FreeBSD 13.0
  2. Deploy 10-20 pots
  3. Wait a few days for a panic to occur

Expected behavior
Pot-orchestrated hosts shouldn't panic.

System configuration - if possible
Default is enough

Additional context
I managed to capture a kernel panic; the host crashes while invoking zfs:

Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 03
fault virtual address   = 0x18
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80bffeca
stack pointer           = 0x28:0xfffffe01e0bd5820
frame pointer           = 0x28:0xfffffe01e0bd5830
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 91596 (zfs)
trap number             = 12
panic: page fault
cpuid = 3
time = 1641116990
KDB: stack backtrace:
#0 0xffffffff80c40295 at kdb_backtrace+0x65
#1 0xffffffff80bf5d91 at vpanic+0x181
#2 0xffffffff80bf5b63 at panic+0x43
#3 0xffffffff810878f7 at trap_fatal+0x387
#4 0xffffffff81087966 at trap_pfault+0x66
#5 0xffffffff81086f8b at trap+0x2ab
#6 0xffffffff8105b808 at calltrap+0x8
#7 0xffffffff822cabb0 at zfs_onexit_destroy+0x20
#8 0xffffffff82146768 at zfsdev_close+0x58
#9 0xffffffff80a98347 at devfs_destroy_cdevpriv+0x97
#10 0xffffffff80a9bf64 at devfs_close_f+0x64
#11 0xffffffff80b98d2b at _fdrop+0x1b
#12 0xffffffff80b9c5e9 at closef+0x1d9
#13 0xffffffff80ba0697 at closefp_impl+0x77
#15 0xffffffff8105c12e at fast_syscall_common+0xf8
Uptime: 3d16h29m24s
Dumping 7555 out of 65271 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55              __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) #0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff80bf59bb in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:486
#3  0xffffffff80bf5e00 in vpanic (fmt=<optimized out>, ap=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:919
#4  0xffffffff80bf5b63 in panic (fmt=<unavailable>)
    at /usr/src/sys/kern/kern_shutdown.c:843
#5  0xffffffff810878f7 in trap_fatal (frame=0xfffffe01e0bd5760, eva=24)
    at /usr/src/sys/amd64/amd64/trap.c:915
#6  0xffffffff81087966 in trap_pfault (frame=frame@entry=0xfffffe01e0bd5760, 
    usermode=false, signo=<optimized out>, signo@entry=0x0, 
    ucode=<optimized out>, ucode@entry=0x0)
    at /usr/src/sys/amd64/amd64/trap.c:732
#7  0xffffffff81086f8b in trap (frame=0xfffffe01e0bd5760)
    at /usr/src/sys/amd64/amd64/trap.c:398
#8  <signal handler called>
#9  _sx_xlock (sx=0x0, opts=opts@entry=0, 
    file=0xffffffff8239be7a "/usr/src/sys/contrib/openzfs/module/zfs/zfs_onexit.c", line=line@entry=89) at /usr/src/sys/kern/kern_sx.c:325
#10 0xffffffff822cabb0 in zfs_onexit_destroy (zo=0x0)
    at /usr/src/sys/contrib/openzfs/module/zfs/zfs_onexit.c:89
#11 0xffffffff82146768 in zfsdev_close (data=0xfffff8000822c700)
    at /usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/kmod_core.c:197
#12 0xffffffff80a98347 in devfs_destroy_cdevpriv (p=0xfffff8051eff9b40)
    at /usr/src/sys/fs/devfs/devfs_vnops.c:197
#13 0xffffffff80a9bf64 in devfs_fpdrop (fp=0xfffff807882306e0)
    at /usr/src/sys/fs/devfs/devfs_vnops.c:211
#14 devfs_close_f (fp=0xfffff807882306e0, td=<optimized out>)
    at /usr/src/sys/fs/devfs/devfs_vnops.c:787
#15 0xffffffff80b98d2b in fo_close (fp=0xfffff807882306e0, 
    td=0xfffffe01e6a02300) at /usr/src/sys/sys/file.h:377
#16 _fdrop (fp=fp@entry=0xfffff807882306e0, td=td@entry=0xfffffe01e6a02300)
    at /usr/src/sys/kern/kern_descrip.c:3510
#17 0xffffffff80b9c5e9 in closef (fp=fp@entry=0xfffff807882306e0, 
    td=td@entry=0xfffffe01e6a02300) at /usr/src/sys/kern/kern_descrip.c:2828
#18 0xffffffff80ba0697 in closefp_impl (fdp=0xfffffe01ef4134f0, fd=5, 
    fp=0xfffff807882306e0, td=0xfffffe01e6a02300, audit=true)
    at /usr/src/sys/kern/kern_descrip.c:1271
#19 0xffffffff8108827e in syscallenter (td=<optimized out>)
    at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:189
#20 amd64_syscall (td=0xfffffe01e6a02300, traced=0)
    at /usr/src/sys/amd64/amd64/trap.c:1156
#21 <signal handler called>
#22 0x00000008007bb40a in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffe9c8
(kgdb) 

This problem looks a lot like this TrueNAS problem, which - according to the bug history - should be resolved in OpenZFS 2.0.4 (13.0-RELEASE uses OpenZFS 2.0.0; 13.0-STABLE, which I didn't have time to test, uses OpenZFS 2.1.x).
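
For reference, which OpenZFS version a host is actually running (both the userland tools and the kernel module) can be checked with:

$ freebsd-version -kru
$ zfs version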

The main contributor of zfs invocations is pot get-rss. The more containers are running, the more calls to it are made (one per second per pot), and it also seems that the more containers are running, the more likely a crash becomes (I didn't manage to find a reduced test case that crashes the host in a shorter amount of time though, so maybe something else has to happen in parallel).
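
To illustrate, the resulting load can be approximated by hand with a loop along these lines (pot1/pot2/pot3 are placeholder names and I'm assuming the usual -p potname option of pot's subcommands; the real calls come from the nomad pot driver's stats collection):

$ while true; do for p in pot1 pot2 pot3; do pot get-rss -p "$p"; done; sleep 1; done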

As a workaround (which also reduces load on the system in general), I applied the following patch to get-rss.sh:

--- share/pot/get-rss.sh.orig	2021-10-31 11:28:18.000000000 +0000
+++ share/pot/get-rss.sh	2022-01-02 13:13:17.066895000 +0000
@@ -70,8 +70,8 @@
 		get-rss-help
 		${EXIT} 1
 	fi
-	if ! _is_pot "$_pname" quiet ; then
-		_error "The pot $_pname is not a valid pot"
+	if ! _is_pot_running "$_pname" ; then
+		_error "The pot $_pname is not running"
 		${EXIT} 1
 	fi
 	if ! _is_uid0 ; then

It's yet to be seen if this really helps, but it seems logical, given that nothing else calls zfs on a cluster that is more or less idling with a couple of deployments that aren't being modified.
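
For context, this is a simplified sketch of what I understand the two checks to boil down to (not the actual pot code, and the zroot/pot dataset layout is just the default assumption): _is_pot validates the pot's ZFS datasets, which spawns a zfs process on every get-rss call, while _is_pot_running only asks the jail subsystem:

_is_pot() {
	# dataset validation - one zfs invocation per call
	zfs list -H "zroot/pot/jails/$1" > /dev/null 2>&1
}

_is_pot_running() {
	# only queries the running jails, no zfs involved
	jls -j "$1" name > /dev/null 2>&1
}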

Also, it seems like the cause of the panic might be resolved in openzfs/zfs@f845b2d.

I was able to reproduce the problem (crash the host within seconds) and could also confirm that applying openzfs/zfs@f845b2d fixes it.

I created a PR in FreeBSD's bug tracker documenting the problem and the solution for 13.0-RELEASE.

Copied from there, these are the steps to reproduce/fix:

$ cat >crashme.c<<EOF
#include <unistd.h>
#include <sys/stdtypes.h>
#include <libzfs_core.h>

int main(int argc, char** argv)
{
  /* four forks -> 16 processes hammering /dev/zfs in parallel */
  fork(); fork(); fork(); fork();
  for (int i=0; i<1000000; ++i) {
    /* each iteration opens /dev/zfs, issues one ioctl and closes it again */
    libzfs_core_init();
    lzc_exists(argc >= 2 ? argv[1] : "zroot");
    libzfs_core_fini();
  }
  return 0;
}
EOF

$ cc \
  -I/usr/src/sys/contrib/openzfs/include \
  -I/usr/src/sys/contrib/openzfs/lib/libspl/include \
  -lzfs_core -lzfs -o crashme crashme.c

$ ./crashme zroot

This doesn't require root privileges.

Applying the patch mentioned above fixes the problem:

# cd /usr/src/sys/contrib/openzfs
# fetch -o - \
  https://github.com/openzfs/zfs/commit/f845b2dd1c60.diff | patch -p1
# cd /usr/src
# make -j8 kernel
# reboot
...
$ ./crashme zroot && echo "I'm ok"
I'm ok
$