google/nsjail

Infinite loop when forking the child process with signal ERESTARTNOINTR

sfc-gh-hyu opened this issue · 7 comments

Nsjail is hanging, so I ran it under strace to figure out what's going on. Here is the output:

...
8147  write(3, "[D][2021-04-29T00:50:16+0000][81"..., 116) = 116
8147  rt_sigaction(SIGPIPE, {sa_handler=0xaaaac38fe198, sa_mask=[], sa_flags=0}, NULL, 8) = 0
8147  setitimer(ITIMER_REAL, {it_interval={tv_sec=1, tv_usec=0}, it_value={tv_sec=1, tv_usec=0}}, NULL) = 0
8147  write(3, "[D][2021-04-29T00:50:16+0000][81"..., 237) = 237
8147  socketpair(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0, [4, 5]) = 0
8147  write(3, "[D][2021-04-29T00:50:16+0000][81"..., 207) = 207
8147  clone(child_stack=0xaaaac3bda6b0, flags=CLONE_NEWNS|CLONE_NEWCGROUP|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWNET|SIGCHLD) = ? ERESTARTNOINTR (To be restarted)
8147  --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
8147  rt_sigreturn({mask=[]})           = 2114060305
8147  clone(child_stack=0xaaaac3bda6b0, flags=CLONE_NEWNS|CLONE_NEWCGROUP|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWNET|SIGCHLD) = ? ERESTARTNOINTR (To be restarted)
8147  --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
8147  rt_sigreturn({mask=[]})           = 2114060305
8147  clone(child_stack=0xaaaac3bda6b0, flags=CLONE_NEWNS|CLONE_NEWCGROUP|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWNET|SIGCHLD) = ? ERESTARTNOINTR (To be restarted)
8147  --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
8147  rt_sigreturn({mask=[]})           = 2114060305
8147  clone(child_stack=0xaaaac3bda6b0, flags=CLONE_NEWNS|CLONE_NEWCGROUP|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWNET|SIGCHLD) = ? ERESTARTNOINTR (To be restarted)
8147  --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
8147  rt_sigreturn({mask=[]})           = 2114060305
8147  clone(child_stack=0xaaaac3bda6b0, flags=CLONE_NEWNS|CLONE_NEWCGROUP|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWNET|SIGCHLD) = ? ERESTARTNOINTR (To be restarted)
8147  --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
8147  rt_sigreturn({mask=[]})           = 2114060305
8147  clone(child_stack=0xaaaac3bda6b0, flags=CLONE_NEWNS|CLONE_NEWCGROUP|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWNET|SIGCHLD) = ? ERESTARTNOINTR (To be restarted)
8147  --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
8147  rt_sigreturn({mask=[]})           = 2114060305
8147  clone(child_stack=0xaaaac3bda6b0, flags=CLONE_NEWNS|CLONE_NEWCGROUP|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWNET|SIGCHLD) = ? ERESTARTNOINTR (To be restarted)
8147  --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
8147  rt_sigreturn({mask=[]})           = 2114060305
8147  clone(child_stack=0xaaaac3bda6b0, flags=CLONE_NEWNS|CLONE_NEWCGROUP|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWNET|SIGCHLD) = ? ERESTARTNOINTR (To be restarted)
8147  --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
8147  rt_sigreturn({mask=[]})           = 2114060305
8147  clone(child_stack=0xaaaac3bda6b0, flags=CLONE_NEWNS|CLONE_NEWCGROUP|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWNET|SIGCHLD) = ? ERESTARTNOINTR (To be restarted)

It looks like the clone system call is being interrupted by SIGALRM and failing with ERESTARTNOINTR, which makes the kernel restart it, over and over. I searched online for why this error code occurs. There is some discussion suggesting it can happen if the parent process has allocated a lot of memory (which I doubt is the case here, since nsjail is pretty lightweight).
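To make the restart loop above concrete: in the trace, every clone attempt is cut short by the 1-second SIGALRM before it can finish. One possible mitigation, sketched below, is to block SIGALRM for the duration of the call. This is only a sketch under assumptions, not nsjail's code or its fix; ScopedSigalrmBlock and sigalrmBlocked are hypothetical names.

```cpp
#include <signal.h>

// Hypothetical RAII helper (not part of nsjail): blocks SIGALRM for the
// lifetime of the object and restores the previous signal mask afterwards.
class ScopedSigalrmBlock {
public:
    ScopedSigalrmBlock() {
        sigset_t block;
        sigemptyset(&block);
        sigaddset(&block, SIGALRM);
        // Remember the old mask so the destructor can restore it exactly.
        ok_ = (sigprocmask(SIG_BLOCK, &block, &old_) == 0);
    }
    ~ScopedSigalrmBlock() {
        if (ok_) sigprocmask(SIG_SETMASK, &old_, nullptr);
    }
    ScopedSigalrmBlock(const ScopedSigalrmBlock&) = delete;
    ScopedSigalrmBlock& operator=(const ScopedSigalrmBlock&) = delete;

    bool ok() const { return ok_; }

private:
    sigset_t old_;
    bool ok_ = false;
};

// Returns true if SIGALRM is currently blocked in the calling thread.
bool sigalrmBlocked() {
    sigset_t cur;
    sigemptyset(&cur);
    // With a null "set" argument, sigprocmask only reports the current mask.
    if (sigprocmask(SIG_BLOCK, nullptr, &cur) != 0) return false;
    return sigismember(&cur, SIGALRM) == 1;
}
```

A caller would create the guard in a scope around the clone call, so that a timer tick arriving mid-clone stays pending until the guard is destroyed. Note that a child created inside that scope inherits the blocked mask and would have to restore it itself.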

I increased the interval timer (setitimer) to 5 seconds (I think it is currently hard-coded to 1 second in static bool setTimer(nsjconf_t* nsjconf)), and then nsjail runs properly. Would it make sense to at least expose a configurable interval timer setting?
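As a rough illustration of what a configurable interval could look like, here is a minimal sketch that builds the same kind of periodic ITIMER_REAL spec seen in the strace output, but from a parameter instead of a fixed 1 second. The function names and the clamping to a 1-second minimum are assumptions for illustration; this is not nsjail's actual setTimer implementation.

```cpp
#include <sys/time.h>

// Hypothetical helper: build a periodic timer spec from a configurable
// number of seconds. Values below 1 are clamped to 1, because an
// it_value of zero would disarm the timer.
struct itimerval makeIntervalTimer(long seconds) {
    if (seconds < 1) seconds = 1;
    struct itimerval it{};              // zero-initialises the tv_usec fields
    it.it_interval.tv_sec = seconds;    // period between subsequent SIGALRMs
    it.it_value.tv_sec = seconds;       // delay before the first SIGALRM
    return it;
}

// Hypothetical wrapper: arm ITIMER_REAL with the given period.
// Returns true on success.
bool armIntervalTimer(long seconds) {
    const struct itimerval it = makeIntervalTimer(seconds);
    return setitimer(ITIMER_REAL, &it, nullptr) == 0;
}
```

With armIntervalTimer(5), SIGALRM would arrive every 5 seconds, which matches the manual change described above. It only widens the window for clone to complete; it does not remove the underlying slowness.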

Can you try with --disable_clone_newnet? The net namespace is known to cause deadlocks on some older kernels.

@robertswiecki I can try that, but if it's a deadlock, then increasing the timeout to 5 seconds should not help, right?

As I recall, process creation/destruction took some amount of time, and that time was proportional to n^2 (where n is the number of processes with non-default net namespaces) on certain kernels.

@robertswiecki I am facing this issue on an ARM machine. Here is my kernel version:

$ uname -r
4.14.209-160.339.amzn2.aarch64

Do you know if a newer kernel version fixes this problem?

Btw, disabling the clone_newnet option does fix the issue.

Using a newer kernel should help. But the history of fixing this bug is convoluted, with at least a few patches needed to fix it completely. So I don't know exactly which kernel version is good enough here.

From memory, it was fixed about a year ago, so maybe in 5.4 or 5.5.
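For anyone who wants to compare a running kernel against a candidate version, here is a small sketch that parses a uname -r style release string and checks it against a major.minor threshold. The names KernelVersion, parseKernelRelease and atLeast are hypothetical, and the threshold is a parameter because the version mentioned above is only a guess from memory. Distribution kernels also backport fixes, so a version comparison is a rough heuristic at best.

```cpp
#include <cstdio>
#include <string>

// Hypothetical holder for the leading "major.minor" of a kernel release.
struct KernelVersion {
    int major = 0;
    int minor = 0;
    bool valid = false;
};

// Parse the leading "major.minor" from a release string such as
// "4.14.209-160.339.amzn2.aarch64"; everything after it is ignored.
KernelVersion parseKernelRelease(const std::string& release) {
    KernelVersion v;
    v.valid = (std::sscanf(release.c_str(), "%d.%d", &v.major, &v.minor) == 2);
    return v;
}

// True if v parsed successfully and is at least major.minor.
bool atLeast(const KernelVersion& v, int major, int minor) {
    return v.valid &&
           (v.major > major || (v.major == major && v.minor >= minor));
}
```

For the release string reported in this thread, parseKernelRelease yields major 4 and minor 14, so atLeast(v, 5, 4) is false.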

@robertswiecki Sorry to bother you again. I would really appreciate it if you could give me a pointer to the exact issue happening here (maybe a Jira ticket or the hash of the fix commit). The reason I am asking is that we use AWS Graviton in our prod environment. I need to know the exact issue/fix so that I can raise a support ticket with Amazon to have them cherry-pick the fix into their customized build of Linux. Thanks a lot in advance.

The problem history can be found here: http://lkml.iu.edu/hypermail/linux/kernel/1611.1/01589.html

So... maybe this is a fix? Not sure, I'm not a kernel developer myself: http://lkml.iu.edu/hypermail/linux/kernel/1611.1/03591.html

The most knowledgeable person on this problem is probably Eric Dumazet (as per the lkml thread above). If this is important to you, emailing them (the address is visible in commits) and asking whether they can point you to the correct fix might work here.