famzah/popen-noshell

Portability option to use vfork()? (or posix_spawn())

Closed this issue · 67 comments

Portability option to use vfork()? (or posix_spawn())

You can't capture the output of the child process if you use vfork(). More info here: https://blog.famzah.net/2009/11/20/a-much-faster-popen-and-system-implementation-for-linux/

Am I missing something?

(I should have encouraged to use posix_spawn() rather than vfork(). It's a bit of a bear (EDIT: a bear to code to, not a bear performance-wise; but on the plus side you don't have to worry about what is safe to do on the child side), but very portable, and most implementation use or support vfork. Glibc's posix_spawn() supports a POSIX_SPAWN_USEVFORK flag, for example (it uses fork() by default, which is bad for the reasons you give).)

Your post is wrong as to vfork(). I hope this doesn't come across as patronizing.

The child of vfork() can very much change its file descriptors without affecting the parent. Indeed, there are vfork-using implementations of posix_spawn() that work just fine. From the Linux vfork manpage:

       As with fork(2), the child process created by vfork() inherits copies
       of various of the caller's process attributes (e.g., file
       descriptors, signal dispositions, and current working directory); the
       vfork() call differs only in the treatment of the virtual address
       space, as described above.

(The standard's vfork() manpages also make this clear. Search for "open group vfork".)

With vfork(), or vfork-using-posix_spawn(), you can setup the child's stdout to be the write end of a pipe you can read from in the parent, just as you do with clone(). But you get two benefits:

  • you don't have to allocate a stack for the clone (since the parent thread and child process will share the parent thread's stack)

  • the result is more portable (Solaris, *BSD, and others, have vfork(); only Linux has clone().

There might be downsides to using vfork(). IIRC some OSes (but not Linux) stop all threads in the parent process, not just the one that called vfork(), but that's a silly mistake -- only _exit(), other process death, and exec system calls in other parent threads must be held for the vfork() child to exex-or-exit (bad idea: the child could keep the parent process from dying), or perhaps the child must be allowed to continue anyways? or perhaps it must be killed. But of course, those OSes don't have clone() anyways, so it's not like this is a downside you need to concern yourself with too much :)

Also, either way you must use only async-signal-safe library functions on the child side of vfork() or your clone() call. No malloc() or free() calls, for example; no exit() either. No *printf()s. But dup2(), close(), write(), _exit() -- these are safe.

Interesting... :)

The first paragraph of the vfork() man page says the following:

   Standard description
       (From POSIX.1) The vfork() function has the same effect as fork(2),
       except that the behavior is undefined [1] if the process created by
       vfork() either modifies any data other than a variable of type pid_t
       used to store the return value from vfork(), [2] or returns from the
       function in which vfork() was called, [3] or calls any other function
       before successfully calling _exit(2) or one of the exec(3) family of
       functions.

Is the man page wrong about [1], [2], or [3]?

The Open Group man page confirms the above statements:

The use of vfork() for any purpose except as a prelude to an immediate call
to a function from the exec family, or to _exit(), is not advised.
...
Care should be taken, also, to call _exit() rather than exit() if exec cannot be used, 
since exit() flushes and closes standard I/O channels, 
thereby damaging the parent process' standard I/O data structures.

Do you have sample code to demonstrate what you suggest?

Yes, they're both wrong that you cannot call any other functions than _exit() or an exec-family function. Historically vfork() is very dangerous because you can easily corrupt the stack or heap of the parent, so the restrictions on what the child can do are a bit heftier than merely not calling any non-async-signal-safe functions. But dup2() and such are very much safe. The reason the manpages don't say this is that they want you to use posix_spawn() instead whenever you might want to use vfork(), and the reason for that is that the OS developer can decidedly know what is safe and what isn't. But the unsafety concerns are overwrought, and plenty of software uses vfork() in much the same way that implementations of posix_spawn() do.

If you look at glibc, it supports using vfork() in its posix_spawn() implementation. Illumos' (erstwhile OpenSolaris) posix_spawn() use vforkx(), a version of vfork() that allows variant behavior as to how the child's death will be handled in the parent, but which is otherwise identical to vfork(). FreeBSD's posix_spawn() uses vfork() as well. Livast (part of ksh93) uses vfork().

Now, if you want to steer clear of vfork() because of the warnings of undefined behavior, your clone() usage is very similar to vfork() anyways, with the only safety you gain relating to what the child can do to its stack. What you really want, if you want safety, is to use posix_spawn(), and in the case of glibc you'll want to set the POSIX_SPAWN_USEVFORK flag so that you get the copy-on-write savings that you're after.

Links:

http://src.illumos.org/source/xref/illumos-gate/usr/src/lib/libc/port/threads/spawn.c#304
http://src.illumos.org/source/xref/illumos-gate/usr/src/lib/libast/common/comp/spawnveg.c#198
http://src.illumos.org/source/xref/freebsd-head/lib/libc/gen/posix_spawn.c#204
http://code.metager.de/source/xref/gnu/glibc/sysdeps/posix/spawni.c#106

I should add that fork() itself is very unsafe!

Lots of libraries break if you use them, fork(), then continue using them in both the parent and child.

In general the only way it can be safe to do anything other than having the parent exit, or the child exec-or-exit soon after the fork, is to fork so early in the process' life that no libraries can have fork-unsafe behavior.

Personally, I find process forking to be rather evil for this reason. I much prefer the spawn approach.

Years ago, before I had a taste of fork-unsafety, I used to think that the Unix approach was so much better than the Windows approach. Now I'm in the fork-is-evil-use-spawn-instead camp. However, you're basically building a spawner, and so here fork() and vfork() are OK in my world-view :)

(In my experience, writing fork-safe code can be much harder than writing thread-safe code. Particularly fixing fork-unsafe code to be fork-safe.)

Also, long ago BSD removed vfork() for a while because they felt that it was terribly difficult to use safely, but that was a bunch of hogwash. That's where the instill-fear approach to vfork() comes from.

posix_spawn() is the compromise. You'd do well to use it, and you'd get even more portable code. But again, I'd not begrudge you using vfork() -- it's barely less safe than clone(). Even with clone() as you call it you must certainly not call non-async-signal-safe functions on the child side -- that the clone(2) manpage doesn't say so doesn't say so doesn't make it safe to call such functions on the child side.

The history of vfork() and clone() is important to understanding the ways in which both manpages are incomplete or inaccurate. But using posix_spawn() definitely puts you into squarely safe and portable territory.

I should also add that vfork()'s first and most famous user was the C-Shell (which no one should use, but for reasons other than its use of vfork()), which, like any shell that supports redirection, very much has to call dup2() (or fcntl(), or dup() -- I think this predated dup2()) and close(). The C-Shell pretty much still works that way. vfork() basically hasn't changed at all in the last 35 years.

The only interesting wrinkle in the (non-)evolution of vfork() is what to do about other threads in the parent process. There's two options: let them keep running, or stop them. If you let them keep running then you have to decide what happens if any of them causes the parent process to die, exit, or exec: do you kill the stopped vfork() parent too? or do you hold up the parent's death/exec in order to allow the child of vfork() to _exit() or exec()? There have been other less interesting (perhaps) wrinkles like: what if the child of vfork() calls fork() or vfork() before execing? But all implementations of vfork() allow the child to complete setting up I/O redirection -- the C-Shell, ksh, posix_spawn(), and others, all would break otherwise.

Oh, also from the Linux manpage:

A call to vfork() is equivalent to calling clone(2) with flags specified as:

CLONE_VM | CLONE_VFORK | SIGCHLD

which together with this from the clone(2) manpage:

CLONE_VFORK (since Linux 2.2)
    If CLONE_VFORK is set, the execution of the calling process is suspended until the child releases its virtual memory resources via a call to execve(2) or _exit(2) (as with vfork(2)).

    If CLONE_VFORK is not set then both the calling process and the child are schedulable after the call, and an application should not rely on execution occurring in any particular order. 

should clarify things further.

Lastly, from the Linux manpage of vfork(2):

History

The vfork() system call appeared in 3.0BSD. In 4.4BSD it was made synonymous to fork(2) but NetBSD introduced it again, cf. In Linux, it has been equivalent to fork(2) until 2.2.0-pre6 or so. Since 2.2.0-pre9 (on i386, somewhat later on other architectures) it is an independent system call. Support was added in glibc 2.0.112. 

which means that you want to use clone() on Linux if you want to run on very old Linux kernels / glibcs.

I should also add that the *BSD manpages for vfork(2) are not nearly as scary as the ones from Linux or Open Group are. BSD, of course, is the source of vfork(2) in the first place.

I'm researching this and need more time... I'm amazed by your expertise in the *fork() functions and implementation details. :)

In the meantime, I have a few comments/questions.

(1) It seems like "glibc" completely switched posix_spawn() to always use vfork() ?

(2) I don't use CLONE_VFORK but all other implementations seem to use it. Do you know why it's important to suspend the parent process until the child releases its virtual memory resources via a call to execve(2) or _exit(2) ? (in the context that I use it here as a spawner)

  • Is it related to what you quote above in regards to the "interesting wrinkle in the (non-)evolution of vfork()" ?
  • Is this a problem only in multi-threaded applications?

(3) How do you know that async-signal-safe functions are safe to be used within a vfork() child process?

  • Those "libc" functions are guaranteed not to modify memory used by the parent (as explained here) ?

(1) would not surprise me.

Regarding (2)... In the beginning... there were no threads, only single-threaded processes. fork() copied the parent's heap and stack (copy-on-write, mmap(), all came later), and set both processes to run. vfork() makes the child borrow the parent's address space, and the child runs in the same stack, in the same stack frame as the parent. Obviously two threads can't possibly run at the same time on the same stack! So the parent process was stopped, the child was started, and the parent wouldn't run again until the child either exited or exec'ed.

Fast-forward a bit and we have all sorts of new things, like COW and threads. COW made vfork() seem unnecessary at first, but later it turned out that vfork() is still a huge performance win over fork().

But what about threads? Well, I don't see any reason to stop any threads in the vfork() parent other than the one that called vfork() -- that one thread must be stopped because you can't have two threads sharing a stack. Now, the original vfork() manpage said that the parent process is stopped, so I suspect that that's the only reason any implementor would make vfork() stop all other threads in the parent.

My take is that vfork() should only stop the one parent thread that called it. Consider a typical heavy-weight multi-threaded process, a JVM, say. A bunch of threads may be servicing clients, others may be waiting for work, one may be trying to spawn a new process: why should threads other than the one spawning a child have to stop? I can't think of any useful code one could write where that would be desirable. Stopping any other threads is a bug, but, whatever, it's still better than fork()!

Regarding (3)... it's already the case that it's not safe to do certain things in the child-side of plain fork(). For example, using stdio at all. Consider this:

    fflush(stdout);
    fflush(stderr);
    pid = fork();
    if (pid == -1)
        err(1, "couldn't fork");
    if (pid == 0) {
        printf("Child process here\n!");
        fflush(stdout);
        _exit(0);
    }
    printf("Parent process here\n");

If there are other threads in the parent writing to stdout via stdio, this is not safe!

Calling malloc() on the child-side of fork() is probably ok, but it's probably not OK on the child-side of vfork() (can the child block on a mutex held by another thread in the parent?!).

POSIX talks about async-signal-safe functions: functions that are safe to call in handlers for asynchronous signals. These are typically just system calls.

vfork() adds some safety constraints. For example, it would be very bad for the parent if the vfork() child returned out of its function frame! So you have to be careful. The thing to do in an implementation of something like posix_spawn(), or your popen alternative, or a shell, is to set everything up so that you can call vfork() and have the child call nothing other than dup(2) and close(2) (for I/O redirection, though you should rely on O_CLOEXEC for closing), then exec*() and _exit(). You can call write(2) on the child side though, certainly, and a few other system calls too (see below re: signals).

Now, to answer the question in the SO you linked, in the case of vfork() the child should simply write the errno from exec in a local variable, which the parent can check after it resumes. If you must use fork(), you can use a pipe instead.

The thing I typically do for daemonizing portably is this:

  • call pipe(2), make it cloexec
  • call fork() (more portable than vfork(), and if we're daemonizing then we're a small process anyways and fork() is fine
  • the child then creates listening sockets, whatever, and when it's ready writes to the write end of the pipe
  • the parent blocks read()ing from the pipe; if EOF -> child died in a fire, else the service is ready and the parent can exit.

Now, let's talk about this. The bit about set*id() is silly for many reasons, and just scaremongering. The thing that's wrong about multithreaded processes calling set*id() and vfork() is multithreading: even if you take *fork() out of the picture you still have problems. (Mind you, on WIN32 it's possible to run different threads in one process with different "access tokens", which means: as different users. This is strange but not too strange, as Unix kernels have the same situation all the time. One of the biggest problems with allowing multiple threads to have different credentials is that .init sections in shared objects might run with unexpected credentials. But let's ignore WIN32 for now.) The thing to do about multithreading and set*id() is to not mix them! As to the situation with signals, that was real and not really an issue any more, as described in that article, but it is an issue for you if you choose to use vfork(), but also if you use clone(), so be careful!

I should add that there's only two good things to do in async signal handlers (synchronous signal handlers are another story):

  • write to a sig_atomic_t variable (e.g., to have a SIGUSR1 handler that increments a verbose flag)
  • write(2) to a pipe that is polled for (with epoll(2), poll(2), select(2), event ports, or anything else of the sort)
  • EDIT: it's also safe to check getpid() to decide not to do things in child processes...

The above are safe even if you fork() or vfork().

ALSO, calling open(2) in a signal handler is bad whether you're using fork() or vfork(), since both cases can yield unexpected results, so the scaremongering in http://ewontfix.com/7/ is just lame all around, and a very good example of how people incorrectly argue that vfork() is evil, when the real evil is _fork()_!!

fork() is quite evil. There are many, many problems with it. Spawning is a much better approach. posix_spawn() is much better than fork(), even if internally it uses fork() (but you hope it uses vfork() instead!).

Just say no to fork()!

Async signal handlers too are evil. About the only things worth doing with them are writing to sig_atomic_t variables or to pipes so as to turn async signals into async events handled like any other I/O event. Indeed, there is signalfd(2), but write(2)ing to a pipe is universally portable among the Unix and Unix-like OSes, so do that.

But vfork()? Meh, it's easy enough to use, provided you do those things which I suggested above and earlier. Is it difficult or dangerous to use vfork()? Only if you don't read the docs. Of course, the docs often tell you it's evil and dangerous, and they fail to discuss the correct and safe use of vfork(), but that's only because the vfork()-is-evil meme goes a long time. What a destructive mistake though. Because of 4.4BSD's removal of vfork() it has taken decades to get a JVM that doesn't suck when spawning children.

Thanks for the praise earlier. I was afraid I'd turned you off by appearing to be a know-it-all.

Anybody calling open(2), connect(2), socket(2), and such in an async signal handler deserves what's coming to them :)

If fork() is evil, why does it exist? Well...

In the beginning it was less evil because there were no threads, signals were mostly fatal, and every program was simple.

And in the beginning fork() made it very easy to hack on a shell: it's just user-land code, so there's no reboot cycles in the debug cycle.

A trivial spawn*() with the same signature as exec*() would have required having a helper program to finish up I/O redirection and such, with instructions passed in via a pipe or arguments. But such a program could have been tiny and cached forever, making spawn() much, much faster than fork()ing. And it would have been a user-land helper program, so easily hacked on.

So even in the beginning fork() had at least a tiny bit of evil: the unnecessary copying of the parent address space. fork() is originally evil, you could say!

Mind you, fork() is also very nice in some cases. For example, we use it in Heimdal's KDC to implement a multi-process service. If you fork() early in a process' life, it's safe enough and not very evil.

Even now you could use a helper program to avoid the signal handler races (one of the linked things talks about this).

To use a helper, if you want to (I don't think it should be required, but it's kinda nifty!), just:

  • don't use O_CLOEXEC (the helper program will close those things) for any intermediate fildes
  • pass the list of actions (redirections, setpgrp(2), setsid(2), ...) either as an argument to the helper, env var, or in a data structure stored in a pipe or unlinked open file
  • prepend the helper to the argument list
  • vfork(), exec() the new argument list, record the errno from exec*(), _exit(2)
  • the helper should recover and decode the actions list, perform the actions in order, then exec*(2) the real program by "popping" itself from the argv[].

In order to properly pass errno from the exec in the helper back to the parent, use a pipe which the helper will set to be O_CLOEXEC. The parent should read from the pipe as soon as vfork() returns; if EOF then the exec call in the helper succeeded, else either it failed or the exec of the helper failed, and you can check a local variable to see which was the case.

Of course, it's easier to use your implementation if it doesn't need a helper :) So don't use a helper. Don't worry too much about the signals race.

Nico, thanks for all your efforts to explain everything in details. I really enjoy this productive conversation.

Duh, having the same stack requires the parent thread to be stopped, right :) An interesting fact is the current posix_spawn() implementation in "libc" -- they stop the parent, even though the clone()'d child has its own stack. That's because they don't use CLONE_SETTLS, in order for the child process to have its own TLS like "errno", etc.

I'm starting to find too many flaws in my current implementation :) Am I doing anything right?

  1. I don't stop the parent process and thus "errno" and possibly other global variables could be overridden if the redirection-setup-phase fails, or if the exec() fails?
    • If I use CLONE_SETTLS, is this issue resolved completely?
    • I want to avoid the parent halt at all cost; it just doesn't make sense and seems like a performance issue if you spawn a lot of processes.
  2. The signal handlers race-condition is not handled at all. Do I need to stop the parent for sure, in order to handle this properly?
    • It looks so, because the parent must temporarily block the signals before calling clone() or vfork(), as the current posix_spawn() implementation in "libc" does.
  3. As explained here, if I switch to vfork(), this could (unexpectedly) run any "atfork" handlers?
  4. Multi-threaded applications must be extra careful, especially with setuid() calls and its friends.
  5. Are there other problems that you spot?

P.S This with the helper program is an interesting approach but probably has some performance penalty due to the double exec()? Additionally, how does it solve the signals race condition? It seems like the time window gets smaller because we do much of the work in the helper, but still the race condition exists between the vfork() and the exec() of the helper?

(1) No idea. I'm not an expert on Linux's clone(2). But it's good you caught these issues too!

As to (1b), I'm not sure that vfork(2) stops all other parent threads on Linux. You might want to test it. I don't know which OSes' vfork(2) do and don't stop all other parent threads, but I believe Solaris' does. Clearly, stopping just the thread calling your function is just fine: there's nothing for it to do anyways other than wait for the child to indicate readiness/outcome, so not stopping it is not much of an optimization.

You can always use posix_spawn() and wash your hands of the mess, point to the libc/OS maintainers/vendors, whatever, maybe also the Open Group, and be done. There is wisdom in this, though, of course, you might be able to do better on Linux, and evidently you want to! :)

(2) The signal handler race is about signal handlers running on the child-side of *fork(2)/clone(2) and which do things like open(2), close(2), dup(2), and so on. Those change the FD table on the child side in ways that might be unexpected. Now, if anyone does this in a signal handler, they should get what they deserve. So I really would not worry about the signal handler race condition, really. If you care to, however, you can basically block all signals before forking/cloning, then reset signal handling on the child side for all signals, then unblock signals before proceding to exec. That's a lot of work that I don't think is justified here.

(3) That is just glibc being fucking broken as fuck. Excuse my French, but there's no polite way to put it.

glibc's posix_spawn() should always have used clone(2)/vfork(2), but instead used fork(2), and so they always violated POSIX as to the atfork handlers! Now that they wised up and started doing something better, now they care about compatibility. So first they break all code that uses posix_spawn(), and now they worry about breaking code built to work with glibc. Great going guys!!!

IMO glibc should just stop calling atfork handlers in posix_spawn(), full stop.

(4) Multi-threaded applications should NEVER call set*id(2), full stop. (Early on, before starting threads, sure, but never after.) (4) is a non-issue, and has nothing to do with an fork-like or clone-like system call, and anything else anyone tells you is wrong and scaremongering.

There's nothing you could possibly do to protect against another thread doing a setreuid(2) or other set*id(2) call behind your back. The best you could do is check that the IDs are the same on the child side as they were on the parent side when your function was called, but that doesn't mean there's no race in the whole parent. Suppose a thread in the parent does a setreuid(2) call then a setregid(2) call, then setgroups(2), and another thread calls your function, and somewhere in the middle of the first thread's sequence of calls you start the child and it has the same credentials as you got on the parent side when you started... Is that OK? No. It's not. What could you do to protect against that? Nothing. So it's just not your problem.

(5) I've not reviewed your code, honestly, not further than to notice that it's not portable :)

Also, anything that uses the GPL is basically not interesting to me. LGPL is OK in some cases. I prefer BSD for just about everything, though various community licenses work for me too. So why am I helping you? I'm just providing friendly advice! There aren't many better-than-popen projects, so I think yours could use a hand, and here I am giving it.

BTW, I tested vfork() on Linux. It does not stop all other threads in the parent.

My test has the main thread start two threads, one printing "worker here\n" every second, while the other calls vfork(), prints "child here\n" on the child side and waits 10 seconds, then exits; the parent prints "parent here\n" then exits. This is the result:

$ ./vfork-test
worker here
child here
worker here
worker here
worker here
worker here
worker here
worker here
worker here
worker here
worker here
parent here

So, in fact, there's absolutely no reason that you should not use vfork(2), if you choose not to use posix_spawn(), which you might because of the atfork handler business. Though, honestly, I'd not worry too much about the atfork handlers -- just point the finger at glibc (and laugh, and laugh, and cry, and cry).

EDIT: And on non-Linux systems, whether vfork(2) stops all thread or just the calling thread does not matter: even if that happens fork(2) is still worse. So just use vfork(2) or posix_spawn() and stop monkeying with clone(2). IMO :)

Test program attached: vfork-test.c.

I'm thinking I should take all my commentary and write up a blog post titled "fork() is evil, vfork() is not". Or something like that.

Oh, and I misunderstood one of your questions. vfork(2) does NOT run atfork handlers. It would be very weird if it did since an atfork handler that works with fork(2) can easily blow up with vfork(2). The man page says so. E.g., http://man7.org/linux/man-pages/man2/vfork.2.html says:

   Linux notes
       Fork handlers established using pthread_atfork(3) are not called when
       a multithreaded program employing the NPTL threading library calls
       vfork().  Fork handlers are called in this case in a program using
       the LinuxThreads threading library.  (See pthreads(7) for a
       description of Linux threading libraries.)

The Open Group man page for vfork(2) does not mention pthread_atfork() at all, whereas the Open Group man page for fork(2) does. And so on.

Also, regarding the helper program idea... A very small helper program will be cached in memory. The exec system calls should be very fast by comparison to fork(2), so you probably won't notice.

As to signals... yes, there remains a race condition if you have signal handlers that can affect the outcome of dup2(2) and close(2) calls done in the child-side of vfork(2), but again, I don't think much of signal handlers that could do that, or the people that write them!

Actually, dealing with signals seems not so bad anyways. Block them in the parent, reset signals with handlers to SIG_DFL in the child, restore the old mask in the child (and, later, in the parent).

I think we've cleared it up almost entirely. Only one thing still bugs me a little. Why would you state that "stopping just the thread calling your function is just fine: there's nothing for it to do anyways other than wait for the child to indicate readiness/outcome".

Some applications do extensive process spawning, like Nagios for example, which runs many parallel checks via external binary programs. I have an instance which spawns thousands of processes per minute. Stopping the parent process for the short time between vfork() and the exec() introduces unneeded latency in the parent process. The best solution is that the parent initiates the spawn of a process, gets the file descriptor to communicate with the child process, and continues with its tasks. It then regularly polls via select() or similar calls whether this child has something to say, and/or if it died by receiving an EOF in the pipe. You can poll many such file descriptors in parallel from the parent process.

P.S. I must admit that the vfork() approach is much more standard and well tested.

Oh, I agree that the calling thread in the parent could just asynchronously handle the child's readiness/error message, and indeed, everything should be all async all the time. That's a bit of a mantra for me, and yet I missed the possibility that the parent thread might not have to wait synchronously :( Good on you to think of it!

If you can get semantics like that with clone(2) then you're golden, though obviously your API needs a way to hook into an event loop, or you must run your own (in a helper thread) and then have a completion callback.

Of course, a synchronous spawn API that's very light-weight could easily be run from a thread anyways to turn it into async. vfork(2) makes it fairly light-weight, so you can easily provide a vfork(2) based version of your async API that starts a thread with a very small stack (because it won't need much; the stack is one of the most expensive aspects of a thread), calls the synchronous version, and posts the event completion / calls the callback when done and exits the thread.

So with clone(2) you might be able to optimize away a thread create/exit, which is... nice!

Oh right! This is a popen type API. There's no need to do anything fancy for async behavior: just delay error processing until the caller cleans up. After all you get a pipe to read from (you'll get EOF if the spawn fails) or to write to (you'll get EPIPE/SIGPIPE if the spawn fails). Fantastic.

OK, I have now looked at the code a bit, and here, then are the minimal changes you should make:

  • Add CLONE_TLS to the clone(2) call
  • Don't free(stack) at 286 (i.e., in the child). If that's there for valgrind, find another way to quiet it.

For portability, use #ifdef to use clone(2) on Linux, vfork(2) if it's available, or fork(2) otherwise.

I'd gladly let you post this on my blog as a guest author, if you don't find a better place :)

I'll only rather call clone() "too hackish".

I have a blog, but thanks. If you'd like to link to this, that'd be great. I've added more text, and a partial avfork() implementation on top of vfork().

Can I have the URL address, please.

Admittedly I've not updated it in ages.

@nicowilliams, I've addressed everything we've discussed here.

The "free(stack) at 286 (i.e., in the child)" is valid, because if Valgrind "mode" is enabled by the macros, we specifically always use fork() and this "stack" is not used in the child process. So it's safe to free() it.

I didn't implement the temporarily block of the signal handlers in the parent process, because this use-case is rare. Instead, I put a note in the README.

Thank you for everything.

@nicowilliams, it seems that I spoke too early. CLONE_SETTLS didn't work as expected and more-or-less I'm using vfork() for the library. What a sad day for the spawn efficiency on Linux :)

More info at: 64c51f5

On the bright side, I implemented posix_spawn() as a possible mechanism, and once glibc 2.24 becomes widely used (very soon), my library will be useful mainly as an example on how to use posix_spawn().

I've left comments at 64c51f5. You're awesome. Thanks for implementing this!

BTW, glibc barely mentions CLONE_SETTLS... In sysdeps/unix/sysv/linux/spawni.c it says there's no need to use it because it's using CLONE_VFORK -- there may be some history there.

In sysdeps/unix/sysv/linux/createthread.c it uses it with a bunch of other flags:

  const int clone_flags = (CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SYSVSEM
                           | CLONE_SIGHAND | CLONE_THREAD
                           | CLONE_SETTLS | CLONE_PARENT_SETTID
                           | CLONE_CHILD_CLEARTID
                           | 0);

There's no other reference.

Incidentally, if you called only system calls via syscall(2) in the child-side of a clone() without CLONE_SETTLS then you can avoid the need to have a separate TLS. Another possibility is to only call one system call on the child-side: execve(2), via syscall(2), and to exec a wrapper program which will then apply any actions needed prior to execve(2)ing the real program.

I think I found the issue with CLONE_SETTLS. See this comment.

http://stackoverflow.com/questions/10060684/how-to-allocate-a-new-tls-area-with-clone-system-call

If you are planning to use libc in your newly-created thread, you should abandon your approach: libc expects TLS to be set up a certain way, and there is no way for you to arrange for such setup when you call clone directly. Your new thread will intermittently crash when libc discovers that you didn't set up TLS properly. Debugging such crashes is exceedingly difficult, and the only reliable solution is ... to use pthread_create.

I think you're right. We should file a ticket with the glibc people for a new feature or for making posix_spawn() use a non-vforking clone() call.

Another thing I'm seeing is that the modify_ldt(2) and set_thread_area(2) system calls can be compiled out, and in particular that some container systems do so.

I think you're right. We should file a ticket with the glibc people for a new feature or for making posix_spawn() use a non-vforking clone() call.

I believe I found the reason why they didn't do it completely parallel. Here is a snippet from "sysdeps/unix/sysv/linux/spawni.c":

/* The Linux implementation of posix_spawn{p} uses the clone syscall directly
   with CLONE_VM and CLONE_VFORK flags and an allocated stack.  The new stack
   and start function solves most the vfork limitation (possible parent
   clobber due stack spilling). The remaining issue are:

   1. That no signal handlers must run in child context, to avoid corrupting
      parent's state.
   2. The parent must ensure child's stack freeing.
   3. Child must synchronize with parent to enforce 2. and to possible
      return execv issues.

   The first issue is solved by blocking all signals in child, even
   the NPTL-internal ones (SIGCANCEL and SIGSETXID).  The second and
   third issue is done by a stack allocation in parent, and by using a
   field in struct spawn_args where the child can write an error
   code. CLONE_VFORK ensures that the parent does not run until the
   child has either exec'ed successfully or exited.  */

The parent must free the allocated stack for the child. But the usage interface of posix_spawn() doesn't have a pclose() call like popen() has (which could do clean-up operations). The call to posix_spawn() returns a PID, and we call waitpid() for it in the parent. There is no call to free the allocated stack.

That's why the developers of glibc chose to suspend the parent process until the child has exited. At least that's my version of "why they did it". Do you agree?

@famzah (2) seems silly because it seems like the same problem must come up as to detached threads. Whatever works for detached threads, should work for a posix_spawn() that uses clone() without CLONE_VFORK. The best thing to do is delay cleanup in case we can reuse the stack.

(3) is a bit more likely: how does the parent process know that the stack for that new child is no longer in use because the child exec'ed? This is very tricky! Normally I would use a pipe with O_CLOEXEC set -- when EOF on the read-side, the child exec'ed or exited....

...But that's not appropriate for a library because it's racy: between the point where the pipe2(2) syscall completes and the parent closes the write-side of the pipe after the clone(2) call returns to it, other threads could have called fork(), and so we might still have an open file reference for the write side of the pipe! We might need more help from the kernel to know when the child exec()'ed correctly. Though I can think of other ways to do this without more help from the kernel. For example:

  • the parent could have a socketpair that the client uses to send it the read side of an O_CLOEXEC pipe -- no race conditions here, but more syscalls :(
  • or when the parent wants to reuse the stack it could check /proc/<child>/stat's starttime to see if it has changed (but I don't know if that's process forked time or process exec'ed time), but again: more syscalls :(
  • ahhh! There's a syscall called kcmp(2) since 3.5 (so... a bit new) that you can use to tell if the child is still sharing the address space of the parent. You'd use this when you want to reuse a stack, and atexit().

Still, those two methods should work asynchronously for a library like yours. But the first one won't do for a C library proper because it requires keeping more file descriptors open across library functions. The second one might not work at all. The third one only works on newer kernels. Maybe something with POSIX or SysV semaphores??

So I agree that (3) is probably the real reason posix_spawn() uses vfork() on glibc.

You could still combine threads and vfork() or posix_spawn() to get higher performance... It's not as good as an async mechanism, but hey.

Another method for solving (3) might be to check /proc/CHILD_PID/fdinfo/SENTINEL_FD where the fd is O_CLOEXEC and refers to a temp file created and unlinked by the child whose struct stat is copied into the parent's address space. I the child no longer has this fd or it is no longer for the same file, then you know it has exec'ed. This requires three more syscalls on the child side and none on the parent side until you want to spawn again and reuse a possibly-free stack (then it takes three more syscalls in the parent).

If the child gains or drops privs during or after the exec, using /proc may fail for the parent, but that's evidence of successful exec, so that's OK.

There is one more thing that could be done about (3) though:

  • use a wrapper executable to write(2) to a not-O_CLOEXEC pipe to the parent, then close(2) it

You said using a statically-linked-with-musl wrapper executable meant an 8% slow-down. But if the parent doesn't have to block you might get higher throughput in spite of the higher latency. It's worth considering.

A statically-linked-with-musl wrapper executable is only 8% slower than the assembler wrapper. But the assembler wrapper itself brings 50% slow-down compared to a plain execve(). While I like the idea and it's very much in the UNIX spirit, its performance is not suitable.

The good thing about my library is that it re-implements popen(). The good thing about popen() is that is requires you to call pclose() :-) This call is the place where I free() the allocated stack for the child process.

If you don't have other comments, I'll ask on the glibc development mailing list if it's viable to run the child with its own TLS in parallel with the parent. And also how to set up a separate TLS structure from the allocated stack.

Ah, right. I was focusing on how glibc might solve (3) in posix_spawn(), but for you, all you need is something like avfork(), and if it's just instructions on how to clone(2) with CLONE_SETTLS and without CLONE_VFORK, that's good enough because of your pclose_noshell().

Yes, I think we're done here and this horse has been beaten to death.

You should definitely send email to the glibc-dev list and/or open a ticket. Other libc's might also care.

(I'm disappointed that execve() of a tiny wrapper program is so heavy-duty. One can imagine a kernel-coded posix_spawn() (NetBSD has one!) that uses a wrapper program to perform posix_spawn actions and which has a pre-made process VM template for this to make exec'ing that wrapper really fast.)

When you post to glibc-dev, do, of course, send them a link to this issue. Also, you might want to cc' me -- I would like to follow the thread, and maybe participate.

Hmm, which list exactly would it be? See https://www.gnu.org/software/libc/involved.html. I don't see a dev list as such. Would you send to lib-alpha, or glibc-bugs? I might have to subscribe first.

@nicowilliams, to be honest I think we shouldn't get into the dark zone with the undocumented glibc features, at least for now. I have no good use-case and test-case, nor demand for this optimization.

I've tried the idea with the separate helper thread which calls vfork(). And the results are pretty good. 26% slow-down compared to a pure vfork(); 13% comes just from the pthread_create() + pthread_join().

# direct call to "tiny2" by vfork()
famzah@vbox64:~/svn/github/popen-noshell/performance_tests/wrapper$ ./run-tests.sh 
real    0m4.646s
real    0m4.619s
real    0m4.601s

famzah@vbox64:~/svn/github/popen-noshell/performance_tests/threads$ ./run-tests.sh 
real    0m5.779s
real    0m5.835s
real    0m5.854s

# Only pthread() create + join without vfork():
real    0m0.764s
real    0m0.757s
real    0m0.763s

To my surprise, creating the threads "detached" slows things down! Both using the pthread_attr_t *attr feature to create the threads directly "detached", and using pthread_detach() give the same benchmark results, which are slower than the original code which uses pthread_join():

famzah@vbox64:~/svn/github/popen-noshell/performance_tests/threads$ ./run-tests.sh 
real    0m7.131s
real    0m7.190s
real    0m7.140s

The only "special" condition on my test machine is that it has one virtual CPU core, because this allows me to benchmark multi-threaded apps more easily. Maybe running those tests on a single CPU core could be the reason for this unexpected slow down.

I am definitly not familiar with clone and vfork as you guys do, but perhaps you can consider the use of CLONE_CLEAR_SIGHAND to let the kernel reset signal handle table and disable any sort of signal handle by either user or glibc itself, and uses syscall source code to manully invoke syscalls, so that you wouldn't have to CLONE_VFORK to stop the parent at all and doesn't need to worry about cancellation and what-so-ever internal global state of glibc?

On reuse of stack, IMHO close-on-exec pipe can be used to deal with it:

Create close-on-exec pipe before clone(), and close its read end in parent. Then uses select to wait for POLLIN_SET(actually waiting for EPOLLHUP) or uses poll or epoll to wait for EPOLLHUP.

This can be potentially done on another thread or by the main program itself if it also utilizes polling interface.

popen can also writes error msg to the pipe on error.

Hi NobodyXu. I tried to grasp what you said but I'm too much out of context now. Furthermore, the standard posix_spawn() syscall should already be available on any up-to-date Linux distro, as it have been 4 years since it got merged into the glibc. My tests show that posix_spawn() is as fast as my library. Therefore, developers should switch to posix_spawn() which is maintained mainstream and should be more bug free.

Sorry, I didn't know that you have given up on this topic.:D

Hi famzah

Sorry to bother you again, but I have successfully implemented @nicowilliams 's idea on avfork and archieved a more responsive aspawn (source code, benchmarking is done via google/benchmark):

$ ll -h bench_aspawn_responsiveness.out
-rwxrwxr-x 1 nobodyxu nobodyxu 254K Oct  2 15:02 bench_aspawn_responsiveness.out*

$ uname -a
Linux pop-os 5.4.0-7642-generic #46~1598628707~20.04~040157c-Ubuntu SMP Fri Aug 28 18:02:16 UTC  x86_64 x86_64 x86_64 GNU/Linux

$ ./a.out
2020-10-02T15:02:45+10:00
Running ./bench_aspawn_responsiveness.out
Run on (12 X 4100 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 9216 KiB (x1)
Load Average: 0.31, 0.36, 0.32
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
BM_aspawn_no_reuse              18009 ns        17942 ns        38943
BM_aspawn/threads:1             14500 ns        14446 ns        48339
BM_vfork_with_shared_stack      46545 ns        16554 ns        44027
BM_fork                         54583 ns        54527 ns        12810
BM_posix_spawn                 125061 ns        29091 ns        24483

The column "Time" is measured in terms of system clock, while "CPU" is measured in terms of per-process CPU time.

Intro on aspawn

struct Stack_t {
    void *addr;
    size_t size;
};

typedef int (*aspawn_fn)(void *arg, int wirte_end_fd, void *old_sigset, void *user_data, size_t user_data_len);

/**
 * @return fd of read end of CLOEXEC pipe if success, eitherwise (-errno).
 *
 * aspawn would disable thread cancellation, then it would revert it before return.
 *
 * aspawn would also mask all signals in parent and reset the signal handler in the child process.
 * Before aspawn returns in parent, it would revert the signal mask.
 *
 * In the function fn, you can only use syscall declared in syscall/syscall.h
 * Use of any glibc function or any function that modifies global/thread-local variable is undefined behavior.
 */
int aspawn(pid_t *pid, struct Stack_t *cached_stack, size_t reserved_stack_sz, 
           aspawn_fn fn, void *arg, void *user_data, size_t user_data_len);

By returning the write end of the CLOEXEC pipefd, user of this library is able to receive error message/check whether
the child process has done using cached_stack so that aspawn can reuse cached_stack.

It also allows user to pass arbitary data in the stack via user_data and user_data_len, which get copies onto top of
the stack, thus user does not have to allocate them separately on heap or mistakenly overwriten an object used in child process.

To use a syscall, you need to include syscall/syscall.h, which defines the syscall routine used by the child process including
find_exe, psys_execve and psys_execveat.

User will be able to reuse stack by polling the fd returned by aspawn and wait for it to hup.

Advantages and disadvantages

Compare to posix_spawn, aspawn has 3 advantages:

  • aspawn allows user to do anything in the child process before exec.
  • aspawn can reuse stack, posix_spawn can't;
  • aspawn doesn't block the parent thread;

The only downside being that aspawn_fn has to use syscall/syscall.h.
Other than that, I don't see any downsides to my approach.

Example code can be seen here

I added a reference to your project in the README of "popen-noshell". Keep up the good work!

@famzah Thank you and I will keep on improving it :D

Hi all.

I'm sorry if this is wrong thread for my question, but i'll give a try.

What about if i have popen on old libc, and after that i must to wait input to pipe, catch it, and check if the child done - if not - wait for input again.

Because of blocking of child when pipe fd buffer full.

So, the only thing i can found to check if pipes fd buf have data already - using of ioctl().
But, on quite old systems it returns -1.

How can i hack it?

Thanks.

If you mind blocking, you can simply read on that fd.

If you do mind blocking, you can utilize select or mark the pipe as O_NONBLOCK.