OpenFusionProject/OpenFusion

Sandbox Feature

dongresource opened this issue · 3 comments

These two branches are my proof-of-concept implementations of sandboxing on Linux and OpenBSD, respectively.

https://github.com/dongresource/OpenFusion/tree/seccomp
https://github.com/dongresource/OpenFusion/tree/pledge-unveil

The following design decisions need to be considered before we can think about (refactoring and) merging these changes into master:

  1. The seccomp POC is default-permit; should be default-deny
  2. seccomp filtering could cause problems in the long term if glibc starts using new syscalls for basic things
  3. seccomp can't finely limit filesystem access the way unveil() can
  4. pledge() and unveil() are per-process; seccomp is per-thread
  5. The two sandboxes need to be called at different spots, so they can't really be drop-in replacements for each other
  6. Adding more compile-time options to the Makefile might make it a bit less manageable at this point
  7. Should the sandboxes be toggleable after compile-time? At the command line or the config file?

In detail:

(1) The seccomp-bpf implementation is currently default-permit, which is an inadequate way of filtering things for pretty much any purpose (according to OpenBSD filtering wisdom). There are tons of syscalls that are variations of the common ones and aren't caught by this approach, and more could always be added in future versions of Linux, so we'd always be playing catch-up if we wanted to keep the sandbox solid.

I'd want to invert it before we consider merging it, but the trouble is that since we're sandboxing pretty much the whole program (after init), we'd have to whitelist all the syscalls we indirectly use through the stdlib, the STL, libsqlite, etc. This isn't impossible in the short term, as strace can list all system calls the process uses and then I can just whitelist them all; and the full list of Linux system calls is finite at any given time.
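As a rough illustration of what inverting the filter means, here is a minimal sketch of a default-deny seccomp-bpf program built from an allow list. The function name `build_default_deny` is hypothetical, the architecture check assumes x86_64, and the allow list passed in is a placeholder for whatever strace actually reports. It only builds the filter; it does not install it.

```cpp
#include <cstddef>
#include <vector>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

// Hypothetical helper: builds a default-deny BPF program.
// Every syscall number not listed in `allowed` ends at the final KILL.
// Assumption: x86_64 only; other architectures need their own AUDIT_ARCH_*.
std::vector<sock_filter> build_default_deny(const std::vector<int>& allowed) {
    std::vector<sock_filter> f;

    // Reject foreign architectures first, since syscall numbers differ per ABI.
    f.push_back(BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(seccomp_data, arch)));
    f.push_back(BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0));
    f.push_back(BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS));

    // Load the syscall number into the accumulator.
    f.push_back(BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(seccomp_data, nr)));

    for (int nr : allowed) {
        // On a match, fall through to ALLOW; otherwise skip over it.
        f.push_back(BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, static_cast<__u32>(nr), 0, 1));
        f.push_back(BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW));
    }

    // Default action: anything that was not explicitly allowed is fatal.
    f.push_back(BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS));
    return f;
}
```

A default-permit filter would have the opposite shape: a list of forbidden numbers, each followed by a KILL, and a final ALLOW. That is exactly why new syscall variants slip through it.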

(2) The long term issue, however, is that glibc could switch to using different system calls for common tasks under the hood (like it's already switched open() to openat()), and then the issue of playing catch-up would become one of people's servers all crashing randomly sometime in the future, instead of just not being properly sandboxed. On OpenBSD, pledge() promises are logical groupings of syscalls that automatically include any new ones, because the kernel, libc and pledge() are developed as a single whole. With seccomp we just have to list all the relevant syscalls ourselves and hope for the best.

(3) While seccomp is more flexible and allows limited inspection of syscall args to influence filtering, I don't think it lets you compare string arguments so as to emulate unveil(), so I don't think we can restrict open() with seccomp alone. I'll have to research this some more.

There's also a libseccomp library that abstracts filter generation away slightly, but at a glance it doesn't seem worthwhile enough since it doesn't abstract things all that much. Our project policy is to keep external dependencies to a minimum, and I've already implemented seccomp directly, so I think we can safely skip this one. Will look into it a little more though.

The pledge() & unveil() sandbox for OpenBSD, by contrast, is pretty much complete as-is. The only thing I could change is when it first gets invoked. The intended approach for those two interfaces is to invoke them as soon as possible to get rid of things we aren't going to use; and then successively restrict the process further with additional invocations, getting rid of everything we're done with before entering the main loop.

In our case, that would mean two invocations: one immediately after the config file is parsed, and one after init; just before the main loop. We'd be using the same set of pledge() promises the entire time. As for the unveils:

  • Before init, we'd first unveil the database and gruntwork file read-write; and tdata and sql read-only
  • Then after init is done, we'd re-veil sql and tdata, leaving only the DB and gruntwork (and /dev/urandom for bcrypt)

Though I honestly think that would be excessive since we don't start handling untrusted data until we start accepting connections, so it would just be slightly more complex and incompatible for no reason.
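For reference, the two-stage variant described above could look roughly like the sketch below. The function names, file paths and the promise string are all assumptions for illustration; the real paths would come from the parsed config file and the real promise set from the pledge-unveil branch. On anything other than OpenBSD the functions compile to no-ops.

```cpp
#ifdef __OpenBSD__
#include <unistd.h>

// Hypothetical paths and promises, not taken from the actual branch.
static const char* kDbPath        = "database.db";
static const char* kGruntworkPath = "gruntwork.json";
static const char* kTdataDir      = "tdata";
static const char* kSqlDir        = "sql";
static const char* kPromises      = "stdio rpath wpath cpath inet flock unveil";
#endif

// Stage 1: right after the config file is parsed.
int sandbox_stage1() {
#ifdef __OpenBSD__
    if (unveil(kDbPath, "rwc") == -1) return -1;        // database, read-write
    if (unveil(kGruntworkPath, "rwc") == -1) return -1; // gruntwork, read-write
    if (unveil(kTdataDir, "r") == -1) return -1;        // tdata, read-only
    if (unveil(kSqlDir, "r") == -1) return -1;          // sql, read-only
    if (unveil("/dev/urandom", "r") == -1) return -1;   // needed by bcrypt
    if (pledge(kPromises, nullptr) == -1) return -1;    // same promises throughout
#endif
    return 0;
}

// Stage 2: after init, just before the main loop.
int sandbox_stage2() {
#ifdef __OpenBSD__
    if (unveil(kTdataDir, "") == -1) return -1;    // re-veil: drop all permissions
    if (unveil(kSqlDir, "") == -1) return -1;
    if (unveil(nullptr, nullptr) == -1) return -1; // lock: no further unveil() calls
#endif
    return 0;
}
```

The single-stage alternative would simply fold the stage 2 view (DB, gruntwork and /dev/urandom only) into one call made after init.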

(4) & (5) Even if we don't go that route, the two sandboxes still aren't compatible enough to be compile-time drop-in replacements for each other, because pledge() and unveil() operate on the whole process, while seccomp needs to be initialized for each thread (if we want to use it to forbid socket(), bind() and listen() after the servers init but before they start polling). I'd probably drop the whole -DSECCOMP_SANDBOX=1 and __attribute__((weak)) stubs framework thing I tried on the seccomp branch and just have separate function calls in main.cpp; and put the functions in separate files in src/sandbox/. Do Visual Studio and PE/COFF (.exe) files even support __attribute__((weak))?

If we end up getting rid of multithreading, the incompatibility would be lessened, though I don't think we're likely to touch that any time soon.

Also haven't researched Chromium's Windows sandboxing docs much at all yet. If we decide that's worth implementing for the few people hosting public-facing OF servers on Windows, the API that sandbox exposes to the codebase would likely also be incompatible with the others.

(6) On the earlier note, how should compile-time control of the sandboxes be done? Between the protocol versions, Windows cross compiling, profiling, debug/release, gcc/clang, static/dynamic libsqlite, ASAN, UBSAN, the sandboxes and the other hardening flags we've been discussing, we've accumulated quite a few compile-time options. At this point people usually start generating their Makefiles with hand-written configure scripts, or even worse, Autotools, CMake or whatever else. I guess we've already got CMake, but I don't want to go all-in on it as the build system. Might settle for a more easily user-editable config.mk that the primary Makefile would source (so it could then contain some extra platform-detecting logic and such); or just keep everything as it is.

(7) Semi-related, should we just forgo making the sandboxes compile-time configurable and always compile the proper one for its architecture, and then make them disableable on the command line and/or in the config file? The config file seems like the go-to option, though I've also been thinking of adding rudimentary argument parsing for things like locating the config file in non-default locations, so I could add that in too perhaps. I think MinGW supports getopt(), but does Visual Studio?
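A minimal sketch of what rudimentary getopt()-based argument parsing might look like on POSIX platforms. The option letters (`-c` for the config path, `-n` to disable the sandbox), the `Options` struct and the default path `config.ini` are assumptions made for illustration, not existing OpenFusion flags.

```cpp
#include <string>
#include <unistd.h>

// Hypothetical option set; none of these flags exist in the server today.
struct Options {
    std::string configPath = "config.ini"; // assumed default location
    bool sandbox = true;                   // sandbox stays on unless -n is given
    bool ok = true;                        // false if an unknown option was seen
};

Options parse_args(int argc, char* const argv[]) {
    Options opts;
    optind = 1; // restart scanning, so the function can be called more than once
    opterr = 0; // keep getopt() from printing its own error messages

    int c;
    while ((c = getopt(argc, argv, "c:n")) != -1) {
        switch (c) {
        case 'c': opts.configPath = optarg; break; // -c <path>
        case 'n': opts.sandbox = false; break;     // -n
        default:  opts.ok = false; break;          // unknown option or missing arg
        }
    }
    return opts;
}
```

With something like this, a sandbox setting in the config file could act as the default and the command-line flag as an override.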

In conclusion, my plans for when I get back to working on this are to:

  • Read up some more on all of this; mostly seccomp
  • Implement default-deny seccomp-bpf filtering
  • Integrate the OpenBSD branch into that one

And then if everything looks good we could probably merge all that in.

Points (4) and (5) matter much less now, as apparently there's a flag for the seccomp() syscall that makes it operate on the whole process at once. Demonstrated in this commit: dongresource@180b1be
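The flag in question is presumably SECCOMP_FILTER_FLAG_TSYNC, which synchronizes all threads of the calling process to the same filter. A minimal sketch of installing a filter that way follows; the function name is hypothetical and this is not the code from the linked commit.

```cpp
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical helper: installs `count` BPF instructions as a seccomp filter
// on every thread of the process. Returns 0 on success, -1 on failure.
long install_filter_all_threads(sock_filter* insns, unsigned short count) {
    sock_fprog prog{};
    prog.len = count;
    prog.filter = insns;

    // Unprivileged processes must set no_new_privs before installing a filter.
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;

    // glibc has no seccomp() wrapper, so go through syscall(2).
    // With TSYNC, a positive return value is the ID of a thread that could
    // not be synchronized; treat that as a failure as well.
    long rc = syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                      SECCOMP_FILTER_FLAG_TSYNC, &prog);
    return rc == 0 ? 0 : -1;
}
```

Called once from the main thread after the servers have initialized, this removes the need to set the filter up separately in each thread.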

Still, the design of that part is worth thinking about, to decide whether we should ditch the weak symbols and how a Windows sandbox would fit in, if we end up having one.

The finalized version of the Linux and OpenBSD sandboxes has been merged.

  • The seccomp-bpf sandbox is default-deny and decently constrained
  • The sandboxes are drop-in replacements for each other in terms of API (done purely with the preprocessor, not with weak symbols)
  • Their functionality isn't totally identical, as seccomp-bpf can't limit filesystem access without getting rid of open()/read()/write() and the OpenBSD sandbox doesn't get rid of socket()/bind()/etc, but they're both pretty good nonetheless
  • I've experimented with refactoring the Makefile, but decided to keep it mostly the same for now
  • The sandboxes can be disabled at runtime in the config file, or at compile time by adding -DCONFIG_NOSANDBOX to CXXFLAGS
  • The sandbox needs to be disabled when compiling with ASAN/UBSAN, as those would require too many broad syscalls to be whitelisted
  • A Windows sandbox is not likely to be doable, as it doesn't seem like it would work without separating the server into privileged and unprivileged processes
  • If a server crashes with Bad system call on Linux at some point in the future, that means more syscalls need to be whitelisted
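
To make the "drop-in replacement via the preprocessor" point concrete, here is a minimal sketch of how one API can be backed by different implementations per platform, with both the compile-time and runtime switches. Only the CONFIG_NOSANDBOX macro comes from the description above; the function names and the stub bodies are hypothetical.

```cpp
#include <string>

// One API, selected purely with the preprocessor (no weak symbols).
#if defined(CONFIG_NOSANDBOX)
static const char* sandbox_backend() { return "none"; }
static void sandbox_start() {}
#elif defined(__linux__)
static const char* sandbox_backend() { return "seccomp-bpf"; }
static void sandbox_start() { /* stub: would install the seccomp-bpf filter */ }
#elif defined(__OpenBSD__)
static const char* sandbox_backend() { return "pledge+unveil"; }
static void sandbox_start() { /* stub: would call unveil() and pledge() */ }
#else
static const char* sandbox_backend() { return "none"; }
static void sandbox_start() {}
#endif

// Runtime toggle: `enabledInConfig` stands in for the config file setting.
// Returns true only if a sandbox backend was actually started.
bool start_sandbox_if_enabled(bool enabledInConfig) {
    if (!enabledInConfig)
        return false;
    if (std::string(sandbox_backend()) == "none")
        return false;
    sandbox_start();
    return true;
}
```

The caller in main.cpp only ever sees `start_sandbox_if_enabled()`, so building with -DCONFIG_NOSANDBOX (for ASAN/UBSAN builds, say) needs no changes at the call site.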