servo/gaol

Linux sandbox review

Opened this issue · 20 comments

Linux sandbox review

seccomp.rs:

#[cfg(target_arch="x86")]
const ARCH_NR: u32 = AUDIT_ARCH_X86;
#[cfg(target_arch="x86_64")]
const ARCH_NR: u32 = AUDIT_ARCH_X86_64;
#[cfg(target_arch="arm")]
const ARCH_NR: u32 = AUDIT_ARCH_ARM;
// ...
const __AUDIT_ARCH_64BIT: u32 = 0x8000_0000;
const __AUDIT_ARCH_LE: u32 = 0x4000_0000;
const AUDIT_ARCH_X86_64: u32 = EM_X86_64 | __AUDIT_ARCH_64BIT | __AUDIT_ARCH_LE;

Can you explain what these are for?

This needs a comment:

/// Syscalls that are always allowed.
static ALLOWED_SYSCALLS: [u32; 22] = [
    ...
    NR_unknown_318,
}

I'd like to whitelist advice values for madvise. Linux has a habit of adding weird features to madvise that are not really "memory advice".

A quasiquote macro for BPF programs would be pretty sweet ^_^

Can we whitelist socket ops by address family, too? You can do a lot of weird stuff with sockets — DBus, kernel crypto APIs, and even more obscure things where kernel 0days are likely to hide.

I'd also like a test that tries every forbidden syscall number, to guard against (some) bugs in the BPF code.

Is there a BPF disassembler we could use to look at some of the generated programs?

Can we whitelist socket ops by address family, too?

nm, I see you're doing that already:

// Only allow Unix, IPv4, IPv6, and netlink route sockets to be created.

Why do we need to allow route sockets? Are they used by unprivileged processes to query the routing table?

let src = CString::from_slice(b"tmpfs");
...
let tmpfs = CString::from_slice(b"tmpfs");

These could be the same variable, right?

mount(src.as_ptr(), dest.as_ptr(), tmpfs.as_ptr(), MS_NOATIME, ptr::null())

How about MS_NOATIME | MS_NODEV | MS_NOEXEC | MS_NOSUID just for good measure? Some of those could cause problems, though.

Is there code to close unwanted file descriptors before entering the sandbox? It would be good to iterate over all fd's in case some random library has opened something.

Everything else in gaol/platform/linux looks good to me.

Oh, do we want to set a umask?

prctl(PR_SET_DUMPABLE, 0) seems like another miscellaneous safeguard that could be slightly useful.

@kmcallister Yeah, netlink route is used for stuff like gethostbyname, unfortunately.

@kmcallister I don't actually know why we have to validate architectures (since it seems obvious that you can only call syscalls on the same architecture you're on?) but Chromium does it so I did it too.

@kmcallister It's actually important on some OS's to keep file descriptors open when entering the sandbox. Most notably, Chromium does this on Windows, because some OS APIs (e.g. fonts, IIRC) have to "warm up" by opening some handles, but which handles those are vary from OS release to OS release, so you can't really whitelist them portably.

@kmcallister Comments addressed. Here's a disassembly of the BPF for a restrictive filter: https://gist.github.com/anonymous/6988b4b8456678cb58f6

Firejail could be a source of inspiration.