# Deprecated

Please note that the CGManager project has been deprecated in favor of using the kernel's CGroup Namespace or lxcfs' simulated cgroupfs. See https://s3hh.wordpress.com/2016/06/18/whither-cgmanager/ for details.

=== Intro ===

This is a motivation, description and explanation of the cgmanager design. The original design RFC was described here:

  http://lwn.net/Articles/575672/
  http://lwn.net/Articles/575683/

Much of it still holds (and is cut-and-pasted, though edited, here).

=== Cgmanager Design ===

One of the driving goals is to enable nested lxc as simply and safely as possible. If this project is a success, then a large chunk of code can be removed from lxc. I'm considering this project a part of the larger lxc project, but given how central it is to systems management, that doesn't mean that I'll consider anyone else's needs as less important than our own.

This document consists of two parts. The first describes how I intend the daemon (cgmanager) to be structured and how it will enforce the safety requirements. The second describes the commands which clients will be able to send to the manager. The list of controller keys which can be set is very incomplete at this point, serving mainly to show the approach I was thinking of taking.

=== Summary ===

Each 'host' (identified by a separate instance of the Linux kernel) has exactly one running daemon to manage control groups. This daemon answers cgroup management requests over a D-Bus socket, located at /sys/fs/cgroup/cgmanager/sock. The /sys/fs/cgroup/cgmanager directory can be bind-mounted into various containers, so that one daemon can support the whole system. (Bind-mounting the directory rather than the socket itself allows a container to proceed if the cgmanager is restarted, creating a new socket.)

Outline:

  . A single manager, cgmanager, is started on the host, very early during boot. It has very few dependencies, and requires only /proc, /run, and /sys to be mounted, with /etc read-only.
    It mounts the cgroup hierarchies in a private namespace and sets defaults for clone_children and use_hierarchy. It opens a Unix socket at /sys/fs/cgroup/cgmanager/sock.
  . A client (requestor 'r') can make cgroup requests over /sys/fs/cgroup/cgmanager/sock using D-Bus calls. Detailed privilege requirements for r are listed below.
  . The client request will pertain to an existing or new cgroup A. r's privilege over the cgroup must be checked. r is said to have privilege over A if A is owned by r's uid, or if A's owner is mapped into r's user namespace and r is root in that user namespace.
  . The client request may pertain to a victim task v, which may be moved to a new cgroup. In that case r's privilege over both the cgroup and v must be checked. r is said to have privilege over v if v is mapped in r's pid namespace, v's uid is mapped into r's user namespace, and r is root in its user namespace; or if r and v have the same uid and v is mapped in r's pid namespace.
  . r's credentials will be taken from the socket's peercred, ensuring that pid and uid are translated.
  . A request to chown a cgroup requires a uid U and gid G.
  . If r is in the same pid and user namespaces as the cgmanager, then v, U and G can be passed as integer arguments over the D-Bus requests.
  . If r is not in the same namespaces as the cgmanager, then v, U and G must be passed as SCM_CREDENTIALs so that the cgmanager receives the translated global pid/uid/gid. Since D-Bus does not support sending SCM_CREDENTIALs as part of a D-Bus message, the D-Bus arguments include a file descriptor. The SCM_CREDENTIALs are sent over the file descriptor after the D-Bus transaction completes, and the final result is sent over the same file descriptor.
  . It is desirable that all transactions can be accomplished with simple D-Bus transactions. Therefore a cgroup manager proxy (cgproxy) is provided.
    This will move /sys/fs/cgroup/cgmanager to /sys/fs/cgroup/cgmanager.lower, then serve as a proxy translating D-Bus requests received on /sys/fs/cgroup/cgmanager/sock into SCM-enhanced D-Bus requests on /sys/fs/cgroup/cgmanager.lower/sock.
  . In plain D-Bus transactions, the requestor r's credentials are read from the socket.
  . In SCM-enhanced D-Bus transactions, the proxy p's credentials are read from the socket. The requestor's credential is sent as an SCM_CREDENTIAL.

Privilege requirements by action:

  * The requestor of an action (r) over a socket may only make changes to cgroups over which it has privilege.
  * Requestors may be limited to a certain number/depth of cgroups (to limit memory usage). This is not yet implemented.
  * The cgroup hierarchy is responsible for resource limits. To this end, a request to chown cgroup A to uid U will only chown the directory itself (allowing child cgroup creation) and the tasks and cgroup.procs files.
  * A requestor must either be uid 0 in its userns with the victim mapped into its userns, or have the same uid as the victim and be in the same or an ancestor pidns of the victim.
  * If r requests creation of cgroup '/x', /x will be interpreted as relative to r's cgroup. r cannot make changes to cgroups not under its own current cgroup.
  * Root in the cgmanager's pid namespace may 'escape' to the cgmanager's cgroup with a special MovePidAbs command.
  * A proxy may move a task over which it has privilege to the proxy's own cgroup. This allows the proxy to mimic the cgmanager's special root-may-escape semantics in its own container.
  * If r requests creation of cgroup '/x', it must have write access to its own cgroup.
  * If r requests setting a limit under /x, then
    . either r must be root in its own userns and UID(/x) be mapped into its userns, or else UID(r) == UID(/x)
    . /x must not be / (not strictly necessary, all users know to ensure an extra cgroup layer above '/')
    . setns(UIDNS(r)) would not work, due to in-kernel capable() checks which won't be satisfied.
      Therefore we'll need to do privilege checks ourselves, then perform the write as the host root user (see devices.allow/deny). Further, we need to support older kernels which don't support setns for pid.

Types of requests:

  * r requests creating cgroup A'/A
    . lmctfy/cli/commands/create.cc
    . Verify that UID(r) is mapped to 0 in r's userns
    . R=cgroup_of(r)
    . Verify that UID(R) is mapped into r's userns
    . Create R/A'/A
    . chown R/A'/A to UID(r)
  * r requests to move task x to cgroup A.
    . lmctfy/cli/commands/enter.cc
    . r must send PID(x) as ancillary message
    . Verify that UID(r) is mapped to 0 in r's userns, and UID(x) is mapped into that userns (is it safe to allow if UID(x) == UID(r)?)
    . R=cgroup_of(r)
    . Verify that R/A is owned by UID(r) or UID(x)? (not sure that's needed)
    . echo PID(x) >> /R/A/tasks
  * r requests chown of cgroup A to uid X
    . X is passed in ancillary message
      * ensures it is valid in r's userns
      * maps the userid to host for us
    . Verify that UID(r) is mapped to 0 in r's userns
    . R=cgroup_of(r)
    . Chown R/A to X
  * r requests setting cgroup A's 'property=value'
    . Verify that either
      * A != ''
      * UID(r) == 0 on host
      In other words, r in a userns may not set root cgroup settings.
    . Verify that UID(r) is mapped to 0 in r's userns
    . R=cgroup_of(r)
    . Set property=value for R/A
      * Expect kernel to guarantee hierarchical constraints
  * r requests deletion of cgroup A
    . lmctfy/cli/commands/destroy.cc (without -f)
    . same requirements as setting 'property=value'
  * r requests purge of cgroup A
    . lmctfy/cli/commands/destroy.cc (with -f)
    . same requirements as setting 'property=value'

Long-term we will want the cgroup manager to become more intelligent - to place its own limits on clients, to address cpu and device hotplug, etc. Since we will not be doing that in the first prototype, the daemon will not keep any state about the clients.

=== Another look at the safety of requests ===

Notes:

  1. In a plain D-Bus call, the proxy is the requestor.
  2.
     If a client does an SCM call to the cgmanager socket, then the proxy is the requestor.
  3. In any call over a proxy, the proxy won't be able to make changes outside its own cgroups. If it misbehaves, the damage is contained: it can only damage itself.
  4. Chained proxying is not supported. If a proxy gets a request where proxy != requestor, the call is rejected.
  5. The identity of the proxy (which may be the requestor) cannot be forged; it is taken from the socket credential. A more privileged user must not allow a less privileged task to have access to the opened D-Bus socket, as the credential will be that at the time of connect().

On newer kernels, cgmanager can tell whether a proxy or requestor is in the same namespace as itself. On older kernels, it cannot.

  . for Create, this is ok. We have the proxy's real pid and can constrain create under its cgroup.
  . for getPidCgroup, we can ensure that only results under the parent's cgroup are returned. We can NOT ensure that results will make sense for plain D-Bus calls, as we cannot guarantee that the proxy is in the same ns as cgmanager. However, this is not unsafe. When we can and do detect that p is in a different pid namespace, we reject the call, because the result cannot be sensible.
  . for chmod: we constrain under the proxy's cgroup, so this is safe.
  . for chown: on older kernels we cannot guarantee that the uid/gid make sense on the host. However:
      . root on the root host: no translation necessary
      . root in a non-user-ns container: no translation necessary
      . root in an unprivileged container: won't have privilege to do any chown without going through a proxy.
    Therefore rejecting calls from another namespace is not necessary. The worst that can happen is that root in an unprivileged container gets -EPERM for calls that would otherwise have been allowed.
  . movepid:
      . root on root host: fine
      . root in a non-user-ns container: we can only ensure that the victim is under the proxy's cgroup.
        If that is the case, then root (which is also root on the host) is allowed to move the task. When we can and do detect a different pid namespace, we reject the call because the results cannot make sense.
  . MovePidAbs: on an older kernel, or if the task is in a different namespace, this requires a proxy. The cgmanager will only allow escaping up to the level of the proxy.
      . root on root host: allowed to escape.
      . root in a non-user-ns container: allowed to escape up to the proxy's level. If the host misconfigures the container so that the host's proxy is in the container, then root can escape completely.
      . if root tries to mimic a proxy, then it can only escape to the proxy's level, which is its own. So it cannot escape at all.
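
The confinement rule that the safety notes above keep returning to (a requested cgroup path is interpreted relative to the requestor's cgroup, and MovePidAbs may escape only as far as the proxy's cgroup) can be illustrated with a minimal Python sketch. This is not cgmanager code: the function names resolve_under and target_cgroup, and the idea of passing cgroup paths around as plain strings, are assumptions made purely for illustration.

```python
import posixpath


def resolve_under(base: str, requested: str) -> str:
    """Interpret `requested` relative to `base`; refuse paths that leave `base`.

    Hypothetical helper, not taken from cgmanager.
    """
    base = posixpath.normpath("/" + base.strip("/"))
    # A leading '/' in the request means "relative to the base cgroup",
    # not "relative to the host's root cgroup".
    candidate = posixpath.normpath(posixpath.join(base, requested.lstrip("/")))
    inside = candidate == base or candidate.startswith(base.rstrip("/") + "/")
    if not inside:
        raise PermissionError(f"{requested!r} escapes {base!r}")
    return candidate


def target_cgroup(requestor_cgroup: str, proxy_cgroup: str,
                  requested: str, absolute: bool = False) -> str:
    """Pick the base cgroup for a request.

    Ordinary requests are confined to the requestor's own cgroup. An
    'absolute' request (modelled loosely on MovePidAbs) is confined to the
    proxy's cgroup instead, so it can escape up to the proxy's level but
    no further. Hypothetical helper, not taken from cgmanager.
    """
    base = proxy_cgroup if absolute else requestor_cgroup
    return resolve_under(base, requested)
```

For a requestor in /lxc/c1/user behind a proxy in /lxc/c1, an ordinary request for '/other' resolves under /lxc/c1/user, while the same request with absolute=True resolves under /lxc/c1. A request such as '../c2' raises PermissionError in either mode once it would leave the chosen base.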