mgoltzsche/ctnr

Simplify CNI plugin/bridge usage for unprivileged users

Opened this issue · 2 comments

User-mode networking is now supported within a user namespace. However, to allow containers to communicate with each other, nested containers currently need to be created, which makes for poor usability in most use cases.

Instead, users should be able to set up all containers from within the host namespace and bridge them in a shared namespace with a single ctnr command.

Option 1: Move container execution into another, implicit, shared container

a) An implicit shared container with its own file system (based on a stage image containing the plugin binaries, as in rkt) is maintained to run nested containers.
Problem 1: if the shared container does not inherit the host's file system, external file references cannot be resolved.
It is also not possible to mount them into the outer container, since that would require a container update, which would break other containers already started there.
Problem 2: the container is no longer visible on the host - only the outer container is. On the plus side, terminating the outer container also terminates its children (although they may leak kernel resources when not properly terminated/unmounted).

b) Alternatively, the outer container could inherit the host's file system with only minimal isolation (userns, netns, mountns) to avoid breaking external file references and to keep containers visible on the host.
Problem 1: plugin binaries cannot be provided.
=> The stage image's file contents could be mounted over the rootfs inside the outer container, providing all required plugin binaries.
Problem 2: child containers on the host cannot be associated with their parent. When a container is terminated on the host, the bridge plugin would not be able to clean up the veth because it is no longer run within the outer container's netns - which cannot be enforced using plain runc.
=> Child containers could be mapped to the outer container by writing their state into a separate directory and using a naming convention. The user would thus be made aware of the container hierarchy and deal with it explicitly, which is probably not a bad idea and can be simplified in high-level tooling/compose. To make containers communicate with each other, the user would first create a parent container/pod and then add containers or nested pods to it.

c) Another alternative that would provide both the necessary bridge namespace and the OCI/CNI plugin binaries would be to mount the host's file system into a subdirectory of the container and rewrite file references accordingly. Unfortunately, the subdirectory prefix may still show up in error messages presented to the user.

EDIT:
In general, this option also supports running multiple containers as a pod. The outer/pod container would define the network, and the child/app containers would share the outer container's network namespace.
This needs to be done for pods anyway. Once it is done, communication between pods can be achieved using another outer container - the same problem, one layer higher. Container hierarchies should become part of the ctnr design, as 1b shows.

Option 2: Move OCI CNI network hook execution into other namespace

The OCI network hook would be configured with the namespaces it should enter before CNI execution.
Thus it could ensure that network deletion/termination is initiated from within the same namespace.
The container would remain visible and easily controllable on the host.
The functionality could also be used with plain runc or other OCI runtimes.
Other CNI plugins could benefit from this approach as well: for instance, the existing portmap plugin could then be used, too.

On the other hand, it may limit CNI plugin capabilities, since it can only be applied to ALL plugins:
for instance, a plugin is planned that forwards/proxies ports from the host to the container's netns using socat or similar.
That plugin would require explicit configuration of the namespace to forward ports to, in addition to the container's namespace.
This would be acceptable, since the plugin could then be used to connect any two namespaces this way.

Where/how should the userns/netns be created/looked up from within the OCI hook?
Create/join a namespace by name dynamically? When should it be removed?
=> Make the OCI hook join an existing namespace only (tooling on top of runc must create/provide the namespace and remove it when it is no longer needed).
=> Make ctnr create/garbage-collect a container representing the namespace dynamically.
=> The stage container is still required to obtain the hook/plugin binaries. Back to option 1?
=> These two features should be separated, because one is about file system dependencies and the other about network namespaces, and users may want to combine them independently of each other.

Option 3: Extend bridge plugin to bridge custom userns/netns with container

The bridge plugin would get an additional parameter specifying the namespaces to bridge to instead of the current namespace.
Thus it could ensure that network deletion/termination is triggered from within the same namespace.
The container would remain visible and easily controllable on the host.
A plugin mapping ports using socat would not need additional configuration of the namespace to map ports to (except when it should work with a completely different namespace).

=> Problem: the netns configuration is dynamic. If it were part of a static plugin configuration, a dependency on the container engine (no!) or another mapping would have to be introduced.
=> Provide the userns/netns - or rather the process PID - in the runtime config, as the portmap plugin does.
=> Problem: it is unclear to ctnr when to create the custom netns, since it does not interpret CNI network configurations - it would either always need to create one to be sure, or
=> require an additional parameter.
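To make the runtime-config idea concrete, a network configuration could declare the target namespace as a capability so that the engine injects the dynamic value at ADD/DEL time, analogous to the portmap plugin's portMappings capability. The `targetNetns` capability name below is hypothetical, not an existing bridge plugin feature:

```json
{
  "cniVersion": "0.3.1",
  "name": "ctnr-net",
  "type": "bridge",
  "bridge": "ctnr0",
  "isGateway": true,
  "capabilities": { "targetNetns": true },
  "ipam": { "type": "host-local", "subnet": "10.1.0.0/24" }
}
```

The engine would then pass something like `"runtimeConfig": {"targetNetns": "/proc/<pid>/ns/net"}` to the plugin on each invocation, keeping the static file free of host-specific paths.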

Option 4: Create new CNI plugin that enters custom netns and executes nested plugins there

=> Provides the most flexibility.
=> Problems: same as in option 3.
=> The plugin should also manage the slirped shared network namespace itself: the namespace should have a simple host-independent name so that it can be configured statically within a CNI JSON file. The plugin should map the provided name to a namespace persistently. It should also keep track of the namespace's users, terminate the slirp4netns process, and destroy the shared namespace as soon as no container uses it anymore. This would also allow usage in other contexts, for instance with plain runc without ctnr, and it decouples the network logic from the rest.
Problem: the plugin would require a lot of the container engine's functionality!
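A static configuration for such a delegating plugin could look like the sketch below. The plugin type `netns-delegate` and its fields are invented for illustration; the idea is that the host-independent `netnsName` is resolved to a persistent namespace (backed by slirp4netns) and the nested plugin is executed inside it:

```json
{
  "cniVersion": "0.3.1",
  "name": "shared-slirp",
  "type": "netns-delegate",
  "netnsName": "ctnr-shared",
  "delegate": {
    "type": "bridge",
    "bridge": "ctnr0",
    "isGateway": true,
    "ipam": { "type": "host-local", "subnet": "10.1.0.0/24" }
  }
}
```

On the first ADD for a given `netnsName` the plugin would create the namespace and start slirp4netns for it; on the last DEL it would stop slirp4netns and destroy the namespace, which is exactly the engine-like bookkeeping noted as a problem above.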

Result so far

Actually, three features were mixed here (especially in option 1):

  • Support bridging multiple containers into a shared namespace.
  • Provide additional binaries to create a container from a stage image, as in rkt.
  • Support pods.

EDIT:
This issue is about bridging containers, but the pod feature could also solve it. Currently I'd go with option 1b.

@AkihiroSuda please have a quick look over this. I may be missing something in my considerations.

Thanks for using slirp4netns.

Option 1 seems good, but I may change my mind later.

Related: containers/podman#1733 cc @giuseppe

I realized option 1 would also allow usage of overlayfs / containers/storage ...
EDIT: but obviously the user.rootlesscontainers xattr still needs to be mapped, and apparently overlayfs is not available to unprivileged users in a userns on every Linux distribution (see the state of the art of rootless containers and fuse-overlayfs).