containerd/containerd

Proposal: Sandbox API

mxpv opened this issue · 27 comments

mxpv commented

I’d like to bring up a discussion about Sandbox API in containerd.

We can't deny the growing popularity of containers in various flavors, such as Pods (i.e. a group of containers with shared namespaces) or secure environments (aka microVMs), like firecracker-containerd, EKS on Fargate, or Kata Containers.

At the same time, we need a path forward for higher-level entities (container managers and orchestrators) to be able to run these container extensions transparently.

Today there is no defined way to do this, so everyone has to build their own solution and solve the same problems (how to manage the microVM lifecycle? how to pass resource requirements? how to keep it orchestrator agnostic?).

This proposal introduces the notion of a group of containers in containerd - a “sandbox”. It aims to add low-level lifecycle and resource management capabilities for containers that run inside some environment (where "some" is defined and implemented by a client).

The sandbox concept has the following properties in relation to containers it hosts:

  • The sandbox acts as a parent entity for its containers, i.e. it starts first and ends last (this is typically useful in (micro)VM environments, where the VM must be started before any other entities).
  • The sandbox acquires the resources needed to run its child containers (for instance, Kubernetes creates "pause" containers to acquire an IP and a network namespace for the child containers).

The API aims to be implementation and orchestrator independent, to stay as low level as possible, and to introduce no dependencies on other Go packages.

It adds one more proxy plugin type - "sandbox" - which implements a Controller interface (similarly to snapshotters). The Controller is responsible for the platform-specific sandbox implementation: it knows how to create/start/stop/delete a sandbox instance, check its lifetime, report its status, gather metadata, etc. containerd keeps track of running instances (in the metadata store), generates the proper lifecycle events, and forwards client calls to the appropriate proxy plugin. Clients can manage sandboxes via the client API (example).
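For illustration only, here is a rough Go sketch of what such a Controller interface could look like; the types, method set, and signatures below are assumptions for this example, not the final API:

// Sketch of a possible sandbox Controller shape (illustrative, not the real API).
package sandbox

import (
	"context"
	"time"
)

// Sandbox is the metadata containerd could persist for a sandbox instance.
type Sandbox struct {
	ID         string
	Labels     map[string]string
	CreatedAt  time.Time
	Extensions map[string][]byte // implementation-specific configuration
}

// Status reports the current state of a sandbox instance.
type Status struct {
	ID    string
	State string // e.g. "created", "running", "stopped"
}

// Controller is what a sandbox proxy plugin would implement (similar in spirit
// to how snapshotter plugins implement the Snapshotter interface).
type Controller interface {
	Create(ctx context.Context, info Sandbox) error
	Start(ctx context.Context, id string) error
	Stop(ctx context.Context, id string) error
	Delete(ctx context.Context, id string) error
	Status(ctx context.Context, id string) (Status, error)
}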

From the orchestrator's perspective, a “bridge” is required that translates orchestrator calls into sandbox API calls (e.g. cri-containerd, but implementation agnostic); the sandbox controllers themselves remain interchangeable.

[diagram: current CRI flow (left) vs. proposed sandbox architecture (right)]

Currently the sandbox API is implemented in this fork (master...mxpv:sandbox) and exists as a proof of concept in the sandbox branch of the firecracker-containerd repo.

cc: @samuelkarp @micahhausler @egernst

Also there is a good Twitter discussion about Kubernetes / Firecracker challenges working together: https://twitter.com/micahhausler/status/1238496944684597248

Seems like it should fit into containerd.

Is sandbox going to be the official name/term for what this is?

mxpv commented

Sandbox sounds good to me unless there are other proposals to discuss.

so with this API, do you need a way to Add something to the sandbox? how would you associate resources with it?

Is this expected to be in v1.4 or in v1.5?

mxpv commented

@crosbymichael Generally we'd want sandbox implementations to be responsible for resource allocation, with clients just specifying the desired state.

Due to the variety of resource types that might be needed by different sandbox implementations, it would be challenging to define a generic API for them. So containerd doesn't manage resources directly; rather, it manages sandbox instances, which manage the needed resources internally.

So from resource perspective there are 3 major API endpoints:

  • Create, where sandbox implementations may acquire platform-specific resources (and create a sandbox instance).
  • Update, where resources can be reallocated - for instance, to resize a microVM.
  • Delete, for cleanup.

To specify the needed resources (desired state), we can use either the Spec field (for generic ones, similarly to the runtime spec) or Extensions (which allow specifying sandbox-specific configuration in a generic way).
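As a hedged illustration of the Spec vs. Extensions split, here is a minimal sketch; the request shape and field names below are hypothetical and only mirror the idea described above:

// Illustrative only: a hypothetical create request showing where generic,
// runtime-spec-like desired state (Spec) and implementation-specific
// configuration (Extensions) would live.
package main

import (
	"encoding/json"
	"fmt"
)

type CreateSandboxRequest struct {
	ID         string
	Spec       json.RawMessage   // generic desired state, similar to a runtime spec
	Extensions map[string][]byte // opaque, sandbox-implementation-specific settings
}

func main() {
	// Generic resources (understood by any sandbox implementation) go into Spec.
	spec, _ := json.Marshal(map[string]interface{}{
		"resources": map[string]string{"memory": "2Gi", "cpus": "2"},
	})

	// Implementation-specific knobs (e.g. a microVM kernel path) travel in Extensions.
	req := CreateSandboxRequest{
		ID:   "sandbox-1",
		Spec: spec,
		Extensions: map[string][]byte{
			"vm.kernel": []byte("/var/lib/kata/kernel"),
		},
	}
	fmt.Printf("create %s with %d extension(s)\n", req.ID, len(req.Extensions))
}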

mxpv commented

Is this expected to be in v1.4 or in v1.5?

@AkihiroSuda 1.4 looks feature complete and is close to beta. I'd prefer to have this in 1.5, so we have more time for polishing.

@mxpv Excellent proposal! This is going to be very useful for kata containers as well. I have one question, how is the Update sandbox API going to be called? Is there any plan to wire it up to the CRI interface (which has UpdateContainer but no UpdateSandbox)?

mxpv commented

Since CRI was designed more around containers, there is no right place to put Sandbox.Update yet, but this API is a good place to start.

Specifically, UpdateContainerResourcesRequest targets container instances, not Pods, so it's unlikely we can apply it at the sandbox level (since it's optional, we can't just sum it up across all containers).

In the longer term, with better support from CRI, Update could be used, for instance, to inject volumes (secrets, configMaps, NFS, etc.) or to resize a sandbox instance in order to limit Pod-level resource consumption.

do you have an idea on how higher levels will interact with this new api? will cri-containerd need to be updated to take advantage of these APIs to replace that pause container?

mxpv commented

Right, cri-containerd would need to be updated to utilize the sandbox API, while the "pause" container logic moves to a "sandbox" plugin (it becomes one variant of a sandbox implementation). Sandbox plugins can replace one another or be used side by side (so we specify a "name", similarly to snapshotters). This way all sandbox implementations work with Kubernetes (see the "Proposed" diagram above).
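For context, the snapshotter analogy refers to how today's containerd client already selects a snapshotter implementation by name; under the proposal a CRI implementation would select a sandbox plugin ("podsandbox", "kata", "firecracker", ...) in the same fashion. Below is a small example of the existing pattern being mirrored - it uses only current containerd client calls and contains no sandbox API, since that does not exist yet:

// Existing containerd client code: the snapshotter is selected by name.
// The proposal would let a sandbox implementation be selected the same way.
package main

import (
	"context"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ctx := namespaces.WithNamespace(context.Background(), "example")

	img, err := client.Pull(ctx, "docker.io/library/alpine:latest", containerd.WithPullUnpack)
	if err != nil {
		log.Fatal(err)
	}

	// "overlayfs" picks the snapshotter; a sandbox name would be passed analogously.
	ctr, err := client.NewContainer(ctx, "demo",
		containerd.WithSnapshotter("overlayfs"),
		containerd.WithNewSnapshot("demo-snapshot", img),
		containerd.WithNewSpec(),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer ctr.Delete(ctx, containerd.WithSnapshotCleanup)
}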

Sounds good to me. thanks

What's the fundamental difference from CRI? Would it make sense to just enhance CRI? CRI does not need to be k8s specific, I personally would love to see CRI expanded beyond only the needs of k8s.

mxpv commented

@ibuildthecloud Let's start from the high-level requirement here: "we're looking for a way to run microVM-based containers on Kubernetes (1) and/or via containerd (2)". We can do both together (via cri-containerd, which implements CRI and uses containerd) if we want the containerd runtime, or they can be addressed independently (like firecracker-containerd does today; it's purely a containerd extension).

Would it make sense to just enhance CRI?

This relates to (1). Yes, we still need to enhance CRI either way for better microVM support, because CRI was designed mainly around the "container" concept (as a random example: so that RunPodSandbox can specify resource requirements, allowing us to properly configure VMs at launch). We need that regardless of the runtime used underneath. Essentially, to answer your question: yes, you could just maintain separate implementations of an (enhanced) CRI for different purposes (like cri-containerd to run regular containers, firecracker-cri to run containers inside Firecracker, kata-cri to run VMs, etc.). But because we already have a CRI implementation for containerd, we can extend containerd to support "sandboxes" (where "sandbox" can be a Pod, microVM, VM, etc.) and keep one CRI implementation.

CRI does not need to be k8s specific, I personally would love to see CRI expanded beyond only the needs of k8s.

CRI is a Kubernetes-specific concept: it's what you'd need to implement in order to make the kubelet support custom runtime implementations in Kubernetes. The Kubernetes team is very careful about API changes and backward compatibility, so it would be challenging to use CRI for something beyond Kubernetes.

What's the fundamental difference from CRI?

The same difference as between RuntimeService.CreateContainer and containerd.NewContainer :) It's not about differences or similarities, but about levels of abstraction. containerd provides low-level building blocks for running containers in a highly extensible way, while Kubernetes expresses its requirements for container runtimes via the CRI interface. The proposal is to add another building block - a sandbox abstraction - and use it from CRI. This way, whatever fully supports the containerd API will be able to run both regular containers and containers inside microVMs. So (ideally) we'd get Kubernetes support by just replacing one of containerd's components.

I like the proposed idea of hosting pluggable sandbox types. The design concept shown in the right half of the diagram makes good sense. The left portion isn't really how CRI works today, or is at least an oversimplification (probably to save space :-)). So I'll just ramble a bit.

As stated, CRI includes a sandbox API in the runtime service APIs, and that API group is already implemented by our code. The "notion of a group of containers in containerd - a sandbox that aims to add low-level lifecycle and resource management capabilities for containers that run inside of some environment, where some is defined and implemented by a client" you discussed... that is what k8s pods/sandboxes are shooting for, and containerd plus our container runtime integration (CRI) is supposed to meet that client requirement. Can it be better? Yes: it does not have an easy way to replace the sandbox implementation, and it sits at the CRI interface, so it carries a lot of baggage you won't want when creating other types of groupings of containers.

for instance, Kubernetes creates "pause" containers to acquire an IP and a network namespace for the child containers ...

I suppose one could see it that way logically, but all that stuff is done by the CRI implementation with the aid of system services, CNI, etc. We do that stuff in containerd/cri when running pod/sandboxes and containers. The pause container is by and large just a process to hold the resources shared across the pod/sandbox. That process could be hosted any number of other ways.

The sandbox API in CRI was never meant to be just for docker containers, but that certainly was the focus, which is why the VM teams encountered issues. The sandbox API in CRI was always meant to cover VM runtimes, applications, compute resources... What may confuse/conflate the issue is that the k8s team went back and forth on naming (mostly "pod" in docs but "sandbox" in the code), and eventually, after much discussion, it was agreed to call them podsandboxes, thus the podsandbox API.

...
Back to this design idea. Absolutely, if we had support for pluggable sandbox types abstracted at the containerd level then we could refactor the existing containerd/cri/pkg/sandbox*.go sources to host a podsandbox sandbox.

The solution at the CRI layer for supporting alternative container runtimes was to add runtimes support. We could go a similar route, adding sandbox-switching support in the pod spec. We could use new/existing annotations to prototype any such pod spec changes.

Is this still planned for 1.5?

mxpv commented

For 1.6

Any MR for this proposal? It may help to support confidential computing too.
kata-containers/kata-containers#149
kata-containers/kata-containers#1332

Hi, I think with the sandbox plugin enabled, the "sandbox" will be a first-class citizen in containerd.
The RunPodSandbox CRI API will call the sandbox plugin to start a sandbox, and we won't need the "sandbox container" any more.
When the sandbox is started, it will publish its shimv2 task API through UDS or vsock (or TCP, or anything else).

For Kata, the architecture with the sandbox plugin looks like this:
[diagram: Kata architecture with the sandbox plugin]

For runc, it looks like this:
[diagram: runc architecture with the sandbox plugin]

I have made a PoC project for Kata with this new architecture, based on a modification of @mxpv's draft. The container can already be started, and for Kata containers there is no longer any shim process on the host: the kata-agent in the VM serves the shimv2 task API, and containerd's "tasks" plugin can call the task API through a vsock address, which is obtained from the sandbox plugin.

[root@localhost feng]# crictl ps
CONTAINER           IMAGE                                             CREATED             STATE               NAME                ATTEMPT             POD ID
7962f8be2511e       rnd-dockerhub.huawei.com/official/ubuntu:stress   2 hours ago         Running             euleros             0                   4b8a983451df1
[root@localhost feng]# ps -ef | grep 4b8a983451df1
root      205941       1  0 15:10 ?        00:00:00 /usr/bin/qemu-vanilla-system-x86_64 -name sandbox-4b8a983451df135eeb167eebd36c7911e1c888ecaac899bca92717a33baf00fb -uuid 95cce392-2874-4f03-a64a-a58365ba9458 -machine pc,accel=kvm,kernel_irqchip,nvdimm -cpu host -qmp unix:/tmp/4b8a983451df135eeb167eebd36c7911e1c888ecaac899bca92717a33baf00fb-qmp.sock,server,nowait -m 2048M,slots=10,maxmem=128G -device virtio-serial-pci,disable-modern=false,id=serial0,romfile= -device virtconsole,chardev=charconsole0,id=console0 -chardev file,id=charconsole0,path=/var/log/qemu/console-4b8a983451df135eeb167eebd36c7911e1c888ecaac899bca92717a33baf00fb.log -device virtio-9p-pci,disable-modern=false,fsdev=state-dir,mount_tag=state-dir,romfile= -fsdev local,id=state-dir,path=/run/kata-sandboxer/state/4b8a983451df135eeb167eebd36c7911e1c888ecaac899bca92717a33baf00fb,security_model=none -device virtio-9p-pci,disable-modern=false,fsdev=root-dir,mount_tag=root-dir,romfile= -fsdev local,id=root-dir,path=/run/kata-sandboxer/root4b8a983451df135eeb167eebd36c7911e1c888ecaac899bca92717a33baf00fb,security_model=none -device vhost-vsock-pci,disable-modern=false,vhostfd=3,id=vsock1,guest-cid=4097480961,romfile= -object rng-random,id=rng1,filename=/dev/urandom -device virtio-rng-pci,rng=rng1,romfile= -vga none -no-user-config -nodefaults -nographic -no-reboot -daemonize -realtime mlock=off -kernel /var/lib/kata/kernel -initrd /var/lib/kata/kata-containers-initrd.img -append tsc=reliable no_timer_check rcupdate.rcu_expedited=1 i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 i8042.noaux=1 noreplace-smp reboot=k console=hvc0 console=hvc1 iommu=off cryptomgr.notests net.ifnames=0 pci=lastbus=0 root=/dev/pmem0p1 rootflags=dax,data=ordered,errors=remount-ro ro rootfstype=ext4 systemd.show_status=false panic=1 nr_cpus=56 agent.use_vsock=true systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service systemd.mask=systemd-networkd.socket scsi_mod.scan=none -pidfile /tmp/4b8a983451df135eeb167eebd36c7911e1c888ecaac899bca92717a33baf00fb.pid -D /var/log/qemu/sandbox-4b8a983451df135eeb167eebd36c7911e1c888ecaac899bca92717a33baf00fb.log -smp 1,cores=1,threads=1,sockets=1,maxcpus=1
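To make the "no shim on the host" point concrete, here is a hedged sketch of how a client could reach the shimv2 task service published by a running sandbox. The socket path is a made-up placeholder (only the container ID is taken from the output above), and a unix socket is assumed for simplicity; the vsock address used by the Kata PoC would be dialed analogously:

// Sketch only: dialing a sandbox-published shimv2 task endpoint over ttrpc.
// The address is a placeholder reported by the (hypothetical) sandbox plugin.
package main

import (
	"context"
	"log"
	"net"

	"github.com/containerd/containerd/runtime/v2/task"
	"github.com/containerd/ttrpc"
)

func main() {
	// Placeholder address that the sandbox plugin would report for this sandbox.
	addr := "/run/sandbox/4b8a983451df1/task.sock"

	conn, err := net.Dial("unix", addr)
	if err != nil {
		log.Fatal(err)
	}

	client := ttrpc.NewClient(conn)
	defer client.Close()

	// The task service inside the sandbox (e.g. kata-agent) answers directly,
	// so no per-container shim process runs on the host.
	tasks := task.NewTaskClient(client)
	resp, err := tasks.State(context.Background(), &task.StateRequest{ID: "7962f8be2511e"})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("container status: %v", resp.Status)
}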

The sandbox plugin is like the snapshotter plugin: in the CRI RunPodSandbox method, we just need to add a WithNewSandbox opt to the opts.
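To illustrate what "add a WithNewSandbox opt to the opts" means, here is a self-contained sketch of the functional-options pattern involved. WithNewSandbox itself comes from the PoC draft and its real signature may differ, so the option below is a local stand-in, not the PoC's actual code:

// Stand-in sketch of a WithNewSandbox-style option (illustrative only).
package main

import "fmt"

// container mimics the fields a NewContainerOpts function would set.
type container struct {
	id        string
	sandboxID string
}

type newContainerOpt func(*container)

// withNewSandbox binds the container being created to a sandbox instance
// managed by the sandbox plugin, instead of relying on a pause container.
func withNewSandbox(sandboxID string) newContainerOpt {
	return func(c *container) { c.sandboxID = sandboxID }
}

func newContainer(id string, opts ...newContainerOpt) *container {
	c := &container{id: id}
	for _, o := range opts {
		o(c)
	}
	return c
}

func main() {
	// In RunPodSandbox, the CRI implementation would pass one extra option.
	c := newContainer("7962f8be2511e", withNewSandbox("4b8a983451df1"))
	fmt.Printf("container %s runs in sandbox %s\n", c.id, c.sandboxID)
}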

I think making the sandbox a first-class citizen will make the architecture much cleaner than it is now. The pause container can be removed, and for VM-based containers, no shim is needed on the host.

dims commented

the pause container can be removed <<<<< Music to my ears!!

the pause container can be removed <<<<< Music to my ears!!

A solution may be to split PodSandbox into "Pod" and "Sandbox". Then the sandbox is the environment in which to run the pod, and the pod is used to group several containers and share namespaces.
For Kata Containers, the main flow changes to:

  1. create sandbox with network
  2. create pod
    2.1) pull pause container image
    2.2) setup bundle for pause container
    2.3) create and start pause container
    2.4) setup other pod env
  3. create app containers
    3.1) pull app container image
    3.2) setup bundle for app container
    3.3) create and start app container

I think the "pause container" is much like one kind of implementation of sanbox. maybe for the new runc sandbox plugin, we can directly start the shimv2 server in the new namespace, as a replace of the old "pause" process, and listen to an abstract unix domain socket. @jiangliu

For runc, it is the runc sandbox plugin's responsibility to provide an environment to run the pod; whether to start a "pause" process is up to the plugin. Creating a pause container after the sandbox plugin has already created a sandbox seems to be a redundant step.

At least for VM-based containers, the pause container is really redundant.

... an abstract unix domain socket ...

(We probably don't want to bring back CVE-2020-15257, right? 😅)

With that concern in mind, maybe the shimv2 server should still start itself in the root namespace, and start a pause process itself in the pod's namespace. I still think there is no need for a "pause container": the sandbox plugin can manage the "sandbox", and creating a pause container is effectively creating a sandbox, which duplicates the work of the sandbox plugin. @tianon

I think the "pause container" is much like one kind of implementation of sanbox. maybe for the new runc sandbox plugin, we can directly start the shimv2 server in the new namespace, as a replace of the old "pause" process, and listen to an abstract unix domain socket. @jiangliu

for runc, it is the runc sandbox plugin's responsbility to provide an environment to run pod, whether to start a "pause" process is up to the plugin's decision. the creation of pause container after the sandbox plugin created a sandbox seems to be a redundant step.

at least for the vm based container, the pause container is really redundant.

It depends on the implementation details. If there's nothing within the VM but outside the pod, the pause container is redundant. But if we run some services within the VM but outside the pod, we may still need the pause container, as with runC.
Take Kata Containers as an example: we still run kata-agent/chronyd etc. within the VM but outside of the pod.

Yes, but actually whether the pause container can be removed is not the point. What I want to emphasize is that after the sandbox is started, the shimv2 API needs to be exposed, instead of waiting for the tasks plugin to start a shimv2 process.

c3d commented

Here is another example of an issue that would benefit from having more information from containerd:

kata-containers/kata-containers#2071

The problem there is that Kata Containers cannot correctly compute the number of VCPUs because the information is not there. It can be inferred in the case of CRI-O through some annotation, but that's not robust.

The initial Sandbox API PR is merged.