kata-containers/kata-containers

image pulling inside sandbox

bergwolf opened this issue · 26 comments

Motivation

We want to allow pulling container images inside the sandbox for the following reasons:

  • security: some users do not want their container image data to be present on the host
  • isolation: hard multi-tenancy requires that different users' container image data must not be mixed together
  • private registry: for cloud providers, it is possible that users are pulling from a private registry that is only accessible from within a user's VPC network
  • charging: for cloud providers, it is important to be able to charge for the network usage of container image pulls

Architecture

There are two possibilities for pulling images inside the sandbox. The first is to simply pull container images inside the guest. The downside is that container images are not shared across sandboxes.

image

The other one is that we continue pulling container images on the host, but inside the sandbox namespace/cgroups. The downside is that the image daemon is more complex to implement.

image

Changes to Shimv2 API

  • Add a PullImage API to ask the shim to pull a container image, passing the necessary auth info for it to do so
  • PullImage(ctx context.Context, req *PullImageRequest)
  • Change CreateContainer API to pass in an image reference so that the shim knows which image a container is using
  • Add Image *ImageSpec field to CreateTaskRequest (a sketch of these additions follows below)
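
A minimal sketch of what these additions could look like on the shim side, written in Go; the type and field names are illustrative assumptions, not the final containerd API:

```go
package shimapi

import "context"

// ImageSpec, PullImageRequest and PullImageResponse are illustrative shapes
// for the proposed additions; the real containerd types may differ.
type ImageSpec struct {
	Image string // image reference, e.g. "docker.io/library/nginx:latest"
}

type PullImageRequest struct {
	Image ImageSpec // image to pull inside the sandbox
	Auth  string    // registry credentials forwarded from the CRI request
}

type PullImageResponse struct {
	ImageRef string // digest-resolved reference reported back to containerd
}

// The task service would gain one image-related method next to the existing
// Create/Start/Delete calls, and CreateTaskRequest would carry an Image field.
type TaskServiceWithImages interface {
	PullImage(ctx context.Context, req *PullImageRequest) (*PullImageResponse, error)
}
```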

Changes to agent API

  • Add a PullImage API to ask the agent to pull a container image to a specific location (sketched below)
  • PullImage(ctx context.Context, req *PullImageRequest)
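
Correspondingly, a sketch of the agent-side call in Go; the distinguishing part is the target location inside the guest, and all names here are assumptions rather than the actual agent protocol:

```go
package agentapi

import "context"

// PullImageRequest is the guest-agent variant: besides the image reference and
// credentials, it names where in the guest the image should be placed.
type PullImageRequest struct {
	Image       string // image reference to pull
	ContainerID string // container the image is pulled for
	TargetPath  string // destination inside the guest, e.g. a per-container dir
	Auth        string // registry credentials, if any
}

// ImageService is the agent-side surface the shim would call over ttrpc/gRPC.
type ImageService interface {
	PullImage(ctx context.Context, req *PullImageRequest) error
}
```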

Containerd/CRI modification

  • When pulling an image for a container, ask the shim to do it instead of handling it inside containerd (see the dispatch sketch below)
  • When creating a container, send the container image reference to the shim instead of trying to resolve it into a local path inside containerd
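
As an illustration of these two bullets, a hedged sketch of the dispatch the CRI plugin might perform; the offload flag and helper names are hypothetical, not containerd APIs:

```go
package cri

import (
	"context"
	"errors"
)

// ShimImagePuller abstracts the extended shim task API (hypothetical).
type ShimImagePuller interface {
	PullImage(ctx context.Context, imageRef, auth string) (digest string, err error)
}

// pullForContainer decides whether the pull is offloaded to the sandbox's shim
// (and ultimately the guest agent) or handled on the host as it is today.
func pullForContainer(ctx context.Context, imageRef, auth string, shim ShimImagePuller, offload bool) (string, error) {
	if offload {
		return shim.PullImage(ctx, imageRef, auth)
	}
	return pullOnHost(ctx, imageRef, auth)
}

// pullOnHost stands in for containerd's existing host-side pull path; stubbed here.
func pullOnHost(ctx context.Context, imageRef, auth string) (string, error) {
	return "", errors.New("host-side pull not shown in this sketch")
}
```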

Impact on Container Image Life Cycle Management

  • containerd needs to query all shims to collect global status of container images on a host
  • kubelet needs to be tenant-aware so that same image can exist for different users/sandboxes
  • TBD.

I think this is something we could make happen within containerd. Right now, the image/content/snapshot service is decoupled from tasks (runtime), so I don't think this would impact any of the existing APIs. Most of the work would be within the CRI plugin for this path.

@bergwolf please review and decide if we still need this feature (if so remove the "needs-review" label).

@bergwolf is this feature still in the near-term plan? @crosbymichael if this happens within containerd, what about other high-level runtimes like https://github.com/cri-o/cri-o?

/cc

@bergwolf it looks like this issue has been closed due to the clean up action that was discussed here http://lists.katacontainers.io/pipermail/kata-dev/2021-April/001819.html

Do you plan to reopen this issue? I am working in the Z organisation at IBM and we are interested in this feature. Thanks

What would be the best way to link this to the Confidential Computing RFC [#1332]?
I'm open to suggestions/practices used here, as I see this as a possible issue to cover part of the solution required for Confidential Computing.

sameo commented

What would be the best way to link this to the Confidential Computing RFC [#1332]?
I'm open to suggestions/practices used here, as I see this as a possible issue to cover part of the solution required for Confidential Computing.

It's definitely linked to the confidential computing discussions, and the service offload toggle will hopefully come with the initial PR for it.

I believe the actual implementation can be done in parallel though, and I'm in favor of discussing it through this issue.

sameo commented

@bergwolf the initial proposal here is only about pulling the image inside the guest. Should we extend that proposal to most (if not all) of the container image service? That would open the door to pulling and mounting from the guest, but storing on the host (using e.g. virtiofs or a block device backed by host physical storage).

@sameo the proposal is to pull images inside the sandbox, so it covers pulling both inside the guest and outside on the host. Does that satisfy your requirements?

We have some rough ideas on the image offloading after my discussion with @ashleyrobertson and @magowan:

  • This is my understanding of the containerd and kata components involved in image pulling today (projects are separated by color):

image

  • The potential changes to support image pull in the guest are marked with red circles.
  1. PullImage in containerd's criService
    We need to identify the image type (e.g. the pause image, or whether an image needs offload or not), then forward the PullImage request to the taskService (which needs to be extended) and on to the guest agent. The PullImage response from the agent contains not only the digest but also the image metadata, so that it can be cached in containerd's imageStore, as it is today.
  2. CreateContainer in containerd's criService
    We need to identify the container type, and skip creating snapshots and bundles for containers whose images are going to be pulled in the sandbox.
  3. shim v2 APIs in containerd and implementation in kata
    The APIs need to be extended with image-related functions, such as PullImage.
  4. kata agent
    Listen for image pull requests and pull the images, possibly leveraging tools like skopeo or buildah (a sketch follows below).
    Create a bundle from the images pulled down in the guest before creating/starting the container.
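
A minimal guest-side sketch of item 4, assuming skopeo and umoci were shipped in the guest image; the tool choice and paths are illustrative only, not an agreed design:

```go
package agent

import (
	"context"
	"fmt"
	"os/exec"
	"path/filepath"
)

// pullAndUnpack pulls imageRef into an OCI layout inside the guest and unpacks
// it into an OCI runtime bundle (rootfs + config.json) for the container.
func pullAndUnpack(ctx context.Context, imageRef, containerID string) (string, error) {
	layout := filepath.Join("/run/image-store", containerID)     // assumed path
	bundle := filepath.Join("/run/kata-containers", containerID) // assumed path

	// Pull the image into an OCI image layout inside the guest.
	pull := exec.CommandContext(ctx, "skopeo", "copy",
		"docker://"+imageRef, "oci:"+layout+":latest")
	if out, err := pull.CombinedOutput(); err != nil {
		return "", fmt.Errorf("skopeo copy failed: %v: %s", err, out)
	}

	// Unpack the layout into a bundle the agent can start a container from.
	unpack := exec.CommandContext(ctx, "umoci", "unpack",
		"--image", layout+":latest", bundle)
	if out, err := unpack.CombinedOutput(); err != nil {
		return "", fmt.Errorf("umoci unpack failed: %v: %s", err, out)
	}
	return bundle, nil
}
```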

So the whole architecture changes from the existing one:
image

To a new architecture like:
image

Where shimv2 may be extended like:
image

The intention is to add only the PullImage API and leave ListImages, ImageStatus and ImageFsInfo untouched, because the image information (digest, metadata, etc.) will be returned from the guest agent to the criService and stored/cached in containerd. RemoveImage is optional, depending on whether there is a real case for deleting image contents in the guest.
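
For illustration, the kind of metadata the agent's PullImage response could carry back so that containerd can keep answering ListImages/ImageStatus/ImageFsInfo from its own store; the field names are assumptions, not a defined wire format:

```go
package shimapi

// PulledImageInfo is an illustrative shape for the metadata returned from the
// guest agent after a pull and cached in containerd's imageStore, as today.
type PulledImageInfo struct {
	Digest      string            // content digest of the pulled image
	RepoTags    []string          // tags resolved during the pull
	RepoDigests []string          // digest references resolved during the pull
	SizeBytes   uint64            // unpacked size reported from the guest
	Labels      map[string]string // OCI image labels, if present
}
```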

So the new lifecycle changes to:
image

@huoqifeng -- can you present a diagram for pod lifecycle (similar to what you have for "today" above) based on your proposal for new arch?

I'm not confident that the image pull would occur after createPodSandbox, and so am unsure how this will be handled. More specifically, if the sandbox is not up, there is no VM with a kata-agent running to manage pulling the image inside the guest.

@huoqifeng -- can you present a diagram for pod lifecycle (similar to what you have for "today" above) based on your proposal for new arch?

I'm not confident that the image pull would occur after createPodSandbox, and so am unsure how this will be handled. More specifically, if the sandbox is not up, there is no VM with a kata-agent running to manage pulling the image inside the guest.

@egernst I updated the diagrams above to reflect the proposed lifecycle. The pause image will still be pulled on the host in createPodSandbox, which is used to create the PodSandbox and launch the guest VM. There is no change in this part.

First let's talk about the scope and assume we are

  1. focusing on "Pod/Container Run" stage and do not cover the "Container Build and Ship" stage,
  2. supporting confidential and non-confidential computing, with an emphasis on confidential computing.
  3. building on the CRI interface, with no support for generic containerd client interfaces.
    Any more rules to scope our work?
sameo commented

First let's talk about the scope and assume we are

1. focusing on "Pod/Container Run" stage and do not cover the "Container Build and Ship" stage,

Agreed.

2. supporting confidential and non-confidential computing, with an emphasis on confidential computing.

I think this should be designed and implemented completely independently from confidential computing. The dependency goes only one way: confidential computing depends on that, but not the other way around.

3. building on the CRI interface, with no support for generic containerd client interfaces.

I think the proposed change would mostly be at the shim API level rather than at the CRI level.

c3d commented
  1. supporting confidential and non-confidential computing, with an emphasis on confidential computing.

I think this should be designed and implemented completely independently from confidential computing. The dependency goes only one way: confidential computing depends on that, but not the other way around.

I am not entirely sure this is a safe assumption. More precisely, there should be no code dependency initially, but I believe there is an architecture dependency. Here are two points I believe impact the diagrams above if we want to be able to accommodate CC later:

  1. A different trust domain (the tenant) is responsible for green-lighting startContainer. In particular, you can't start the container before attestation. It seems to me that would even be true of any pause container. Therefore, the architecture must be such that the sequence between VM-boot, attestation, agent init, image load, image validation (if any) and container start is respected, and that the relevant measurement points can be put under the control of the tenant (i.e. they do not rely on the host). As I see it, the current arch happens to have these properties (or at least is not incompatible with them), but we want to make sure it is not luck.
  2. In the long term, the tenant will probably not trust file-based access to storage (e.g. virtiofsd), so we should consider being able to do the image download on a block device encrypted with some tenant-only key. This means that the "metadata" in the above diagrams comes from two distinct sources in that model. One half comes from the host, e.g. some host mount point or device; the other half comes from the tenant, e.g. encryption keys for the image storage.

This is not to say that we cannot do things incrementally, but at least having a clear vision of where we are going will avoid having to redo the APIs later.

sameo commented
1. A different trust domain (the tenant) is responsible for green-lighting `startContainer`. In particular, you can't start the container before attestation. It seems to me that would even be true of any pause container. Therefore, the architecture must be such that the sequence between VM-boot, attestation, agent init, image load, image validation (if any) and container start is respected, and that the relevant measurement points can be put under the control of the tenant (i.e. they do not rely on the host). As I see it, the current arch happens to have these properties (or at least is not incompatible with them), but we want to make sure it is not luck.

2. In the long term, the tenant will probably not trust file-based access to storage (e.g. virtiofsd), so we should consider being able to do the image download on a block device encrypted with some tenant-only key. This means that the "metadata" in the above diagrams comes from two distinct sources in that model. One half comes from the host, e.g. some host mount point or device; the other half comes from the tenant, e.g. encryption keys for the image storage.

I agree with the two points above, but I believe they're also emphasizing the fact that CC depends on that feature, but not the other way around :-)
Regardless of where the CC implementation decides to store the image layers, and when the image loading/pulling would be called, the image service implementation (not the agent or runtime modifications) should be CC agnostic.

sameo commented

@huoqifeng For the record, would you mind elaborating on why we'd only need to add PullImage but no other image related commands to the shim API?

  1. supporting confidential and non-confidential computing, with an emphasis on confidential computing.

I think this should be designed and implemented completely independently from confidential computing. The dependency goes only one way: confidential computing depends on that, but not the other way around.

I am not entirely sure this is a safe assumption. More precisely, there should be no code dependency initially, but I believe there is an architecture dependency. Here are two points I believe impact the diagrams above if we want to be able to accommodate CC later:

  1. A different trust domain (the tenant) is responsible for green-lighting startContainer. In particular, you can't start the container before attestation. It seems to me that would even be true of any pause container. Therefore, the architecture must be such that the sequence between VM-boot, attestation, agent init, image load, image validation (if any) and container start is respected, and that the relevant measurement points can be put under the control of the tenant (i.e. they do not rely on the host). As I see it, the current arch happens to have these properties (or at least is not incompatible with them), but we want to make sure it is not luck.
  2. In the long term, the tenant will probably not trust file-based access to storage (e.g. virtiofsd), so we should consider being able to do the image download on a block device encrypted with some tenant-only key. This means that the "metadata" in the above diagrams comes from two distinct sources in that model. One half comes from the host, e.g. some host mount point or device; the other half comes from the tenant, e.g. encryption keys for the image storage.

This is not to say that we cannot do things incrementally, but at least having a clear vision of where we are going will avoid having to redo the APIs later.

I strongly agree with these two points.

  1. Confidential computing will strongly affect the Kata architecture, evolving it into a real Kata 2.0 architecture. Originally Kata acts as a runtime engine, interchangeable with runC. That means Kata needs to strictly keep compatibility with runC, especially setting up the bundle on the host and sharing it with the VM via 9p/virtiofs. If we keep this design constraint, it can never become "Confidential Computing". So the biggest change to the Kata architecture is to break the constraint of being compatible with runC, and evolve Kata from a container runtime into a sandbox engine + container runtime. The proposal containerd/containerd#4131 definitely helps us.
  2. 9p/virtiofs is not an option for confidential computing; filesystem semantics are too complex, so we should rely on block-based interfaces when data travels across trust domains.
3. building on the CRI interface, with no support for generic containerd client interfaces.

I think the proposed change would mostly be at the shim API level rather than at the CRI level.

According to my investigation, we need to enhance kubelet/CRI/containerd/kata.

  1. Enhance k8s/kubelet to support a "Confidential Computing" runtime class and to allocate resources (such as key IDs) for it.
  2. Enhance the CRI image service to support pod-specific images. Currently CRI/containerd images are global (though namespaced); we need a new abstraction of "pod-specific images" that belong to a dedicated pod. The CRI PullImage request already has a pod context attached, so we could associate image pull requests with a specific pod if needed. But we would need to add a pod context parameter to the CRI image service List/Status/Get/Remove requests if we want to support image GC within a pod. Otherwise the current CRI image service interfaces may be kept as is.
  3. Invent shimv3 interfaces. The current shimv2 interface is designed for a Container Runtime; we need another set of interfaces for a Sandbox Engine: to manage images within the VM, to set up bundles within the VM, to manage mount points within the VM, to get the container runtime, etc. (a sketch follows after this list).
  4. With the new shimv3 interface for the sandbox engine, the flow to create a pod basically changes to:
    a) create the sandbox
    b) pull the image and prepare the bundle for the pause container using the sandbox engine interface
    c) get the runtime driver from the sandbox and start the pause container
    d) pull the image and prepare the bundle within the VM for the app container
    e) prepare a special bundle/config on the host with volume/device assignment information, which will be used to pass volumes/devices through to the VM
    f) start the app container
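
To make item 3 above more concrete, a hedged sketch of what a sandbox-engine ("shimv3") surface might look like, following the flow above; all names are assumptions:

```go
package sandbox

import "context"

// SandboxEngine is an illustrative interface for the proposed sandbox-engine
// role: it owns the VM and manages images, bundles and mounts inside it.
type SandboxEngine interface {
	CreateSandbox(ctx context.Context, sandboxID string) error
	// PullImage pulls an image inside the sandbox (guest) for later use.
	PullImage(ctx context.Context, sandboxID, imageRef string) error
	// PrepareBundle builds an OCI bundle inside the guest from a pulled image.
	PrepareBundle(ctx context.Context, sandboxID, containerID, imageRef string) (bundlePath string, err error)
	// Runtime returns the per-sandbox container runtime used to start containers.
	Runtime(sandboxID string) (ContainerRuntime, error)
}

// ContainerRuntime is the container-level half, roughly today's shimv2 role.
type ContainerRuntime interface {
	StartContainer(ctx context.Context, containerID, bundlePath string) error
}
```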

So the key point is to evolve Kata into a sandbox engine instead of patching the current container-centric shimv2 interfaces.

@huoqifeng For the record, would you mind elaborating on why we'd only need to add PullImage but no other image related commands to the shim API?

@sameo, the intention is to add only the PullImage API and leave ListImages, ImageStatus and ImageFsInfo untouched, because the image information (digest, metadata, etc.) will be returned from the guest agent to the criService/imgService and stored/cached in containerd. RemoveImage is optional, depending on whether there is a real case for deleting image contents in the guest. I have also updated the section in the previous proposal.

sameo commented

@huoqifeng For the record, would you mind elaborating on why we'd only need to add PullImage but no other image related commands to the shim API?

@sameo, the intention is to add only the PullImage API and leave ListImages, ImageStatus and ImageFsInfo untouched, because the image information (digest, metadata, etc.) will be returned from the guest agent to the criService/imgService and stored/cached in containerd. RemoveImage is optional, depending on whether there is a real case for deleting image contents in the guest. I have also updated the section in the previous proposal.

Thanks for the clarification. A few questions related to that approach:

  • How would containerd/CRIO handle encrypted layers for which they don't have the private keys?
  • What would happen if a pod asks for the same image layers? Would CRIO/containerd try to re-use those encrypted layers provided that they are stored on the host? Or is that not the intention?
  • What happens when the pod that pulled the image layers terminates? Does it wipe the image layers out? Or is the intention to store them on the host for sharing purposes?
3. building on the CRI interface, with no support for generic containerd client interfaces.

I think the proposed change would mostly be at the shim API level rather than at the CRI level.

According to my investigation, we need to enhance kubelet/CRI/containerd/kata.

1. Enhance k8s/kubelet to support a "Confidential Computing" runtime class and to allocate resources (such as key IDs) for it.

2. Enhance the CRI image service to support pod-specific images. Currently CRI/containerd images are global (though namespaced); we need a new abstraction of "pod-specific images" that belong to a dedicated pod. The CRI PullImage request already has a pod context attached, so we could associate image pull requests with a specific pod if needed. But we would need to add a pod context parameter to the CRI image service List/Status/Get/Remove requests if we want to support image GC within a pod. Otherwise the current CRI image service interfaces may be kept as is.

3. Invent shimv3 interfaces. The current shimv2 interface is designed for a Container Runtime; we need another set of interfaces for a Sandbox Engine: to manage images within the VM, to set up bundles within the VM, to manage mount points within the VM, to get the container runtime, etc.

4. With the new shimv3 interface for the sandbox engine, the flow to create a pod basically changes to:
   a) create the sandbox
   b) pull the image and prepare the bundle for the pause container using the sandbox engine interface
   c) get the runtime driver from the sandbox and start the pause container
   d) pull the image and prepare the bundle within the VM for the app container
   e) prepare a special bundle/config on the host with volume/device assignment information, which will be used to pass volumes/devices through to the VM
   f) start the app container

So the key point is to evolve Kata into a sandbox engine instead of patching the current container-centric shimv2 interfaces.

@jiangliu I think 3. and 4. align with the longer-term goal, while 1. and 2. have a larger scope.
Maybe the trade-off is whether the image offload should proceed incrementally.

* encrypted layers for which they don't have

@sameo I think the proposal is to download and handle the images within the guest VM but not save them on the host, so that

  • containerd/CRIO does not need to handle encrypted layers.
  • if another pod on the same worker asks for the same image layers, the image pull happens again.
  • when the pod that pulled the image layers terminates, the image layers are gone as well.
* encrypted layers for which they don't have

@sameo I think the proposal is to download and handle the images within the guest VM but not save them on the host, so that

  • containerd/CRIO does not need to handle encrypted layers.
  • if another pod on the same worker asks for the same image layers, the image pull happens again.
  • when the pod that pulled the image layers terminates, the image layers are gone as well.

Yes, we call it scoped image management. Currently we have two types of scope:

  1. global scope, which is the current containerd image management mode.
  2. pod scope, where the image lifetime is bound to the associated pod and the image is only available to that pod.
    Logically there's a third scope, user scope for multi-tenancy, which shares images among all pods of a specific user (see the sketch below).
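
A minimal sketch of how the scope notion above could be represented; names are illustrative only:

```go
package images

// ImageScope captures the three scopes described above.
type ImageScope int

const (
	ScopeGlobal ImageScope = iota // today's containerd-wide image store
	ScopePod                      // lifetime bound to one pod, visible only to it
	ScopeUser                     // shared among all pods of one tenant
)

// ScopedImage ties an image record to its scope and, for pod scope, its owner,
// so garbage collection can reap pod-scoped images when the pod terminates.
type ScopedImage struct {
	Ref   string     // image reference
	Scope ImageScope // global, pod or user scope
	PodID string     // owning pod for pod scope; empty otherwise
}
```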

Given all the progress in the CoCo project, I think we can close this. There are still some challenges, but they should be reflected in more recent issues.