Container Bootfs (bootfs below) is an image converter aiming to minimize image size and speed up boot time with block-level image chunking, de-duplication and lazy-pull execution, without any modifications to container runtimes, registries, or any dedicated NFS infrastructure. The status of this project is Rough PoC, and bootfs is incomplete as mentioned later. Currently, we leverage casync and desync for provisioning the rootfs and lazy-pulling image data.
Now, the next generation of the OCI image spec is under active discussion!
- https://groups.google.com/a/opencontainers.org/forum/#!topic/dev/icXssT3zQxE
- https://github.com/openSUSE/umoci/issues/256
Roughly speaking, some of the points that have been frequently discussed about current OCI images are the following.
- Storage inefficiency of layer-level de-duplication.
- Slow container startup time caused by lack of lazy-pull functionality.
- Lack of seek functionality on tar archive format.
So far, several effective concepts have been proposed (in alphabetical order).
- CernVM-FS Graph Driver Plugin for Docker : http://iopscience.iop.org/article/10.1088/1742-6596/1085/3/032019
- FILEgrain : https://github.com/AkihiroSuda/filegrain
- Slacker : https://www.usenix.org/node/194431
- atomfs and stacker : https://fosdem.org/2019/schedule/event/containers_atomfs/
If there are projects that should be added, please let me know.
We aimed to achieve block-level de-duplication of container images on the store, on transfer and on the execution node, as well as lazy-pull execution, without any modification to the container runtime or registry, as follows.
You can generate new images which can be stored in a de-duplicated manner and run with lazy-pull, based on an existing one. We developed an image converter which generates the following data.
- Boot image: The generated Docker image. This image includes a boot program which is responsible for setting up the execution environment on boot, using casync and desync (both of them are also included in the image), and then exec-ing the original ENTRYPOINT app in the container. We use casync to provision the original image's rootfs with FUSE, based on the included metadata (aka caibx or caidx). Through the desync process, most of the original rootfs data is pulled lazily from the remote chunk store on access and cached locally. We use desync's cache functionality, so if some blobs are already on the node, desync just uses them without pulling them remotely, which leads to block-level de-duplication on transfer. If you use a container volume as the local cache, it can be shared by several containers on the node, so you can achieve block-level inter-container de-duplication on the node. This boot image follows the Docker image spec, so you can pull and run it from a container registry in the normal way, without modification to the container runtime or registry. Recently, we have been trying several archive formats other than catar, for example ISO9660, which has an index header at the top of the archive, so we can pull arbitrary files lazily without parsing the entire archive (which tar or catar requires).
- Rootfs blobs: The original image's block-level, CDC-chunked rootfs blobs. We use casync for chunking (see the sketch after this list). Put these blobs somewhere like a cluster-global storage (we call it the remote chunk store). If you store the sets of blobs generated from several containers in the same store, you achieve block-level de-duplication on the store.
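As a rough illustration of the chunking step, a converter can drive casync's make subcommand to produce the chunk store and an index from an unpacked rootfs. The file names and paths below are illustrative only; the actual converter in this repo runs this inside a container.

# Sketch only: chunk an unpacked rootfs into CDC chunks plus an index file.
# Output paths and index name are illustrative, not the converter's real layout.
casync make --store=/output/rootfs.castr /output/rootfs.caidx /path/to/unpacked/rootfs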
At runtime, the boot program sets up the execution environment, casync (for catar) or the mount command (for ISO9660) provisions the original rootfs using FUSE, and desync pulls the rootfs blobs lazily from the remote chunk store, as mentioned above.
The remote chunk store can be anything desync supports.
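For illustration, the boot-time flow could look roughly like the shell sketch below. The command names are real (desync, mountpoint, chroot), but the exact flags, paths and the mount/exec details are assumptions and differ from the actual boot program shipped in the boot image.

#!/bin/sh
# Rough sketch of the boot flow, not the actual boot program.
set -e

INDEX=/.bootfs/rootfs.caibx    # metadata shipped inside the boot image (assumed path)
CACHE=/.bootfs/rootfs.castr    # local chunk cache, possibly a shared volume
NEWROOT=/.bootfs/newroot       # FUSE mountpoint for the provisioned rootfs

mkdir -p "${NEWROOT}" "${CACHE}"

# FUSE-mount the original rootfs; chunks are fetched lazily from ${BLOB_STORE}
# (e.g. an SSH store) and cached in ${CACHE}.
desync mount-index -s "${BLOB_STORE}" -c "${CACHE}" "${INDEX}" "${NEWROOT}" &

# Wait until the FUSE mount becomes available.
while ! mountpoint -q "${NEWROOT}"; do sleep 0.1; done

# Switch into the provisioned rootfs and exec the original ENTRYPOINT
# (bootfs actually uses a move mount; chroot is shown only for simplicity).
exec chroot "${NEWROOT}" /path/to/original/entrypoint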
As we will mention in the TODO list later, by extending desync to talk the container registry API, we believe we can combine the boot image and the blobs into one OCI-compatible image in a way similar to what the FILEgrain project does, and pull it the way container runtimes do, which means we would no longer need dedicated remote chunk stores.
In our example, we use an SSH server with casync installed.
See the sample SSH server container's Dockerfile in this repo, which is quite simple.
FROM rastasheep/ubuntu-sshd:latest
RUN apt update -y && apt install -y casync
CMD ["/usr/sbin/sshd", "-D"]
When you use a volume as the local cache, you can share it among containers on the node.
The volume needs to be mounted on a specific path (/.bootfs/rootfs.castr).
It is useful to keep the volume alive using a simple volume-keeper container like the one below, and share it among containers.
FROM busybox:latest
RUN mkdir -p /.bootfs/rootfs.castr
VOLUME /.bootfs/rootfs.castr
CMD tail -f /dev/null
The status of bootfs is currently Rough PoC, so it is not perfect yet. Some of the TODOs are listed below.
- We need to evaluate bootfs in quantitative ways (critical !!!).
- We use move mount and desync's FUSE mount functionality to provision the rootfs, so we need the insecure runtime options --privileged and --device /dev/fuse. A similar FUSE-related issue can be found on the Docker repo.
- We cannot pull blobs from a container registry, which means we cannot combine the boot image and the blobs into one container image and put it on a container registry. This is because we rely on desync for pulling blobs, and desync doesn't talk the registry API. First we need to extend desync to support the registry API, and then combine the boot image and blobs in the way the FILEgrain project proposes. By doing so, we won't need dedicated remote chunk stores anymore.
- We cannot use the container's volume functionality unless we create mountpoint placeholders (dummy files or directories) in the original rootfs in advance, because the provisioned rootfs is read-only and we cannot create the placeholders at runtime.
- The SSH client implementation is very ad hoc. For example, desync relies on the system's ssh client and we are using Dropbear, which may be fine, but we inherit the original rootfs's user information configured in /etc (including the /etc/passwd file, etc.) without creating a bootfs-specific one. We also need to think about authentication, but currently we don't use any certificates and also ignore Dropbear's known_hosts checking.
- The boot image is heavy. However, it would be shared among all containers on the node, thanks to the container runtime's native layer-level de-duplication. Maybe we need to create a lighter binary which combines the boot program's setting-up functionality, the rootfs-provisioning functionality and desync's lazy-pull functionality.
- We make blobs only from the original view of the rootfs, not from layers. This means we throw away the layer information of the original image.
- Any app can see the casync and desync processes and thus can break the world by reaping them. We need to unshare(2) the PID namespace between the setup-related processes and the others (see the sketch after this list).
- Currently, we haven't hit any trouble from move mounting the new rootfs onto /, which could affect the casync and desync processes that rely on directories or files in the original rootfs (like pulled blobs). But move mounting could possibly affect them in dangerous ways. Maybe we need to unshare(2) mount namespaces between the setup-related processes and the others.
- If desync cannot pull enough blobs, it leads to failure or unstable behavior in the FUSE-mounted rootfs; there is currently no static analysis when it FUSE-mounts the rootfs, and we might need to add one.
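As a rough idea of the namespace isolation mentioned above, the setup-related processes could be started in their own PID (and mount) namespaces with util-linux's unshare(1). This is only a sketch of the direction, not something bootfs does today, and the setup-rootfs.sh script name is hypothetical.

# Sketch only: run the rootfs-provisioning processes in separate PID and mount
# namespaces so the application cannot see or reap them.
unshare --fork --pid --mount --mount-proc /bin/sh /setup-rootfs.sh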
We use the cache container's volume as the local cache on the node, and an SSH server container named ${SSH_SERVER_NAME} as the remote chunk store, which uses a volume ${SSH_SERVER_STORE}.
We use the ${CONVERTER_OUTPUT_DIR} directory to receive the output of the image converter.
LOCAL_CACHE_NAME=node-local-cache
LOCAL_CACHE_STORE=/tmp/node-local-cache
SSH_SERVER_NAME=ssh-casync-server
SSH_SERVER_STORE=/tmp/ssh-casync-server-store
CONVERTER_OUTPUT_DIR=/tmp/converter-output
mkdir ${LOCAL_CACHE_STORE} ${SSH_SERVER_STORE} ${CONVERTER_OUTPUT_DIR}
Build the remote chunk store container at /sample/ssh and run it.
sudo docker build -t ${SSH_SERVER_NAME}:v1 .
sudo docker run --rm -d --network="bridge" \
--name ${SSH_SERVER_NAME} \
-v ${SSH_SERVER_STORE}:/store \
${SSH_SERVER_NAME}:v1
SSH_SERVER_IP=$(sudo docker inspect ${SSH_SERVER_NAME} --format '{{.NetworkSettings.IPAddress}}')
Then build the local cache container at /sample/cache and run it.
sudo docker build -t ${LOCAL_CACHE_NAME}:v1 .
sudo docker run --rm -d \
--name ${LOCAL_CACHE_NAME} \
-v ${LOCAL_CACHE_STORE}:/.bootfs/rootfs.castr \
${LOCAL_CACHE_NAME}:v1
Build the image converter as a container at / of this repo.
sudo docker build -t mkimage:latest .
sudo docker run -i -v /var/run/docker.sock:/var/run/docker.sock \
-v ${CONVERTER_OUTPUT_DIR}:/output \
mkimage:latest ubuntu:latest ubuntu-converted:latest
Then, store the blobs into the remote chunk store container's volume.
sudo mv ${CONVERTER_OUTPUT_DIR}/rootfs.castr/* ${SSH_SERVER_STORE}/
Finally, run the converted image. The boot program provisions the rootfs with FUSE and pulls the blobs lazily from the remote chunk store.
sudo docker run -it --privileged --device /dev/fuse \
--volumes-from ${LOCAL_CACHE_NAME} \
-e BLOB_STORE=ssh://root@${SSH_SERVER_IP}/store \
-e DROPBEAR_PASSWORD=root \
ubuntu-converted:latest
You can share the local cache among containers by specifying the --volumes-from ${LOCAL_CACHE_NAME} runtime option.
We can see how many block-level blobs are actually pulled lazily. On boot, the number of cached blobs will look like the following.
find ${LOCAL_CACHE_STORE} -name '*.cacnk' | wc -l
104
find ${SSH_SERVER_STORE} -name '*.cacnk' | wc -l
966
The number of blobs in ${LOCAL_CACHE_STORE} will increase on access to the rootfs.
After executing the top command inside the container, the number of cached blobs increases like below.
find ${LOCAL_CACHE_STORE} -name '*.cacnk' | wc -l
142
find ${SSH_SERVER_STORE} -name '*.cacnk' | wc -l
966