containers/composefs

handle nested whiteouts

cgwalters opened this issue · 12 comments

See ostreedev/ostree#2712

Basically...we're now shipping support in ostree for embedding whiteouts. This allows c/image (podman etc.) to be directly pointed at this alternative root in a read-only fashion. Container images shipped this way are "lifecycle bound" with the host (and gain benefits of dedup actually and the efficient ostree on-the-wire deltas).

But this only works because ostree itself doesn't use overlayfs (ostree actually predates overlayfs).

In a composefs future, because overlayfs doesn't nest, we're going to need to figure out how to handle this fun special case.

In a unified storage world things are inherently better here, but hard requring that would actually be an "API break".

I guess another way to say this is that everyone turning on the ostree composefs support is just going to break if they have nested containers today.

So...half baked thoughts. We can clearly detect this situation in ostree when we're generating the composefs image. I think probably what we need to do here is fall back to continuing to create that sub-portion of the filesystem tree as hardlinks outside of a composefs mount (which we already do today actually). Then, we set up a dynamic bind mount from there to that hardlinked subtree.

I think that'd work...

I don't think this is the only issue we're going to have, is it? I mean, you also can't store overlayfs xattrs in the composefs image.

@amir73il Is there a way to "recurse" overlayfs dirs for this? I.e. have an overlayfs mount that looks like it contains files such as whiteouts and regular files with overlayfs xattrs set on them?

I wonder actually if we should have libcomposefs error out if the source data specifies such files. Currently they will produce pretty broken resulting images.

Generally, overlayfs mount doesn't accept an overlayfs as the upper/work dir.
There are scattered assumptions about upper/work fs not being "wierd" and I don't know for sure what all these assumptions are. Need to consult with Miklos if this is important enough.
Note that the requirement that upper/work fs are not "remote" have already been relaxed to allow to support fuse as upper fs.

@amir73il These would be lower dirs though.

Basically, suppose /usr is on overlayfs (via composefs), and you're shipping a copy of some /var/lib/containers/storage/overlay instance in /usr/share/additional-images. You then want podman to combine the lower dirs in /usr/share/additional-images with upper and work dirs in /var/lib/containers/storage/overlay.

In this case, if a file in the container storage has e.g. a metacopy xattr set, we want the container overlayfs mount to "interpret" that, not the composefs overlayfs mount.

Basically, we would want something like mkcomposefs converting the overlay.metacopy xattr to overlay.escape.metacopy and then have overlayfs expose xattrs called overlay.escape.foo as overlay.foo in the mount.

And same with whiteouts, we would like a way to store a file in a lower that overlayfs doesn't consider a whiteout, but looks like a whiteout in the resulting overlayfs mountpoint.

Since ovl already has a mount option userxattr to replace trusted.overlay prefix with user.overlay and all the code used helpers to build the actual xattr names, easiest would would be to support option xattr_prefix="user.cfs" or something like that (could also be trusted.cfs).
Regarding whiteout, simplest would be to support xattr overlay.whiteout. no need to support this for creating whiteouts, only for lookup in lower layers, so should be pretty simple. (I think )

Since ovl already has a mount option userxattr to replace trusted.overlay prefix with user.overlay and all the code used helpers to build the actual xattr names, easiest would would be to support option xattr_prefix="user.cfs" or something like that (could also be trusted.cfs).

That sounds like its easy to implement, but it would mean we would have to use the new prefix for all the files in the lower, and that would add a hard requirement on the new kernel with this support to be able to mount such a file. I'll give it some thought.

Regarding whiteout, simplest would be to support xattr overlay.whiteout. no need to support this for creating whiteouts, only for lookup in lower layers, so should be pretty simple. (I think )

Ideally I would like to avoid checking for a new xattr on all files, as that would slow down everything. I'm thinking we can use an xattr for regular whiteout files, which if sets means that we don't actually treat them as whiteouts.

This issue also relates a bit to containers/storage#1608 in that ideally we support fully general transformations between what is interpreted by the Linux kernel (e.g. the security.selinux xattr as read by the in-kernel SELinux LSM) versus what happens when userspace calls getxattr() and setxattr()).

Regarding whiteout, simplest would be to support xattr overlay.whiteout

This is short for trusted.overlay.whiteout right? But yes reducing this problem domain to just xattr nesting makes sense to me.

Maybe. Not sure how you plan to do this escaping in a nested way.

I'm uncertain whether it makes sense to leap from not supporting this to a fully general nesting. I think this problem is a bit like nested KVM - the driving use cases for one level of nesting became big enough that it really made sense to productize. But having three levels (or more) is far more obscure, enough that I don't know if it's used for anything real.

That said, wouldn't general nesting be implementable by having overlayfs detect the case where it's reading from an overlayfs lower and process "one level" of trusted.overlay.escape?

Anyways all this aside...hmm...so I think my original half baked thought of "un-nesting" would likely work except that we'd need to synthesize a local cfs image. In theory we could pre-compute the digests of these nested cfs image and include a signature covering them in the main cfs image. Or we can just rely on chain-of-trust from the main rootfs to code synthesizing the nested cfs images.

Some initial work on nesting are here: https://github.com/alexlarsson/linux/tree/ovl-nesting

This issue also relates a bit to containers/storage#1608 in that ideally we support fully general transformations between what is interpreted by the Linux kernel (e.g. the security.selinux xattr as read by the in-kernel SELinux LSM) versus what happens when userspace calls getxattr() and setxattr()).

I'm not sure exactly how this applies to composefs. There are two files, the file on the erofs image (A), and the backing file in the CAS (B).

A can have whatever security.selinux xattr we want (we just write it to the image) as overlayfs will expose it to the user as is. The user must pass security checks against this context to open the file

B needs to have a pretty widely accessible context (at least for reading), so we can share these files between images. It will be checked against the context of the user that mounts the composefs filesystem.

Maybe. Not sure how you plan to do this escaping in a nested way

Nesting escapes is easy. The first layer will convert overlay.escape.escape.metacopy to overlay.escape.metacopy, the second with convert overlay.escape.metacopy to overlay.metacopy, and the thrird will interpret overlay.metacopy. However, this is really limited by the current stacking limit of 2 levels.