shadow-maint/shadow

subid ranges sourced from the network store

rhatdan opened this issue · 164 comments

We are seeing a lot of excitement on the podman front around running containers as non-root.

We are taking advantage of User Namespaces, and specifically shadow-utils with /etc/subuid, /etc/subgid, newuidmap, and newgidmap.

But we are now being contacted by "enterprise" customers who use large databases of users, and they want these files and UIDMap information to be handled via LDAP or FreeIPA.

Has there been any thought into making this info available via nsswitch?

I think there should be a way to split the 32-bit uid/gid space, so that a UID gets additional UIDs/GIDs whose high 16 bits are equal to the UID itself.

i.e.
user 1000 -> 65536000-65601535,
user 1001 -> 65601536-65667071,

shadow shouldn't enforce a policy decision like that when it comes to id space segmentation. It can make best-practice recommendations, but that's it.

I'm not sure we can utilize nsswitch without coordinating with glibc, but if glibc did add support for new keywords (like "subuid" and "subgid"), then that seems like a design that would work.

The broader question is one about APIs. The existing glibc APIs that manage UIDs/GIDs trigger the NSS infrastructure to load and parse /etc/nsswitch.conf; this in turn loads plugins that respond to service requests in an authoritative fashion, e.g. the LDAP NSS module. There are no APIs that deal with subuid, subgid, or the concept of newuidmap and newgidmap setup for the guest namespace.

The existing infrastructure would require changes like this (a hypothetical nsswitch.conf sketch follows the list):

  • Create a new API for managing subuid/subgid information.
  • Convert newuidmap/newgidmap to use the new API.
  • Hook the new API into NSS.
  • Extend existing NSS plugins to provide the requisite information for the new API.
    • Start with files reading /etc/subuid and /etc/subgid.
    • Extend LDAP NSS plugins next.
      • Need to decide where the information lives in LDAP.
    • Add nscd cache to ameliorate loading long user files from the network.
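
For illustration only, the configuration side of that first step might end up looking something like the following; the "subuid"/"subgid" database names are hypothetical, and nothing in glibc or shadow defines them today:

# /etc/nsswitch.conf (hypothetical entries)
subuid: files ldap
subgid: files ldap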

Is that what everyone is thinking about?

Yes this is exactly what I was thinking.

I should note that when I said "new API for managing" I meant only that we provide functions that allow you to query the existing data, but not modify that data. We don't need dynamic assignment to be baked into glibc; that has deeper policy implications, and shadow-utils and admins can do that themselves. Likewise the LDAP admin can set up the ranges as they wish without any need to have an API that does the assignment.

This is very interesting to academic environments as well, which use RH-based distros (CERN Scientific Linux 6 / CERN CentOS 7) and have shared clusters to which users can log in.

I have been talking about something like this with @poettering just a short while ago. On the @lxc side we allow for isolated idmaps, i.e. we have a way to give each container an idmap that is isolated from all other containers. For LXD it's easy to keep track of what maps it has given away and how many it has left, but it obviously becomes a problem when another container manager is using the same map. Having a central registry where we can - in a race-free manner - record something along the lines of:

<container-identifier> <starting-host-uid> <range>

would be quite helpful.

I actually think we'd want something even better such that we can query:

  • has this idmap been given away already (does it overlap with an existing mapping) and if not register it right away (this whole process should be transactional)

This should see input from all active people who co-maintain @shadow-maint with @hallyn and myself. Would also be good to hear what @stgraber thinks.

Let's not confuse two different things though.

We have the UIDs allocated to users for their user namespaces. Then we have UID ranges allocated for root-running services that want to use User Namespaces for separation.

This issue is more about the UIDs allocated for users.

The two are conceptually identical; only their permissions differ. If you use new{g,u}idmap for both user and root services, then sub{g,u}id decides what you are allowed to map, independent of whether you're root or not.

Well I am not suggesting we use newuidmap for both. No reason to use this for a root running container engine.

We should have a central way for all container engines to register their id allocations. If it works for unpriv users it works for root as well so there's no additional work associated with this.

I must say I am not particularly convinced that /etc/subuid and /etc/subgid is such a great idea in the first place. Storing these registrations in /etc as if they were system configuration sounds wrong to me, we shouldn't do that anymore. Unless something is being reconfigured /etc should really be considered read-only, and range registration in /etc is something diametrically opposed to that, as it stores runtime information among the configuration in /etc.

In systemd we allocate user IDs dynamically at various places, including in nspawn's userns support and for DynamicUser=1 support for system services. But we never ever write this to /etc, as that's really not suitable for dynamically changing registrations. However, we do supply glibc NSS modules that make sure that whatever we register (regardless if individual uids or ranges of uids) shows up in the system's user database. And I think that's a general approach to follow when allocating user ranges: use NSS as registration database: make sure that the user and all other apps see that you own a range by making sure your users show up in NSS. This reuses existing concepts for maintaining registered ranges (as libc getpwuid() and getpwnam() will just report our entries), and is also friendly towards users, as for example "ps" will show all processes of a userns-using container as owned by your package. It also makes sure that classic user mgmt tools such as "adduser" automatically respect your uid range registrations, since they already check NSS before picking a UID anyway.

Hence, from the systemd PoV: I am very sure we'll never begin using /etc/subuid and /etc/subgid; I think at this time we really shouldn't add any more static databases in /etc that need to be dynamically managed. Instead, we just pick a UID we consider suitable, check against NSS, and only use it if it's not listed there yet (if it is, we pick a different UID). At the same time we make the UID we have now taken possession of show up in NSS so that everybody else knows.

Or in other words: instead of trying to get everybody on board with sharing a new set of database files in /etc/, and then extending it for the network, just make everybody use the same (already existing) API instead (i.e. glibc NSS), and leave it up to the packages to register their ranges with it. Standardize on existing APIs rather than new files. The packages can then decide on their own how they manage their assignments and replicate them across the network.

(In case you wonder: yes, it's very easy to write an NSS module that returns for a UID x from some range a fixed user name "foobar-x", and vice versa)
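
To make that concrete, here is a minimal sketch of such an NSS passwd module. The module name ("foobar"), the UID range, and all the field values are invented for the example; a real module would also implement getpwnam_r and friends, be built as libnss_foobar.so.2, and have "foobar" added to the passwd line of /etc/nsswitch.conf.

#include <sys/types.h>
#include <nss.h>
#include <pwd.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

#define RANGE_START 100000u   /* made-up range, purely for the example */
#define RANGE_END   165535u

/* Copy s into the caller-provided buffer and advance the cursor. */
static char *stash(char **cur, size_t *left, const char *s)
{
	size_t len = strlen(s) + 1;
	if (len > *left)
		return NULL;
	char *out = memcpy(*cur, s, len);
	*cur += len;
	*left -= len;
	return out;
}

enum nss_status _nss_foobar_getpwuid_r(uid_t uid, struct passwd *pwd,
				       char *buf, size_t buflen, int *errnop)
{
	char name[64];
	char *cur = buf;
	size_t left = buflen;

	if (uid < RANGE_START || uid > RANGE_END)
		return NSS_STATUS_NOTFOUND;

	/* Synthesize "foobar-<uid>" and the other mandatory passwd fields
	 * inside the caller-supplied buffer. */
	snprintf(name, sizeof(name), "foobar-%u", (unsigned) uid);
	pwd->pw_name   = stash(&cur, &left, name);
	pwd->pw_passwd = stash(&cur, &left, "*");
	pwd->pw_gecos  = stash(&cur, &left, "synthetic userns user");
	pwd->pw_dir    = stash(&cur, &left, "/");
	pwd->pw_shell  = stash(&cur, &left, "/sbin/nologin");
	pwd->pw_uid    = uid;
	pwd->pw_gid    = uid;

	if (!pwd->pw_name || !pwd->pw_passwd || !pwd->pw_gecos ||
	    !pwd->pw_dir || !pwd->pw_shell) {
		*errnop = ERANGE;
		return NSS_STATUS_TRYAGAIN;
	}
	return NSS_STATUS_SUCCESS;
}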

Well that is exactly what this issue is about. Adding NSS support to newuidmap and newgidmap.

I must say I am not particularly convinced that /etc/subuid and /etc/subgid is such a great idea in the first place. Storing these registrations in /etc as if they were system configuration sounds wrong to me, we shouldn't do that anymore. Unless something is being reconfigured /etc should really be considered read-only, and range registration in /etc is something diametrically opposed to that, as it stores runtime information among the configuration in /etc.

I think you misunderstand these files. No one intends to use them as databases and they aren't used as such now. They are config files that statically tell you what ids a user can use.

No. I say: don't bother with uidmap/gidmap. Just use the regular NSS user/group db, and fill it through your own NSS module. I would advise podman to simply not bother with uidmap/gidmap, but just provide an NSS module that exposes the ranges it took possession of.

I think you misunderstand these files. No one intends to use them as databases and they aren't used as such now. They are config files that statically tell you what ids a user can use.

So they are an extension of the usual user database. And I argue that the usual user database should not be considered configuration. I mean, there's a good reason why all those new OS approaches (such as Atomic and stuff) try hard to find alternatives to having to write every user into /etc.

That's not orthogonal though, as you suggest. The idea is that you would want a way to allow a specific set of ids to be delegated to an unpriv user and these delegatable ranges are recorded in a central place: subid files. That's not opposing the db.

don't bother with uidmap/gidmap

That's not possible without regressing the ability of unprivileged users to create complex id mappings that have been delegated to them by the system administrator. This has also worked independently of systemd and on other systems so I wouldn't want to make this systemd's job too.

Well, i mean, you can always keep the db if you really really like to, but what I am saying is: the db that everybody should check is the existing NSS user/group database, and not subuid/subgid.

I mean, if you want to use newuidmap/newgidmap as your SUID binary of choice to configure your /proc/$PID/uid_map then by all means, go ahead, but also: everything else is fine too, and I'd not bother with telling people that they have to register their ranges there. Instead, just let people use any tool they want, as long as they register the ranges in the NSS user/group databases.

or in other words, I'd suggest buildah/podman to just ship their own tool to acquire a uid range (possibly with a suid binary of their own, or through ipc-based privileged separation), and make sure to register what they acquire in NSS, instead of pushing everything down to /etc/subuid + /etc/subgid, which means you can never use buildah/podman in an environment with read-only /etc...

That's not possible without regressing the ability of unprivileged users to create complex id mappings that have been delegated to them by the system administrator. This has also worked independently of systemd and on other systems so I wouldn't want to make this systemd's job too.

I think I am repeating myself here: I am proposing to use the glibc NSS user/group db as the place to make registrations show up, and as the place to guarantee that every package uses its own range. Nothing systemd specific in that at all. glibc is not a systemd project, and by doing that you create a solution working on all general purpose Linux systems that support NSS, and there's nothing systemd-specific about that.

or in other words, I'd suggest buildah/podman to just ship their own tool to acquire a uid range (possibly with a suid binary of their own, or through ipc-based privileged separation

You're not really suggesting that we start shipping custom suid idmap binaries alongside every runtime when we have newidmap to avoid just that?

Well, i mean, you can always keep the db if you really really like to, but what I am saying is: the db that everybody should check is the existing NSS user/group database, and not subuid/subgid.

Now you're dancing around the problem: We currently have a mechanism to delegate id ranges to unpriv users. The db registration is about registering ranges and that proposal is fine. But we still need a way to delegate ranges.

You're not really suggesting that we start shipping custom suid idmap binaries alongside every runtime when we have newidmap to avoid just that?

Well, if you use a suid binary that's up to you. Major distros have the goal to minimize the number of suid binaries, and in that context it might be a much better idea to use something that uses some ipc priv separation instead. But the point I am making is this: secondary databases that no one but the tool owning them checks are exercises in making UID collisions happen. The problem of dynamic UID registration is not specific to userns, and the tool newuidmap with another db in /etc might not be the ultimate solution to even the userns case.

Now you're dancing around the problem: We currently have a mechanism to delegate id ranges to unpriv users. The db registration is about registering ranges and that proposal is fine. But we still need a way to delegate ranges.

Do we though? Why do you want static delegation of ranges at all? I mean, podman could have a tiny ipc service (or suid binary if you want) that has one operation: "pick a free uid range that is currently not defined in the NSS user database, register it there, then chown these files with them and initialize the uid_map of that process with them". And there you go: everything is properly registered, fully dynamic, without collisions, without maintaining a static database, without writing to /etc...

Why maintain a static database (and propagate them through the network) when you don't have to?
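
Purely as an illustration of that flow (not anything systemd or podman actually ships), a privileged helper could probe the regular NSS user/group database for an unused 64K block roughly like this. The region scanned and the block size are made up, the registration and uid_map/chown steps are omitted, the probe is not transactional, and with a network NSS backend this naive per-ID probing would of course be expensive:

#include <sys/types.h>
#include <pwd.h>
#include <grp.h>
#include <stdio.h>
#include <stdbool.h>

#define BLOCK 65536u

/* Is any UID/GID in [start, start+BLOCK) already known to NSS? */
static bool block_is_free(uid_t start)
{
	for (uid_t id = start; id < start + BLOCK; id++)
		if (getpwuid(id) != NULL || getgrgid((gid_t) id) != NULL)
			return false;   /* someone already owns this ID */
	return true;
}

int main(void)
{
	/* Scan an arbitrary, invented region of the 32-bit ID space. */
	for (uid_t start = 0x10000000u; start < 0x20000000u; start += BLOCK) {
		if (block_is_free(start)) {
			printf("candidate range: %u-%u\n",
			       (unsigned) start, (unsigned) (start + BLOCK - 1));
			/* ...register it via our own NSS module, chown, write uid_map... */
			return 0;
		}
	}
	fprintf(stderr, "no free 64K block found in the scanned region\n");
	return 1;
}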

So what we could do is

  1. implement getuidmap and getgidmap binaries.
  2. consider adding a small libshadow which would define add_uidmap, add_gidmap, get_uidmap, and get_gidmap C library functions; all other languages could then trivially add mappings. Of course the caller would have to be already privileged for the add/set functions (a rough sketch of such a header follows).
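
A rough sketch of what such a header could look like; every name and signature here is a guess for discussion, not an existing shadow interface:

/* Hypothetical shape of the small libshadow suggested above. */
#include <sys/types.h>
#include <stddef.h>

struct uid_mapping {
	uid_t inside;         /* first ID inside the namespace     */
	uid_t outside;        /* first host ID it maps to          */
	unsigned long count;  /* length of the contiguous mapping  */
};

/* Query the mappings recorded/allowed for a user. */
int get_uidmap(uid_t user, struct uid_mapping **maps, size_t *nmaps);
int get_gidmap(uid_t user, struct uid_mapping **maps, size_t *nmaps);

/* Record new mappings; callers are expected to already be privileged. */
int add_uidmap(uid_t user, const struct uid_mapping *maps, size_t nmaps);
int add_gidmap(uid_t user, const struct uid_mapping *maps, size_t nmaps);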

I'll refrain from commenting on the argument between adding functionality to existing baroque privileged programs versus adding small focused standalone privileged helpers.

The problem with dynamic mapping is that you could end up with unowned files on disk if the mapping does not persist, but the on-disk files do persist. Is there a distinction to be made between a "dynamic" mapping and an "ephemeral" mapping?

Don't we have this issue now? There is no tool like ls -l that figures out that UID=100000 is owned by dwalsh since there is an entry in /etc/subuid

dwalsh:100000:65536

This should be no different if this file is distributed from LDAP or ActiveDirectory or ...

BTW has there been any movement on this?

Don't we have this issue now? There is no tool like ls -l that figures out that UID=100000 is owned by dwalsh since there is an entry in /etc/subuid

Indeed. I as an administrator can look in /etc/sub{u,g}id to see which user the "unowned" file belongs to, though it would be convenient if ls -l also knew to look there.

dwalsh:100000:65536

This should be no different if this file is distributed from LDAP or ActiveDirectory or ...

Agreed. It would be different if the sub{u,g}id ranges were generated on the fly in an ephemeral, non-deterministic manner, as happens in my understanding of how systemd DynamicUsers work. systemd gets around this by a combo of

http://0pointer.net/blog/dynamic-users-with-systemd.html :

  1. Prohibit the service from creating any files/directories or IPC objects
  2. Automatically remove the files/directories or IPC objects the service created when it shuts down.

That approach wouldn't necessarily work for container use cases that require persistent storage, or for containers expected to survive a host reboot.

shadow shouldn't enforce a policy decision like that when it comes to id space segmentation. It can make best-practice recommendations, but that's it.

I'm not sure we can utilize nsswitch without coordinating with glibc, but if glibc did add support for new keywords (like "subuid" and "subgid"), then that seems like a design that would work.

@brauner so with your glibc hat on, what do you think of this? :-)

Yeah, that would make sense to me. But for that we should get @fweimer's opinion.
@fweimer, if you have a few minutes over the next few days would be great to hear your thoughts. :)

@brauner ok maybe he'll be more interested in looking at code :) I'll think about doing that (though wouldn't be until after next weekend)

It's not really clear to me why this needs to live in glibc. I still think the entire feature is misdesigned and will not work in environments which traditionally use network-based NSS modules for user management. The main problems I see are the limited size (in bits) of the UID/GID space, a perceived need to tightly control UID allocations for compliance reasons, and a lack of isolation between containers that shared user IDs bring with them.

We already have sudo and autofs which have their own service loaders configured by /etc/nsswitch.conf.

I wrote this as a follow up to Dan Walsh's request for some kind of movement here. I haven't moved anything, all I'm doing is trying to summarize a position from the comments and use cases. I do provide some of my own thoughts after a year of going back and forth with @fweimer about this design.

Summary:

I still don't see a strong rationale for adding a complex API and ABI to glibc that would only benefit a very narrow container-specific use case. There is a lot of value in using glibc's existing NSS infrastructure and iterating on a functioning design until we have something working that everyone agrees is meeting user requirements. At that point we could discuss standardizing it within a core library for further maintenance and better overall integration. I sketch out a design below but it's basically a library/daemon and an NSS plugin in shadow which does what everyone needs.

Details:

(1) Network-based NSS environments.

I see no reason why this wouldn't work in a network-based NSS environment. The request is that the initial subuid/subgid data query would be resolved from say LDAP or FreeIPA, and so the values used in the mapping, the derived UID/GID, would be coming from the same network-based configuration that would have otherwise been used in the lookup within the container. One hopes that these values would be self-consistent.

(2) Static vs. Dynamic allocation or Tight vs. loose control of UID allocation.

I do think there is a use case for the tight control of UID/GID allocations, and that dynamically allocating those attributes is going to cause enterprise policy compliance issues. That doesn't mean we can't support both static and dynamic assignment. We should not make decisions that exclude one of these modes of operation.

(3) Limited size of UID/GID space.

The limited size (in bits) of the UID/GID space is a limit that we cannot increase easily and has always been there from the beginning; it is a policy issue for administrators to decide how many UIDs or GIDs a user needs within the container. Also, as a policy issue, it may be possible for an administrator to allow overlap, but only they know this in advance, e.g. groups that don't share any physical infrastructure.

(4) Lack of isolation.

I agree that shared IDs bring a lack of isolation, but so do all shared mounts. I don't see this as a limiting factor. We should not artificially limit what users of our systems can do with their infrastructure.

(5) High maintenance cost and slow iteration.

The initially suggested solution, that of adding a new NSS database for subuid and subgid, would entail the addition of a generic API/ABI for subuid/subgid which would subsequently be used only for a very narrow container-specific use case (specific mapping for UID/GID in the case of CLONE_USERNS). Florian states "It's not really clear to me why this needs to live in glibc" and I echo that general sentiment. Lennart points out that all of this can be accomplished by putting an NSS service module in the container that provides everything you need.

Placing these APIs and ABIs in glibc will impose the requirements of a core library (strong backwards compatibility) and that will complicate subsequent design changes. Iterating on the design should be our first priority when designing something as new as this.


Let me expand on Florian's idea a bit with a few bullet points (@fweimer correct me if I expanded your points along the wrong lines):

  • Add subuid/subgid databases to /etc/nsswitch.conf (unused by glibc)
  • Design IPC to privileged daemon or library (as @hallyn suggests) and database file which supports getting subuid/subgid information from whatever it sees as the configured module in /etc/nsswitch.conf (like autofs or sudo).
    • The IPC or library API can be iterated over quickly to provide such things as queries for registered ranges as @brauner suggests.
  • shadow provided NSS service module run on the host provides proper UID/GID results for files created with the container mappings (talks to daemon via IPC, or via database file)
  • shadow's same NSS service module run in the container provides proper UID/GID results for files created with the container mappings (talks to daemon via IPC, or via database file) as @poettering suggests.
  • newuidmap and newgidmap use IPC to daemon or library with a data file to get mapping information.

... then you'd have one set of suid binaries used by any container runtimes that want to use them, one daemon/library to query for this information, one location the information comes from, and you would still use /etc/nsswitch.conf for centralized configuration of similar service provider information.

In summary:

  • We should iterate the API quickly, use standard interfaces via an NSS module, and expand that to support LDAP/FreeIPA, in whatever way we plan to store that information there.
  • Eventually move this to glibc if we see stable design evolve and we want better overall integration with NSS.

So, the original post by @rhatdan said:

But we are now being contacted by "enterprise" customers who use large databases of users, and they want these files and UIDMap information to be handled via LDAP or FreeIPA

How do they want to use this information in LDAP? Given that there are at best 65k (minus 1) allocations of 65k each, would these enterprise customers need more than that? I.e. are we better off looking at using shared uid ranges which are separated using MCS (especially after SELinux namespaces are completed), or something like that?

Or will the number of subuids be enough?

I'd really like to hear more about what a service across the network would need this information for. I.e if we want to launch a container on host X from some OCI image, the cluster scheduler shouldn't care about the subuids. It should be able to simply ask host X to fire off a container using the OCI image. Host X's runtime can then choose a subuid range to run the container in, using shiftfs to map in any shared files, and hand unshifted files to cluster services which are gathering results.

In any case, is anyone interested in starting a POC patch for NSS?

Basically Podman/Buildah use newuidmap and newgidmap to set up a User Namespace. (A container is secondary.) Other tools in the future can do the same thing; that is what newuidmap and newgidmap were designed for.

Currently newuidmap and newgidmap ONLY read the local /etc/subuid and /etc/subgid files for the UIDs a particular user is allowed to set up in a User Namespace.

Bottom line, I would like to extend newuidmap and newgidmap to be able to retrieve this mapping from a network datastore. Users of Podman/Buildah want to be able to distribute this information on their networks. Think universities with distributed use of containers in their environment.

Since there is a limited number of UIDs/GIDs, at 65536 per container you could only allocate 65536 ranges:

4294967296 / 65536 = 65536

Of course this number is probably a lot larger than required; most user containers would work fine with a couple of thousand UIDs.

I don't want to tie this tool to any other tool like MAC, since obviously users disable SELinux and other MAC tools.

I am not tied to nsswitch to do this. If newuidmap/newgidmap would talk to sssd or systemd to get this information, and sssd or systemd talked to centralized datastores like LDAP/FreeIPA, it would be fine.

Bottom line, we have tools and features of the OS that take advantage of User Namespaces, which use UIDs/GIDs. Almost all other UID/GID databases built into the base OS, and more specifically shadow-utils, are available via the network using sssd, LDAP, and FreeIPA, except the content of /etc/subuid and /etc/subgid.

@ebeiderman

abbra commented

If newuidmap/newgidmap provided a pluggable interface that can be used to load a specified dynamic module, we (FreeIPA/SSSD) can deliver that information. SSSD also has support for a local user database and is capable of storing overrides/additional attributes for each of those users, so storing subuid/subgid maps is not a problem even for the local configuration. On the FreeIPA side, adding a new type of map and associating it with a user/group is definitely doable.

abbra commented

From the practical side, SSSD has a library, libsss_nss_idmap, that provides a number of extended interfaces to query information about a user or group from SSSD's cached database. Among others, it has this function:

/**
 * @brief Find original data by fully qualified name
 *
 * @param[in] fq_name  Fully qualified name of a user or a group
 * @param[out] kv_list A NULL terminate list of key-value pairs where the key
 *                     is the attribute name in the cache of SSSD,
 *                     must be freed by the caller with sss_nss_free_kv()
 * @param[out] type    Type of the object related to the given name
 *
 * @return
 *  - 0 (EOK): success, sid contains the requested SID
 *  - ENOENT: requested object was not found in the domain extracted from the given name
 *  - ENETUNREACH: SSSD does not know how to handle the domain extracted from the given name
 *  - ENOSYS: this call is not supported by the configured provider
 *  - EINVAL: input cannot be parsed
 *  - EIO: remote servers cannot be reached
 *  - EFAULT: any other error
 */
int sss_nss_getorigbyname(const char *fq_name, struct sss_nss_kv **kv_list,
                          enum sss_id_type *type);

Calling it would return you a set of predefined attributes known by SSSD (name, uid, gid, gecos, home dir, shell, expiration, certificate, ssh public key, email, DN and so on) if they were cached. We can add a variant of this call that would allow explicitly asking for a specific subset of attributes -- the plumbing is already there and is used by SSSD's infopipe interface exposed over D-Bus.

It needs a bit of work to make this info available in a default configuration but nothing too fancy.

$ cat sss-idmap-test.c 
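/* Presumably built with something like: gcc sss-idmap-test.c -lsss_nss_idmap
 * (the libsss_nss_idmap development package is assumed to be installed). */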
#include <stdio.h>
#include <strings.h>
#include <string.h>
#include <sss_nss_idmap.h>

int main(int argc, const char** argv) {
	struct sss_nss_kv *kv = NULL;
	enum sss_id_type type;
	int result = sss_nss_getorigbyname(argv[1], &kv, &type);

	switch (result) {
	case 0:
		for (size_t i = 0; kv[i].key != NULL; i++) {
			printf("%s: %s\n", kv[i].key, kv[i].value);
		}

		sss_nss_free_kv(kv);
		break;
	default:
		printf("error: %d\n", result);
	}

	return result;
}

This would be fine with me, if it was ok with the Shadow-utils guys?

Does 'a pluggable interface' just mean a .so implementing a particular function?

If so then I think that sounds good to me.

abbra commented

@hallyn yes, a .so to implement the interface would be enough.

Thanks, unless someone else wants to do so, I'll look at implementing that in the next few days.

For SSSD users, could the subuid and subgid ranges be calculated similarly to the UID and GID? That is, they would be globally unique, as a function calculated from the user's AD SID?

No, I don't think so, since there is only a total of 4 billion UIDs.

@hallyn Or any sssd people, has anyone moved on this?

For sssd, ranges could be calculated for 4M users, if only 1024 uids per user were allocated.

These could be used for 0-1004,1998-2002,4998-5002,65526-65534 or similar, to cover most existing containers, or could be allocated as-needed per container. Likely no container would actually use more than 256 UIDs.

mheon commented

While the actual numbers of UIDs in use are very low, they end up being quite scattered; we absolutely need a number greater than 1000, so we can support root, system users, and a few user-created users above 1000. It would be interesting to go through commonly-used container images and identify the maximum UID in use by each; I suspect that around 2048 would allow most container images, but I do see 655xx UIDs occasionally (anecdotally, they seem more common than UIDs >65536 that we already do not support with the default amount of UIDs we allocate).

Hi,

I would prefer to store the data in the LDAP user objects of the related users.

While in theory it would be possible to let SSSD assign the ranges, and there would even be room in the UID space because currently SSSD tries to avoid assigning IDs between 2^31 and 2^32 (2 billion - 4 billion), I'm not sure if this would be practical.

We would need two new configuration options: one which defines the allowed login UIDs, and one for the range the subordinate IDs should come from and the number of subordinate IDs for each login UID, plus the related options for GIDs. Then each login UID can be assigned a range of subordinate UIDs and GIDs (if the range for subordinate IDs is large enough). With these config options SSSD would be able to reproducibly assign the subordinate IDs even after all temporary data (e.g. SSSD's cache) are lost, and it would work on multiple hosts as well, as long as the same config options are used.

The pain point I can see here is how to specify the allowed login UIDs, since they can be quite scattered as said before. So in the worst case this would be a list of a couple of hundred or even thousands of UIDs (or names, to make admins' lives easier, but if the users are coming from AD you need fully-qualified names, i.e. including the domain component, to make sure this works properly in a forest). Of course allowed login UIDs can be specified by ranges as well, but if this range gets too large there might not be sufficient space left for the subordinate ID range and the required number of subordinate IDs per allowed UID.

Given that, it seems more flexible and more straightforward to me to manage the subordinate IDs in the LDAP user objects.

HTH

bye,
Sumit

The pain point I can see here is how to specify the allowed login UIDs, since they can be quite scattered as said before. So in the worst case this would be a list of a couple of hundred or even thousands of UIDs (or names, to make admins' lives easier, but if the users are coming from AD you need fully-qualified names, i.e. including the domain component, to make sure this works properly in a forest). Of course allowed login UIDs can be specified by ranges as well, but if this range gets too large there might not be sufficient space left for the subordinate ID range and the required number of subordinate IDs per allowed UID.

The subuid range should be contiguous, and a fixed-size block per user. I was only referring to how that per-user block could be efficiently used by a container engine, by splitting it within a container, which is out of scope for both shadow and sssd.

Will this relate to systemd-homed varlink API? https://systemd.io/USER_GROUP_API/

We have been discussing this more, and the more I think about it, the more I believe we need an API for these files, preferably supplied by glibc. We have been talking about the problem of getting a range of UIDs for the user from a remote site, but as we use this more and more, I realize that there is a hole in the backwards lookup that might even be considered a security hole.

We now have files in a user's homedir for which the standard tools find and ls cannot identify where they came from.

If I have a file owned by UID 100002 in my homedir and run ls on it, I have no idea this came from a container. Another big use case would be the audit subsystem. If a process in my user namespace goes and triggers an audit event on the system and the administrator looks to see who owned the process that created the audit event, there is no standard way to trace that audit record back to a process owned by dwalsh.

We could end up with lots of tools building their own ways to read these files. Podman, Buildah, systemd, and newuidmap/newgidmap already do this. We could end up with findutils, coreutils, util-linux, audit, procps, and any other tool that looks at processes or files and wants to reverse map who owns these objects.

At least if we had a standard library that could look up this data, I could start bothering the low-level tools to reveal this information.

You also need to know which container created the file owned by the UID 100002 since interpretation of these UIDs is container-specific. For some tools, mapping them to a user-specific range is not enough.

I think what we eventually need is support for stacking user IDs directly in the file system. So from the host perspective, these files will have the primary user ID of the user, but once the user enters the appropriate namespace, that user ID becomes invisible and the user ID beneath is revealed. There are reservations about implementing this in the kernel (due to performance and complexity), but it will happen eventually. At that point, the range mapping becomes a legacy interface, so I don't think it will see long-term use (and it takes three years or more until new glibc interfaces land in distributions, so this is really not the right venue for this).

No, you did not miss anything. But the currently proposed alternatives look so hackish to me that I just can't see that they will last for years to come.

Here's the part your email notification might have missed.

..." becomes a legacy interface, so I don't think it will see long-term use (and it takes three years or more until new glibc interfaces land in distributions, so this is really not the right venue for this)."

Well they have been working on the auditing problem for years for tracing containers, but I see this is a much more tangible problem.

I can simply do:

$ podman unshare sh -c "mkdir baddir; touch baddir/badfile; chown 1:1 -R baddir"; ls -l baddir/badfile; rm baddir/badfile
-rw-r--r--. 1 100000 100000 0 Mar 27 06:02 baddir/badfile
rm: remove write-protected regular empty file 'baddir/badfile'? y
rm: cannot remove 'baddir/badfile': Permission denied

Now I have a directory and a file in my homedir for which it is not easily identifiable who/what created them.

Just running podman for a while, I find this many files in my homedir that are not owned by me:

find ~/.local/share/containers ! -uid 3267 2> /dev/null | wc -l 
25468

I had to pipe errors to /dev/null because of all of the errors in directories I could not even examine:

find: ‘/home/dwalsh/.local/share/containers/storage/overlay/cc4590d6a7187ce8879dd8ea931ffaa18bc52a1c1df702c9d538b2f0c927709d/diff/var/cache/apt/archives/partial’: Permission denied

I agree that the containers tend to be stored in the same directory. But volumes are not; they can be mounted from anywhere, including /tmp. And content can be created in these directories by the non-root (user UID) user.

The Auditing patch has been worked on for YEARS. I am not confident that it will get merged soon.
And I am not sure this is easily mapped back to the user who launched the namespace.

Being able to examine a file or process on the system and know that it is owned by dwalsh or dwalsh(UIDS) is very important...

@rhatdan while 'volumes' (I assume this is docker parlance, and you mean a bind mount?) can come from anywhere, I'd consider it unsafe to do that without any sort of structure. After all, if two containers are sharing the same root user subuid, then one can make a setuid exploit for another.

Furthermore, I suspect most people bind their host uid into the container. So the container can create files which appear to have been created by the user.

So, earlier you said

We could end up with lots of tools building their own ways to read these files. Podman, Buildah, systemd, and newuidmap/newgidmap already do this. We could end up with findutils, coreutils, util-linux, audit, procps, and any other tool that looks at processes or files and wants to reverse map who owns these objects.

At least if we had a standard library that could look up this data, I could start bothering the low-level tools to reveal this information.

Agreed. I'm happy to create a tiny libshadow or libsubuid to do the local version of this, then (once we all agree on the API) we can revisit extending it to the network. I'll write a straw man this weekend or Monday.

I've pushed a strawman api with a single function to start with at https://github.com/hallyn/shadow/commits/libsubid .

#ifndef SUBID_RANGE_DEFINED
#define SUBID_RANGE_DEFINED 1
struct subordinate_range {
	const char *owner;
	unsigned long start;
	unsigned long count;
};

enum subid_type {
	ID_TYPE_UID = 1,
	ID_TYPE_GID = 2
};

#define SUBID_NFIELDS 3
#endif

int subid_get_ranges(char *owner, struct subordinate_range ***ranges, enum subid_type which);
void subid_free_ranges(struct subordinate_range ***ranges, int num_ranges);

Let the bikeshedding begin :)
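
For illustration, a caller of this strawman might look roughly like the following. It assumes the int return value is the number of ranges found (negative on error) and that the declarations above live in a hypothetical "subid.h"; neither of those details is spelled out by the strawman itself.

#include <stdio.h>
#include <stdlib.h>
#include "subid.h"   /* hypothetical header carrying the declarations above */

int main(int argc, char **argv)
{
	struct subordinate_range **ranges = NULL;
	const char *owner = argc > 1 ? argv[1] : "dwalsh";

	/* Assumption: the return value is the number of ranges, < 0 on error. */
	int n = subid_get_ranges((char *) owner, &ranges, ID_TYPE_UID);
	if (n < 0) {
		fprintf(stderr, "subuid lookup failed for %s\n", owner);
		return EXIT_FAILURE;
	}

	for (int i = 0; i < n; i++)
		printf("%s: %lu-%lu\n", ranges[i]->owner,
		       ranges[i]->start, ranges[i]->start + ranges[i]->count - 1);

	subid_free_ranges(&ranges, n);
	return EXIT_SUCCESS;
}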

Well, most volumes in the Podman/Docker world are bind mounts, yes. So basically doing something like

$ mkdir /tmp/db; podman run -v /tmp/db:/var/lib/mariadb mariadb

Could create content owned by the mariadb UID inside of the container controlled by dwalsh.

For your API I would prefer a function that did something like

subid_getown_byuid(uid_t uid, char **owner)
subid_getown_bygid(gid_t gid, char **owner)

For your API I would prefer a function that did something like

subid_getown_byuid(uid_t uid, char **owner)
subid_getown_bygid(gid_t gid, char **owner)

Thanks, Dan - to be clear, you mainly mean drop the extra id_type argument?

No, you don't ...

what's the return type of those?

/etc/sub*id accepts both the user name and the user UID (foo:100000:65536 and 1000:100000:65536).

I think it is better if we use the UID instead of the user name. In this way the caller doesn't have to worry whether the output is a user name or a UID that must be directly parsed.

The API could be something like:

int subid_get_subuid_owner(uid_t uid, uid_t *owner);
int subid_get_subgid_owner(gid_t gid, uid_t *owner);
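
Just to illustrate the reverse-lookup use case, an ls-like tool could (hypothetically, since the call does not exist yet) combine the proposed function with the ordinary passwd lookup along these lines; the 0-on-success convention is an assumption.

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <pwd.h>

/* Proposed call from the comment above -- it does not exist yet, so this is
 * only a sketch of how a consumer might use it. */
extern int subid_get_subuid_owner(uid_t uid, uid_t *owner);

int main(int argc, char **argv)
{
	struct stat st;
	uid_t owner;
	struct passwd *pw;

	if (argc < 2 || stat(argv[1], &st) != 0) {
		perror("stat");
		return 1;
	}

	/* If the file's UID falls inside someone's subuid range, report that user. */
	if (subid_get_subuid_owner(st.st_uid, &owner) == 0 &&
	    (pw = getpwuid(owner)) != NULL)
		printf("%s: uid %u is in the subuid range of %s\n",
		       argv[1], (unsigned) st.st_uid, pw->pw_name);
	else
		printf("%s: uid %u has no subuid owner here\n",
		       argv[1], (unsigned) st.st_uid);
	return 0;
}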

Yes, I agree with Giuseppe. I need to get back which UID owns a particular file; then I can call in and translate it to a real user.

ls -l containerfile.txt
To be able to show that it is owned by dwalsh

@giuseppe - why do you think it's better to accept only UID, instead of either username or UID?

I'll change the function to return an array of subordinate_id structs, ending with a NULL entry.

Oh. You're asking for a different function. Yes, that can be added, and I expect to also add "allocate the next unused subid range of size sz to uid N".

Yes, that would be nice also. With containers we are attempting to preallocate a huge range and then just playing in that range (the 2B-4B range). But being able to do this programmatically would be nice.

abbra commented

This will not work well with centralized identity systems. We cannot request (or grant) a slice of the UID/GID space centrally from an unprivileged client. On the other hand, if that space is ephemeral, it can be allocated locally by SSSD and maintained there.

@abbra I'm not quite following. Why would it not work well with a centralized identity system, be it ephemeral or longer term? Of course the system should be privileged, but that's implied by its being centralized.

Perhaps if you could elaborate on how you imagine using this, it would help me understand.

In the next few days I'll update my branch with a more complete API, and I'll post a "[WIP]" PR.

abbra commented

@hallyn the way I read @rhatdan's comment

With containers we are attempting to preallocate a huge range and then just playing in that range (the 2B-4B range). But being able to do this programmatically would be nice.

is that this happens on a runner where the container runtime is executing a container instance. That system is unprivileged from the perspective of a centrally managed IdM deployment. In order to allocate the UID/GID space in the IdM, one would need to have enough privileges, and it is unlikely that such privilege information could be passed from the application that issues this request through the library you are implementing.

There are potentially two competing use cases for the UID Mappings.
The one you are concerned about is users logging in with a shared IdM for any user that logs into the system. In this case Podman/Buildah are just running in the user's user namespace and need a range in order to support multiple UIDs in their homedir.

The other use case we are looking at is a root running process that is creating lots of containers (Kubernetes&CRI-O or root running # podman run --userns=auto ...) where we just want every container to run in a different user namespace for security reasons.

In the first use case, you might also want to use user namespaces for separation of containers launched by the user, but you usually have a much smaller base of UIDs to work with.

For the first case, the login manager could create a user namespace (with privilege) on login, right?

For the second case user is already privileged.

If what you want is a shared pool from which unprivileged users simply 'borrow' uids on an ephemeral basis for each run, that would need to be something built on top of this (and would of course need to entail some way of clearing all files which were created with ephemerally assigned subuids on logout or at the end of the container run).

A shared pool from which subids can be checked out without any privileged help does not belong in shadow, at least not this simply. What I'm doing right now is writing the basic library which all tools could use to query and manipulate subid allocations without stepping on each other's toes. Maybe we'll even end up creating a subuid borrowing tool in shadow. And it might work hand in hand with the fsuid shifting that @brauner is working on.

Anyway I'll add the rest of the needed functions to the library, open the PR, and then we can discuss more there.

abbra commented

For the first case, the login manager could create a user namespace (with privilege) on login, right?
For the second case user is already privileged.

So both these cases still address a single machine, and that machine has no actual administrative rights in the central IdM system to add ID ranges. In a centralized setup, administration happens in advance and ID ranges get looked up and used on the machine.

For global, centralized storage we would expect:

  • have a central place to define cluster-wide ranges,
  • allow assigning them to users that can utilize them on any machine they are allowed to use those ranges on,
  • have a central place to define cluster-wide ephemeral ranges,
  • allow local use of ephemeral ranges that do not conflict with any cluster-wide ID ranges, regardless of which user they are locally assigned to

The step of defining and allocating those ranges would be separate from the consumption. I guess, to cover the two use cases mentioned by @rhatdan, we would basically need to be able to look up the ranges for a user at login for one, and have a shared 'borrowed' ephemeral range allowed for use on the machine for the other.

The first part can be added in FreeIPA in a way similar to how we added SELinux policies: there are rules that associate an SELinux policy with a user for a host/hostgroup, and they get applied on login. The difference with SELinux policies is that SELinux contexts were pre-defined by the distribution while here we want to have semi-dynamic allocation, but that's a technical difference; conceptually we deal with a similar beast.

User-specific ranges would be defined in advance, maybe on a user/privileged account request, based on the total known state of ID ranges in the centralized system. A space is carved out and associated with the user in FreeIPA once -- this might be a kind of self-service 'allocate-once' style for the user, then consumed by the login process everywhere.

Ephemeral ranges can be defined by admins for the whole centralized environment. They would be pulled by SSSD and thus be available for local use -- if we guarantee they are only used on the same machine, that should be fine. For cross-machine (NFS/SMB/etc) consumption we need user-specific ranges, I guess.

So, if a user doesn't have a range allocated, at login we can ask for one if the user has been authenticated in a way that can be presented to FreeIPA, and then ask for their own range. This could happen behind the scenes, but we need to have a context or a token that could be used by the backend (SSSD?) to request this allocation on behalf of the original user...

That all sounds good and can all be done on top of libsubid. But again, if subid ranges are assigned for just one login session at a time, then the files created during a login session will need to be taken care of at logout. This could get interesting.

I would think the range per user/machine would be permanent, since content can be added to the user's home directory, only to be freed when the user account is removed. Similar to the standard handling of UIDs.

An advanced feature in the future would be if a privileged process could register a range of UIDs with sssd for a particular container, or to have a way to register a call back from the greater range, so that

ls -l foobar
Indicates that this file is owned by the CTR3 container.

Then it sounds like we are agreed.

The API could be something like:

int subid_get_subuid_owner(uid_t uid, uid_t *owner);
int subid_get_subgid_owner(gid_t gid, uid_t *owner);

@pixelb would such an API work for coreutils? Do you think it would be possible to extend coreutils tools to also include sub-ID ownership?

currently it is:

struct subordinate_range **get_subuid_ranges(char *owner);
struct subordinate_range **get_subgid_ranges(char *owner);
void subid_free_ranges(struct subordinate_range **ranges);

int get_subuid_owners(uid_t uid, uid_t **owner);
int get_subgid_owners(gid_t gid, uid_t **owner);

And tonight I'm going to add a "reserve a new range for uid" call and then open a PR

just creating a link to this PR #250 that needs review.

Hm, I'm going to re-open this briefly. What do we want to do in terms of forwarding these over the network? Just leave it up to all individual callers/users? Provide a dbus service in shadow to wrap it? Something else?

(Looking back at the opening comment)

So let's say we create a new nss 'idmap' database to do owner <-> subid translations. The first could use libsubid to return local results, then a second could query over the network, I guess? How do we want that to look? Query LDAP in some new agreed-upon way?

Well, the original request was to make this data distributable across the network. The second request was to get a library so that we could translate these UIDs/GIDs on disk/in logs back to the OWNERs of the objects.

abbra commented

Right, @hallyn I think it is the other way around -- if you'd provide a way to plug into libsubid via a plugin, then we can supply a plugin that would use SSSD to deliver the data pullable from a centralized place.

@abbra what are you looking for to help plug into libsubid? Golang bindings? I'm not quite following...

abbra commented

@hallyn as I said, it is the other way around -- I'm looking at the ability to have a separate plugin inside libsubid that can provide ID mapping information, like we discussed in this issue. Right now your code only looks up the data from files in the file system. We discussed that and you agreed in #154 (comment).

(I'm not trying to argue, just not getting what you are saying). You had said:

If newuidmap/newgidmap provided a pluggable interface that can be used to load a specified dynamic module, we (FreeIPA/SSSD) can deliver that information

To which I replied:

Does 'a pluggable interface' just mean a .so implementing a particular function?

If so then I think that sounds good to me.

That's what libsubid provides.

Are you asking for an nss module? A new nss database?

Well, the original request was to make this data distributable across the network. The second request was to get a library so that we could translate these UIDs/GIDs on disk/in logs back to the OWNERs of the objects.

libsubid helps with both. It will tell you the owner(s) of a particular subid, and will tell you the ranges owned by a particular user.

This way, others can more easily expose the information over the network. I hadn't yet decided whether shadow should do that itself. But I think I'm ready to give in and do it over NSS. Would that suffice for your needs?

abbra commented

My understanding is that an application would link against libsubid to get information about sub IDs and to ask for allocating them. libsubid wouldn't know how to provide this information from a remote source, and I don't think you should be implementing that yourself. This is what I want to be able to plug in to change, so that in a centrally managed environment a plugin dynamically loaded by libsubid would redirect the operations to a centrally managed source instead of files.

More specifically, the API you have is currently only backed by lib/subordinateio.c, which only implements /etc/subuid and /etc/subgid processing. The API is very much file-oriented (lock/unlock, *_file_present(), etc.) and cannot be amended to redirect a request, say, to SSSD for lookup/modification.

I think this discussion already went through the NSS module option and the agreement was to have one. So maybe let's start with that: do an API implementation that loads modules specified in /etc/nsswitch.conf for subid: ... and have /etc/subuid / /etc/subgid handled via a files provider in an NSS module provided by shadow. Then SSSD and systemd could provide their NSS modules to complement the information, and we'll work between SSSD and FreeIPA on how to store and retrieve the information from the FreeIPA LDAP store.

Ok, I see. I thought that SSSD, or someone, was going to use libsubid to implement the network visibility.

I'll think through how best to write the module and ship it in shadow.

Ok, I see. I thought that SSSD, or someone, was going to use libsubid to implement the network visibility.

I'll think through how best to write the module and ship it in shadow.

Please note that, besides extending libsubid with the ability to support pluggable backends, it is also required to rework newgidmap / newuidmap (and other user-facing tools, if any) to make use of libsubid instead of the files-backed lib/subordinateio (which should rather be used as the basis for a "files" plugin).

If I have a file owned by UID 100002 in my homedir and run ls on it, I have no idea this came from a container.
We could end up with lots of tools building their own ways to read these files. Podman, Buildah, systemd, and newuidmap/newgidmap already do this. We could end up with findutils, coreutils, util-linux, audit, procps, and any other tool that looks at processes or files and wants to reverse map who owns these objects.

At least if we had a standard library that could look up this data, I could start bothering the low-level tools to reveal this information.

Is it ever realistic to make all those tools ("findutils, coreutils, util-linux, audit, procps and any other tool that looks at processes or files") use a new shadow-utils API (in addition to glibc NSS) to resolve the issue of "objects owned by unknown UID/GID"?

(I'm not sure if this is a sane idea but) could shadow-utils just (additionally) provide a new libnss_shadow.so (backed by libsubid) that would be added to the passwd and group lines of /etc/nsswitch.conf as a last resort and would serve getpwuid() calls for sub-ids, returning the owner id?
In your example "UID=100000 is owned by dwalsh" it would be:
getpwuid(100000)->pw_name == "dwalsh"
This, of course, would be a little bit weird since getpwuid(uid)->pw_uid != uid but IIUC this way no change is required to ls/etc to show proper owner names of objects owned by ids from sub-ranges...

I would want the UID to show something like dwalsh-userns for the user, or some other indicator to say which user namespace is the owner of this file.