cubanismo/allocator

How [de]centralized do we want to be?

Opened this issue · 22 comments

One of the resounding things that came out of the discussions at XDC was the mobile guys saying that they don't want to have to do contiguous memory allocation in their GPU driver. Instead, they wanted a central allocator (such as ION) that would understand all the constraints and be able to allocate an image. Both NVIDIA and Intel badly wanted to do the allocation inside their driver stack in the case of single-vendor interactions so that they could handle all their "magic".

Two options were discussed for resolving this:

  • Require the client to go around to each of the drivers until it finds one that can allocate its image
  • Have a central allocator (similar to gralloc) that understands everything

In the end, we settled on the second option (central allocator) with the understanding that it would have a bunch of pluggable back-ends and would internally walk around to the different back-ends until it found one that could allocate.

This answers the question for the final allocation step, but there are several more steps involved (roughly sketched in code below the list):

  • Convert GL/Vulkan/CL/v4l image specification to allocator spec (VkFormat -> allocator format, etc.)
  • Get allocation capabilities
  • Intersect capabilities and union constraints
  • Allocate memory (central)
  • Lay out the image (highly vendor-specific except for the 2D case)
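
Concretely, I imagine the flow looking something like this. Every name below is a placeholder for discussion, not a settled API:

    /* 1. Each driver converts its own notion of the image (VkFormat etc.)
     *    to an allocator format and reports its capabilities (decentralized). */
    capability_set_t *gpu_caps  = gpu_dev->get_capabilities(gpu_dev, fmt, usage);
    capability_set_t *disp_caps = disp_dev->get_capabilities(disp_dev, fmt, usage);

    /* 2. Intersect capabilities / union constraints. */
    capability_set_t *common = capability_set_intersect(gpu_caps, disp_caps);

    /* 3. Allocate from the central allocator, which walks its back-ends. */
    allocation_t *alloc = liballoc_allocate(common, width, height);

    /* 4. Image layout stays with whichever vendor back-end won the allocation. */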

My preference would be to keep the "get allocation capabilities" portion decentralized so each driver has an entrypoint for it. Why? Because it means that in the common case of an app and a compositor, the only thing that absolutely has to monkey with the allocator is the compositor. The client can just send capabilities and receive ready-to-import images.

@jekstrand fyi, have a look at https://github.com/robclark/allocator/blob/master/USAGE.md .. I still need to update it for the capability_set_t stuff (ie. keep constraints coupled to capabilities), but it does at least show what I had in mind for the "go around and ask each device to allocate" idea. Where one of the devices could possibly be ION for the weird mobile use-cases. But I think this approach would let us keep allocation in the GPU driver on more desktop(ish) systems.

Basically it is the "pluggable backends" idea, where the pluggable backends are in userspace.
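
As a strawman, the loop inside such an allocate call could be as dumb as this (names hypothetical, matching the sketch earlier in the thread):

    /* Central allocator: try each pluggable userspace back-end (GPU driver,
     * generic v4l helper, ION, ...) until one can satisfy the capability set. */
    allocation_t *liballoc_allocate(capability_set_t *caps,
                                    uint32_t width, uint32_t height)
    {
        for (unsigned i = 0; i < num_backends; i++) {
            allocation_t *a = backends[i]->allocate(backends[i], caps,
                                                    width, height);
            if (a)
                return a;   /* first back-end that can allocate wins */
        }
        return NULL;        /* no back-end could satisfy the constraints */
    }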

(fyi, also some comments at #3)

How do you know which devices are involved in the capabilities negotiation?
Can you describe how this could work, for example, in GStreamer?

@Benjamin-Gaignard I think that is a question for GStreamer devs.. I was planning to ask Wim to have a look, but probably not until the concept is a bit more concrete. At least in the gst case, everything is in a single process. Maybe a more interesting question is how this ends up working across processes (for example wayland, plus pinos to separate an app in a flatpak/snap/etc container from a helper process with access to the camera).

Short version: some collection of userspace processes are doing something.. so they must know what they are doing. The extra APIs and protocols that need to be developed are left as an exercise for the reader for now ;-)

With dma-buf we already have a cross-process and cross-device sharing mechanism (at least for v4l2 and drm). We can also know which devices are attached to a buffer.
@robclark I'm not telling you anything new here, since you actively participated in dma-buf's creation :-)

dma-buf support in GStreamer and Wayland is progressing, and it is widely used in Android.
With deferred allocation (map_attachment is called after attach) we can find the best way to allocate a buffer without adding new protocols.
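
For reference, here is what that deferred-allocation pattern looks like on the exporter side, in kernel-style pseudo-C. The dma-buf entry point is the real one; struct my_buffer and the helpers are invented for this sketch:

    /* Nothing is allocated at attach time; the real allocation happens on
     * the first map_attachment, once every dma_buf_attach() has been seen. */
    static struct sg_table *my_map_dma_buf(struct dma_buf_attachment *attach,
                                           enum dma_data_direction dir)
    {
        struct my_buffer *buf = attach->dmabuf->priv;

        if (!buf->backing) {
            /* Walk the attachments gathered so far and pick the most
             * restrictive placement (contiguous if any device needs it). */
            buf->backing = my_allocate_for_attached_devices(attach->dmabuf);
        }
        return my_map_backing(buf, attach->dev, dir);
    }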

@Benjamin-Gaignard well, the complication is when two devices can exchange vendor compressed/tiled formats, possibly with associated pitch constraints, etc.. that sort of information is not available in the kernel.

I think we need to add a list of memory constraints to the device structure, with get/set functions.

Those constraints could be implemented the way metadata is in GStreamer:
https://cgit.freedesktop.org/gstreamer/gstreamer/tree/gst/gstmeta.h#n180

struct constraints {
    struct list_head list;
    int type;   /* identifier of the constraint */
    int (*init)(struct constraints *c);
    int (*free)(struct constraints *c);
    /* merge constraints a and b into one new structure;
     * how this works depends on the data type */
    struct constraints *(*merge)(struct constraints *a, struct constraints *b);
};

We can then get the constraints of all the devices attached to a dma-buf and merge them before trying to find a compatible allocator.

An example of a constraint could be the number of possible entries in the sg_table for a buffer:

struct segment {
    struct constraints base;
    int nb_segments;
};

with a merge function like this one (pseudo-C):

struct constraints *segment_merge(struct segment *a, struct segment *b)
{
    struct segment *c = kzalloc(sizeof(*c), GFP_KERNEL);

    c->nb_segments = min(a->nb_segments, b->nb_segments);
    return &c->base;
}

On my platform, in the camera preview use case, I have a USB camera behind an IOMMU, so let's say its nb_segments is huge (0xFFFF), but my display needs contiguous memory, so its nb_segments = 1.
After merging those two constraints I get a constraint where nb_segments = 1, and I am able to select an allocator that fits.

We can do the same with memory ranges, tiling, etc...
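
Plugged into the pseudo-C above, that preview example would go something like:

    struct segment camera  = { .nb_segments = 0xFFFF }; /* USB camera behind an IOMMU */
    struct segment display = { .nb_segments = 1 };      /* scanout needs contiguous */

    struct constraints *merged = segment_merge(&camera, &display);
    /* merged ends up with nb_segments == 1, so a contiguous
     * allocator (e.g. CMA) is the one to pick */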

@Benjamin-Gaignard I did suggest a fairly simplified version of that quite some time back, to deal with the simple "do I need contiguous or not" case. But there are more complex constraints for which the knowledge simply does not exist in the kernel, like pitch/stride constraints for various formats.

@robclark in GStreamer and Android, pitch/stride/alignment are already handled per format.
GStreamer does that in caps negotiation, while Android/gralloc hard-codes it for each platform; I don't think we need to reinvent that, right?
Constraints in the kernel should help solve where the allocation should be done (which memory bank, which memory region, contiguous or not).

@Benjamin-Gaignard At least if you are dealing with dynamic GStreamer pipelines, you don't have all the devices that may take part in the buffer sharing attached to the dma-buf up front. That means if you can't afford to drop the frame (because the source isn't going to produce a new one for a while), you need to actively resolve the buffer into something the newly attached sink can digest. This might be a complex operation (a GPU blit or whatever) that really shouldn't be the business of the kernel, but of the userspace driving this hardware.

@jekstrand I think allocation should really be decentralized as well. After caps/constraints have been intersected, you go around asking the back-ends which of them is able to allocate with that intersection. This means we might need to smarten up some of the underlying device drivers to understand foreign constraints, but that's really more an implementation detail than anything serious.

From a kernel driver writer's PoV, it would be nice to have some central kernel-internal thing that extends your driver's ability to allocate memory with foreign constraints. But that part should certainly not be exposed to userspace, beyond e.g. the GPU driver claiming the ability to allocate contiguous memory.
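
Something like the following kernel-internal helper is what I have in mind; the name, the signature, and all the callees are invented purely for illustration:

    /* Hypothetical, kernel-internal only, never uapi: a driver asked to
     * allocate with foreign constraints delegates the placement decision. */
    struct sg_table *dev_alloc_constrained(struct device *dev, size_t size,
                                           const struct constraints *c)
    {
        if (constraint_requires_contig(c))        /* made-up predicate */
            return alloc_contig_table(dev, size); /* e.g. CMA-backed */

        return alloc_sg_table(dev, size);         /* scattered pages are fine */
    }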

@lynxeye-dev, Your comment just made me realize that I've been making assumptions about what I mean by "decentralized". I think we have 4 different levels of "centralization":

  1. Kernel (yes, I know, there are multiple levels inside the kernel but I'm ignoring that for now)
  2. Userspace liballoc core
  3. Userspace liballoc driver back-ends
  4. Userspace GL/Vulkan/CL/etc. driver

I think what I'd prefer is to have at least caps/constraints querying all the way down (or up?) in the GL/Vulkan/etc. driver. Initially I wanted to put everything there, but the SoC use-case has convinced me that we need to have at least the final allocation in something more centralized. Obviously we can't 100% centralize it (because we need vendor plugins) but we don't want clients to have to walk around looking for someone to allocate.

@jekstrand Agreed on the various levels of centralization. I would argue that your list above should probably swap entries 2 and 3, so it's sorted from farthest away from normal application usage (the app arguably should not talk to the kernel directly) to what an application should use in the common case.

I understand your desire to have caps/constraints querying at the accel API level, but I don't have a good idea how this would work for other devices we might want to integrate with that don't even expose this API level, like V4L. The highest API level exposed by V4L itself is the kernel level, we could have a common driver backend sitting on top of that, but even then the highest API level exposed would be the liballoc core level.

I'm leaning toward the notion that the application (even the compositor) may not want to talk to liballoc core if there is a higher API level available. So allocating a Vulkan buffer (or EGL back buffer or whatever) should stay inside the respective API. For a shareable buffer you would probably just pass the intersected caps/constraints into your Vulkan buffer alloc call, with the Vulkan driver talking to liballoc to do the allocation (possibly looping back into your own driver through your liballoc driver backend).
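
To illustrate what that could look like at the API level (this extension structure is pure invention, nothing like it exists yet):

    /* Hypothetical Vulkan extension struct, invented for this sketch. */
    typedef struct VkExternalAllocatorCapsInfoEXT {
        VkStructureType    sType;
        const void        *pNext;
        size_t             capsSize;
        const void        *pCaps;   /* serialized, intersected capability_set_t */
    } VkExternalAllocatorCapsInfoEXT;

    VkExternalAllocatorCapsInfoEXT capsInfo = {
        .sType    = VK_STRUCTURE_TYPE_EXTERNAL_ALLOCATOR_CAPS_INFO_EXT, /* made up */
        .capsSize = serializedSize,
        .pCaps    = serializedCaps,
    };

    VkImageCreateInfo imageInfo = {
        .sType = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO,
        .pNext = &capsInfo,   /* the driver talks to liballoc internally */
        /* ...usual image fields... */
    };
    vkCreateImage(device, &imageInfo, NULL, &image);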

@jekstrand I think I'd go for your entry 3 (although I think of them more as alloc device instances, ie. alloc_dev_t in USAGE.md). I don't think "liballoc core" should actually be a thing, because in some cases (unless we had a central allocator daemon with permission to open all of the devices) you are not going to have all the alloc_dev_ts in the same process. And we want to leave it up to someone above us to know how to serialize (wayland/binder/whatever)..

Also, fwiw, current android builds using freedreno/virgl/vc4 are not using ION.

@lynxeye-dev re: v4l or other APIs which are purely kernel-level, the idea was to have a generic/common backend for them (ie. a liballoc-v4l.so backend which knows how to use the v4l uapi to query supported formats and whatever else). Have a look at USAGE.md. (Although I should probably extend it a bit to show getting the alloc_dev_t via some hypothetical EGL extension for the case of a gl driver..)

@robclark First off, I should say that I don't intend liballoc (I guess that's what we're calling it now) to be very thick. For most operations, it will simply look up the function pointer table in the alloc_dev_t and call through to the back-end. It may be useful to have some back-ends "built in" but that doesn't mean we have a big core, it really just means that you get them for free without having to load a .so.

I don't really like how heavily your USAGE.md examples use alloc_dev_t. It feels very EGL to me, and not in a good way. It may be unavoidable, but I'm not convinced that it is. Ideally, I would like the decentralization (I'm talking about API here, not implementation) to be such that:

  1. No client ever has to explicitly open a device that they are not actively using. In other words, get rid of the #ifdef HAVE_ION stuff. If ION is needed, it should get loaded automatically.
  2. You never have to pass an alloc_dev_t across the wire. There are no guarantees that the driver on the other side is the same version so this isn't safe to do.
  3. Users of higher level APIs such as GL or Vulkan shouldn't have to touch liballoc unless they need to communicate with other components.
  4. As a corollary to the above, a client (not server) shouldn't need to touch liballoc just to do WSI. We may use liballoc to implement WSI, but users should be unaware.

The first two I consider to be pretty hard requirements. The last two could be done by providing a small amount of (possibly shared) driver sugar on top. Depending on how the API shapes up, the driver sugar may not be worth the effort and it may (probably?) be easier to just do it in the driver directly.

In order to make this work, we sort-of have to assume that the server can always open some device that can do the allocation (and I think we're stuck in server-allocated land but I don't think that's necessarily bad). Also, it assumes that the cases where one of the drivers in the exchange can't do the allocation are rare and that they're basically all solved by ION. I think those are probably reasonable assumptions.

@jekstrand You're saying you want caps/constraint querying done only through GL/Vulkan/etc., not through the allocation API itself? I don't think that's feasible. The vendors who write only a display device driver (display, not graphics) had no interest in implementing an EGL driver just to handle things like caps intersection. I doubt they'll even be able to write a Vulkan driver given the base requirements Vulkan puts in place for a device (graphics queues). At least EGL let you stub out everything.

I think there's too much focus on the simple client/server compositor graphics->display use case so far. That's certainly an interesting case to solve, but it's one of the simplest. We also have cases internally like an allocation server. From my understanding, gralloc operates in a manner similar to this internally, and I'm very interested in allowing gralloc to use this API as a backend of sorts. I doubt gralloc wants to grow a dependency on Vulkan/GL/whatever just so it can query surface capabilities.

I think a better way to solve your items (1) and (2) is to expose LUIDs/UUIDs of some sort for the devices. Each driver could decide how to generate them, and they could be enumerated from proc/sysfs (or similar) with read-only permissions by the allocator userspace drivers, avoiding the need to open sensitive files in sandboxes. Generally, if a client doesn't have access to a device, it probably shouldn't be trying to query capabilities from that device locally. So all it should need to do is recognize that the LUID/UUID in a capability it receives from an external source is NOT a device it has access to locally, rather than knowing exactly which device that LUID/UUID refers to.
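
In code, that check is about this small (all of the names and the UUID size here are placeholders):

    /* Sketch: is the device UUID embedded in a received capability set
     * one of the devices this process can open locally? */
    bool caps_device_is_local(const capability_set_t *caps)
    {
        for (unsigned i = 0; i < num_local_devices; i++) {
            if (!memcmp(local_devices[i]->uuid, caps->device_uuid, 16))
                return true;    /* we can query/allocate against it directly */
        }
        return false;           /* treat the caps as opaque foreign data */
    }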

@jekstrand to be clear (and maybe I should spiff out USAGE.md a bit more), I don't think alloc_dev_t would be passed across the wire, but rather the serialized capability_set_t. Ie. alloc_dev_t is really a ptr w/ a set of fxn ptrs to call into the various "backend" implementations (ie. one provided by the gl or vk driver, a generic one for v4l devices, a vendor one for whatever non-standard kernel uapi exists, etc).

I think I am on the same page in thinking of the alloc_dev_t base struct as mostly a table of fxn ptrs, with the "core" maybe being some shims that just call the fxn ptrs, plus some sort of loader dealing with the cases where there isn't a userspace API (ie. v4l).
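
To sketch what I mean (field names are just placeholders to talk about):

    /* alloc_dev_t as a plain vtable; nothing here is final. */
    typedef struct alloc_dev alloc_dev_t;

    struct alloc_dev {
        /* what can this device produce/consume for a given format+usage? */
        capability_set_t *(*get_capabilities)(alloc_dev_t *dev,
                                              uint32_t format, uint32_t usage);
        /* try to allocate against an intersected capability set */
        int (*allocate)(alloc_dev_t *dev, const capability_set_t *caps,
                        uint32_t w, uint32_t h, int *out_dmabuf_fd);
        void (*destroy)(alloc_dev_t *dev);
        void *priv;   /* back-end private data (device fd, etc.) */
    };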

(And fwiw, I'm just using the names liballoc/alloc_dev_t/etc. as placeholders, since we needed some names to talk about this. If it gets to be more than just who-can-allocate/share which sorts of buffers in which formats, then maybe a name like "gbm2" makes more sense. My personal hope, for the gpu end of things, is that this new thing ends up more like a small extension that sits on the side of gbm, since gbm is already well supported by many drivers and by wayland compositors, mir, and other bare-metal apps (and used by gbm-gralloc, etc).)

As far as explicitly opening devices, I think that is only needed in cases where the API used is a kernel API (kms, v4l, etc) and the client already has the device fd as their handle/context.. I expect for gl/egl, for example, we have some sort of extension that adds eglGetMeTheLibAllocDevice(dpy) which gives you back the alloc_dev_t.

I don't think a gl/vk driver would internally use "liballoc", since I don't think a gl/vk driver knows enough about the use-case to know which devices are involved.

@cubanismo Maybe the LUID/UUID approach could work.. I'm not really sure. I think whatever figures out the capability_set_t likely needs to end up opening some kernel device file(s), and in a multi-process scenario I don't think it is a good assumption that all processes can open all the devices involved. For example: a sandboxed flatpak wayland client using a privileged helper for hw video decode, whose output eventually gets passed to the wayland compositor. I do definitely agree that vendors providing a kms-only display driver, or a v4l driver, aren't going to want to provide an EGL implementation for it, so I was kinda thinking that a "liballoc" backend is the thing they provide. (And in the case of v4l, and maybe kms, we can probably provide some generic ones, although maybe a vendor wants to override? not sure..)

At any rate, I'm not sure to what extent we are saying the same or different things. So maybe writing up a few different flavors/proposals/alternatives for a USAGE.md type thing would make that more clear? I'm at least a fan of figuring out how we expect the API to be used as an early step.. if nothing else it gives us something to let potential users look at and comment on, but I do think it also makes it more clear whether we are saying the same or different things.

(and btw, maybe wiki pages would be more convenient for editing than a .md file in the git tree.. not sure.. but one way or another, putting together some example use-cases/stories seems useful)

@robclark Yeah, LUID/UUID is just a strawman. I like it for another reason though, in that I want to correlate devices in liballoc (or whatever) with devices in other APIs somehow, and it would be pretty easy to add queries in those APIs that also return an LUID/UUID. Easier than adding an alloc_dev_t query, or an EGLDevice query in liballoc.

I have it on my long list of TODOs to hack up a bit of code for the allocator lib and a test client, based on the header and your usage doc so far, if no one beats me to it. I find that's generally a good way to shake out basic design issues too.

@cubanismo btw, one question/thought.. are you anticipating that the LUID/UUID is precise enough to identify, for example, the exact model/revision of a gpu, ie. to the point where the alloc_dev_t implementation would not have to open the device to know what its capabilities were? (Ie. "this is an adreno 420 rev 2" vs "this is an adreno gpu"?) That might have been a misunderstanding on my part.. it would address some concerns about various processes perhaps not being able to open each device. It might be a bit inconvenient for the v4l sort of use-case, where the vendor isn't otherwise providing a big userspace component, although it is perhaps less of an issue where the vendor already provides a big userspace component (ie. a gl/vk driver)..

@robclark No, probably not generally. It would be nice if all drivers could export enough info in sysfs/procfs to trivially populate caps/constraints in userspace, but I doubt they all do. Ours certainly doesn't right now. In theory, we could have some giant lookup table based on the PCI device ID or something, but I'd hate to duplicate that knowledge in both kernel and userspace.

@cubanismo ok, I wasn't expecting that either, but wanted to check that I wasn't misunderstanding your idea. I guess we could invent a mechanism by which devices expose the necessary info through non-privileged sysfs or separate device files.. although adding kernel infrastructure probably isn't ideal for getting this to show up in android kernels or enterprise distro kernels any time soon. And I don't think you want to allow actual allocation through a non-privileged interface, so that wouldn't completely solve the "userspace has to do some IPC when there are containers" problem..

Woah... Lots of comments... I'll try to address stuff one-at-a-time.

> @jekstrand You're saying you want caps/constraint querying done only through GL/Vulkan/etc., not through the allocation API itself? I don't think that's feasible. The vendors who write only a display device driver (display, not graphics) had no interest in implementing an EGL driver just to handle things like caps intersection. I doubt they'll even be able to write a Vulkan driver given the base requirements Vulkan puts in place for a device (graphics queues). At least EGL let you stub out everything.

Not "only". What I didn't say above (but intended to) was that, for things such as v4l or a KMS-only (no 3D) display, liballoc would be the userspace API. What I meant was that when we do have some other API such as Vulkan or GL, it would be nice if they didn't have to think about liballoc.

> @jekstrand to be clear (and maybe I should spiff out USAGE.md a bit more), I don't think alloc_dev_t would be passed across the wire, but rather the serialized capability_set_t. Ie. alloc_dev_t is really a ptr w/ a set of fxn ptrs to call into the various "backend" implementations (ie. one provided by the gl or vk driver, a generic one for v4l devices, a vendor one for whatever non-standard kernel uapi exists, etc).

I think we're on the same page there. It's just a pointer to an internal liballoc data structure that's mostly just a table of function pointers.

> @cubanismo btw, one question/thought.. are you anticipating that the LUID/UUID is precise enough to identify, for example, the exact model/revision of a gpu, ie. to the point where the alloc_dev_t implementation would not have to open the device to know what its capabilities were?

We cannot assume that there is a direct mapping from physical device to capabilities. This is one of the reasons I don't want anyone to ever think they can serialize one. Capabilities are a function of the device, the usage, and the specific driver component using it. If your DDX doesn't want to support some compression format, it should be able to not advertise it. This is also why I'd like capabilities, as much as possible, to come from the GL/Vulkan driver rather than from liballoc directly. We may enable some compression format in GL before we enable it in Vulkan, so a UUID together with "sampled" isn't enough to figure out the capabilities.

> As far as explicitly opening devices, I think that is only needed in cases where the API used is a kernel API (kms, v4l, etc) and the client already has the device fd as their handle/context.. I expect for gl/egl, for example, we have some sort of extension that adds eglGetMeTheLibAllocDevice(dpy) which gives you back the alloc_dev_t.

I think we're mostly on the same page here. My comment (1) above was mostly directed towards the ION case. If you have a GPU and display that have trouble getting along, one or both of them should know to tell liballoc to open ION and do the allocation that way. We shouldn't make the client have to think about "what if neither of my things can allocate".

Regarding the eglGetMeTheLibAllocDevice(dpy), I think it's ok to have things work by querying the GL/Vulkan driver for the alloc_dev_t as long as a GL and Vulkan driver on the same hardware are allowed to give back different alloc_dev_t instances with slightly different behavior. (And, please, let's not get EGL involved.) It's kind of awkward for an alloc_dev_t to not directly correspond to a hardware device, but it would be ok.