Investigate more optimal way to implement CoreGraphics backend
Apparently macOS (and iOS, #43) has a framework called `IOSurface` for exchanging framebuffers and textures between processes, which sounds similar to the idea behind dmabufs on Linux. I think we could use IOSurfaces for a front and back buffer, and use `IOSurfaceGetBaseAddress` to get a pointer to write into for no-copy presentation (#65)? Assuming it can work with the right pixel format.
Or are there issues with this, or a better way?
http://russbishop.net/cross-process-rendering describes how it is possible to create an `IOSurface` with a size and format, access it from the CPU, and set it as the contents of a `CALayer`.
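As a rough illustration of that write path, here is a minimal sketch (not softbuffer's implementation) of the IOSurface framework's C API used from Rust; surface creation and the `CALayer` hookup are omitted, and the fill pattern is arbitrary:

```rust
use std::ffi::c_void;

// Opaque IOSurfaceRef plus the handful of C functions this sketch needs.
#[repr(C)]
struct OpaqueIOSurface {
    _private: [u8; 0],
}
type IOSurfaceRef = *mut OpaqueIOSurface;

#[link(name = "IOSurface", kind = "framework")]
extern "C" {
    fn IOSurfaceLock(buffer: IOSurfaceRef, options: u32, seed: *mut u32) -> i32;
    fn IOSurfaceUnlock(buffer: IOSurfaceRef, options: u32, seed: *mut u32) -> i32;
    fn IOSurfaceGetBaseAddress(buffer: IOSurfaceRef) -> *mut c_void;
    fn IOSurfaceGetBytesPerRow(buffer: IOSurfaceRef) -> usize;
}

/// Write pixels straight into the surface's memory: lock, write at the
/// surface's own row stride, unlock. The surface would then be set as the
/// contents of a CALayer for presentation.
unsafe fn fill_red(surface: IOSurfaceRef, width: usize, height: usize) {
    IOSurfaceLock(surface, 0, std::ptr::null_mut());
    let base = IOSurfaceGetBaseAddress(surface) as *mut u32;
    let stride = IOSurfaceGetBytesPerRow(surface) / 4; // pixels per row
    for y in 0..height {
        for x in 0..width {
            base.add(y * stride + x).write(0x00ff_0000); // red, 0RGB layout
        }
    }
    IOSurfaceUnlock(surface, 0, std::ptr::null_mut());
}
```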
- How would writing to the `IOSurface` from the CPU perform? It would be good to have someone with a Mac that has a discrete GPU test this.
- Synchronization: how do we make sure the surface is no longer in use by the display server when re-using it?
Or possibly we could just use `CGImage` as is currently done, but with a `CGDataProvider` that reads from memory we can mutate? Presumably `CGImage`/`CGDataProvider` assume the memory isn't mutated, but we could do that once the provider is released. But that isn't so simple without any guarantees about when it will be released, and since we probably can't block waiting for that either.
Edit: See https://developer.apple.com/documentation/coregraphics/cgdataproviderreleasedatacallback:
> When Core Graphics no longer needs direct access to your provider data, your function is called. You may safely modify, move, or release your provider data at this time.
So comparing these:
`IOSurface`
- With unified memory, this should let us write directly into the memory the GPU reads, to be truly no-copy. With a discrete GPU, a DMA transfer is required to get it into GPU memory. With integrated graphics on Intel Macs, I think memory wouldn't be "unified" and it may need to copy from the portion of the memory allocated to the CPU to the portion allocated to the GPU?
- Not sure how to synchronize, and make sure the `IOSurface` is no longer in use by the display server.
`CGImage` with custom `CGDataProvider`
- Saves the copy currently happening in the softbuffer backend, but CoreGraphics still needs to upload the data to the GPU? (Into an `IOSurface` that is sent to the display server?)
  - Is there any possibility this upload could perform better than CPU access to the `IOSurface`? Presumably this is worse with unified memory, but maybe not otherwise?
- Clear behavior with a release callback when it is no longer used by CoreGraphics.
- Not sure when it will be released, and we likely need to be prepared to allocate more than 2 buffers (sketched below), but the current implementation is already allocating a new one every present.
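For illustration, here is a minimal sketch of such a pool, assuming the backend hands Core Graphics an `Arc`-backed buffer (something like the `core-graphics` crate's `CGDataProvider::from_buffer`, which holds an `Arc` until the release callback fires); the `BufferPool` type and its methods are hypothetical:

```rust
use std::sync::Arc;

/// Hypothetical pool of frame buffers shared with Core Graphics.
/// A buffer whose strong count is back down to 1 is referenced only by the
/// pool again, meaning Core Graphics has run the release callback for it.
struct BufferPool {
    buffers: Vec<Arc<Vec<u32>>>,
}

impl BufferPool {
    /// Take a released buffer of the right size, or allocate a new one
    /// (which is what the current backend does on every present).
    fn acquire(&mut self, len: usize) -> Arc<Vec<u32>> {
        match self
            .buffers
            .iter()
            .position(|b| Arc::strong_count(b) == 1 && b.len() == len)
        {
            Some(i) => self.buffers.swap_remove(i),
            None => Arc::new(vec![0; len]),
        }
    }

    /// Called at present time, after handing a clone of `buf` to the data
    /// provider: keep our own reference so the buffer can be found again
    /// once Core Graphics drops its clone.
    fn retain_for_reuse(&mut self, buf: Arc<Vec<u32>>) {
        self.buffers.push(buf);
    }
}
```

The caller would write through `Arc::get_mut` (the count is 1 right after `acquire`), clone the `Arc` into the data provider at present time, and pass the original back with `retain_for_reuse`.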
For performance concerns, benchmarking is best. But we'd need a representative benchmark, an implementation of both, and multiple types of hardware.
Oh, I forgot about buffer stride.
Testing this (#95), it looks like we can't just set the stride to always match the width, so to use `IOSurface` we'd need to provide a `Buffer::stride` method. And users of the library would have to consider that.
This would probably also be needed for #42. Or if we wanted to use dmabufs instead of shm on wayland, etc.
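For a sense of what users would have to consider, here is a minimal sketch of filling a buffer whose rows may be padded, assuming a hypothetical `stride` value (in pixels) exposed alongside the buffer:

```rust
/// Sketch only: fill a pixel buffer whose row stride (in pixels) may be
/// larger than its width, e.g. because the backing IOSurface requires row
/// alignment. Pixels are 0RGB, like softbuffer's existing examples.
fn fill_with_stride(pixels: &mut [u32], width: usize, height: usize, stride: usize) {
    assert!(stride >= width && pixels.len() >= stride * height);
    for y in 0..height {
        // Only the first `width` pixels of each row are visible; the rest is padding.
        let row = &mut pixels[y * stride..y * stride + width];
        for (x, pixel) in row.iter_mut().enumerate() {
            let red = (x % 256) as u32;
            let green = (y % 256) as u32;
            *pixel = (red << 16) | (green << 8);
        }
    }
}
```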
> someone with a Mac that has a discrete GPU
Could be me; I have a Mac right here with an AMD dGPU, as long as IOSurface exists on macOS 10.14.
https://developer.apple.com/documentation/iosurface says it was introduced in macOS 10.6 (sorry PowerMac G5 users), so that much shouldn't be an issue.
Great, I could proceed forward with:
- any test branch that (partially) implements this concept; I'd test correctness and performance with a profiler and see if I can make any further improvements
- pointers to reference implementations or other info on how this would go into `softbuffer`; I could attempt to implement this from scratch into `softbuffer` and see how it goes
- providing one of the Softbuffer members a remote desktop to my Mac (since I do not use it, and it's already set up with a working Rust + Xcode toolchain); I'd probably want to hop in a voice call and supervise / advise, so then it would be similar to pair programming I suppose
And as a bonus, implementing it all the way back on macOS 10.14 would ensure that `softbuffer` still works back to at least that version. (No reason why it shouldn't, but it's a personal goal of mine to keep those old Intels supported!)
I should be free to do any of those in around an hour :)
Reading up, it looks like you're talking about having to expose a stride, so let me introduce: `imgref`! If `softbuffer` needs a `0.4.0` for this, I'd be glad to participate in that API redesign, since I've worked with these types of signatures somewhat extensively (grumble grumble, looks at unreleased `pixels` competitor). But anyway, take a look at my proposal above and see if anything looks reasonable to you. :)
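To give an idea of what that could look like, here is a small sketch of wrapping a strided buffer in `imgref` (the dimensions are made up, and the constructor/iterator names are from my reading of `imgref`'s API, so treat this as an assumption rather than a finished design):

```rust
use imgref::Img;

fn main() {
    // Made-up dimensions; `stride` is the row length in pixels and may
    // exceed `width` when the backing storage pads rows.
    let (width, height, stride) = (4usize, 3usize, 8usize);
    let pixels = vec![0u32; stride * height];

    // `Img::new_stride` records the stride alongside width and height,
    // so callers don't have to track it separately.
    let mut img = Img::new_stride(pixels, width, height, stride);
    for (y, row) in img.rows_mut().enumerate() {
        // Each `row` is exactly `width` pixels long; padding is skipped.
        for (x, pixel) in row.iter_mut().enumerate() {
            *pixel = (((x + y) % 256) as u32) << 16; // arbitrary red gradient
        }
    }
}
```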
#95 has an implementation using IOSurface, which requires an API change to expose stride, and it updates the `winit` and `animation` examples to use this. #96 instead copies into an IOSurface on present (which requires no API changes). I did some performance testing of both on M1.
I wonder if there's a good way to automate benchmarking of softbuffer performance.
> #95 has an implementation using IOSurface, which requires an API change to expose stride, and it updates the `winit` and `animation` examples to use this. #96 instead copies into an IOSurface on present (which requires no API changes). I did some performance testing of both on M1.
> I wonder if there's a good way to automate benchmarking of softbuffer performance.
I'll check them out. I don't have an M1 to test with, but if you do, that should cover everything. My benchmark method typically tends to be instrumentation using `Instant::now()`; it's not perfect, but the margin of error is usually somewhere on the order of milliseconds, and copies of large buffers are usually much more expensive than that, so it should be good. (I'll figure it out when I have my paws on some local tests)
Once I have some thoughts I'll leave them on the relevant PR, or here if they affect both or are in general.
Alright, so based on my testing, for total render times:
- `copy-to-iosurface` spikes to 33ms for the first fullscreen frame, then 22ms for each subsequent frame
- `master` spikes up to 22ms for the first fullscreen frame, then 7ms for each subsequent frame
- `iosurface-wip` spikes up to 16ms for the first fullscreen frame, then 16ms for each subsequent frame
I think the 16ms might be a fluke here; it makes you think it might be vsync, but it's consistently lower than 16ms for small windows and consistently higher than 16ms for larger-than-screen windows. In fullscreen, it doesn't seem to ever take longer than 18ms or so, but this is still beat by `master`'s 7ms.
Also, `copy-to-iosurface` is clearly worthless and should be scrapped, as benchmarks prove that more copies won't help anything. /hj
Here are some more detailed breakdowns per-branch:
- `master`:
  - `buffer: 1600x1200 resize: 0us fill: 6028us present: 42us`
  - `buffer: 1600x1200 resize: 0us fill: 3973us present: 19783us`
  - `buffer: 2880x1800 resize: 0us fill: 16733us present: 25us`
  - `buffer: 1600x1200 resize: 0us fill: 3951us present: 20us`
  - `buffer: 2880x1800 resize: 0us fill: 10252us present: 20us`
  - `buffer: 1600x1200 resize: 0us fill: 4080us present: 15us`
- `copy-to-iosurface`:
  - `buffer: 1600x1200 resize: 4877us fill: 3542us present: 4655us`
  - `buffer: 1600x1200 resize: 0us fill: 3450us present: 1811us`
  - `buffer: 2880x1800 resize: 6757us fill: 13651us present: 12984us`
  - `buffer: 1600x1200 resize: 108us fill: 3606us present: 4900us`
  - `buffer: 2880x1800 resize: 879us fill: 8938us present: 12655us`
  - `buffer: 1600x1200 resize: 114us fill: 3368us present: 5558us`
- `iosurface-wip`:
  - `buffer: 1600x1200 resize: 0us fill: 7580us present: 51us`
  - `buffer: 1600x1200 resize: 0us fill: 6901us present: 736us`
  - `buffer: 2880x1800 resize: 0us fill: 17087us present: 23us`
  - `buffer: 1600x1200 resize: 0us fill: 6369us present: 25us`
  - `buffer: 2880x1800 resize: 0us fill: 16601us present: 27us`
  - `buffer: 1600x1200 resize: 0us fill: 6400us present: 24us`
Now Wait Just A Minute, there's something fishy here.
Let's see:
- `copy-to-iosurface`, of course, always takes an ungodly amount of time to present, because of course it does. However, the resize and fill times are basically identical to `master` (makes sense, since they use the same style of managed buffer). `copy-to-iosurface` is a strict downgrade.
- `iosurface-wip` has the lowest maximum present time of all of them, just 736μs compared to `master`'s occasional 19783μs (woah) and `copy-to-iosurface`'s 12984μs. However, it has the highest fill time - it somehow takes longer to write into the buffer in the first place.
This makes me wonder if IOSurface is somehow magical! The memory backing it seems to somehow be more expensive than normal memory; perhaps it's some sort of MMIO or something. Anyway, this prompted me to do some more testing. My method of filling buffers quickly is to use `rayon` to fill it using multiple threads (a rough sketch of that fill is at the end of this comment), so let's try that:
- `master`:
  - `buffer: 1600x1200 resize: 0us buffer_mut: 8us fill: 4699us present: 45us total: 4707us`
  - `buffer: 1600x1200 resize: 0us buffer_mut: 984us fill: 1494us present: 19263us total: 2479us`
  - `buffer: 2880x1800 resize: 0us buffer_mut: 2us fill: 8364us present: 22us total: 8367us`
  - `buffer: 1600x1200 resize: 0us buffer_mut: 767us fill: 1201us present: 24us total: 1968us`
  - `buffer: 2880x1800 resize: 0us buffer_mut: 2064us fill: 2797us present: 23us total: 4862us`
  - `buffer: 1600x1200 resize: 0us buffer_mut: 800us fill: 1307us present: 14us total: 2107us`
- `copy-to-iosurface`:
  - `buffer: 1600x1200 resize: 4791us buffer_mut: 0us fill: 2454us present: 5828us total: 7246us`
  - `buffer: 1600x1200 resize: 0us buffer_mut: 0us fill: 1773us present: 1709us total: 1773us`
  - `buffer: 2880x1800 resize: 6402us buffer_mut: 0us fill: 4993us present: 12193us total: 11396us`
  - `buffer: 1600x1200 resize: 86us buffer_mut: 0us fill: 1435us present: 4449us total: 1522us`
  - `buffer: 2880x1800 resize: 879us buffer_mut: 0us fill: 2782us present: 13297us total: 3662us`
  - `buffer: 1600x1200 resize: 83us buffer_mut: 0us fill: 2339us present: 4824us total: 2423us`
- `iosurface-wip`:
  - `buffer: 1600x1200 resize: 0us buffer_mut: 682us fill: 4535us present: 67us total: 5218us`
  - `buffer: 1600x1200 resize: 0us buffer_mut: 63us fill: 4093us present: 454us total: 4157us`
  - `buffer: 2880x1800 resize: 0us buffer_mut: 133us fill: 9323us present: 47us total: 9456us`
  - `buffer: 1600x1200 resize: 0us buffer_mut: 117us fill: 3933us present: 22us total: 4050us`
  - `buffer: 2880x1800 resize: 0us buffer_mut: 162us fill: 8543us present: 35us total: 8706us`
  - `buffer: 1600x1200 resize: 0us buffer_mut: 87us fill: 3817us present: 37us total: 3905us`
Much better?
As far as I can tell, `iosurface-wip` is the way to go, because it's a lot more consistent than `master` even if it's slightly slower to write. Meanwhile, `copy-to-iosurface`... yeah. Throw it in the bin, lol.
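For reference, the multi-threaded fill mentioned above probably looks something like this sketch (the exact test code isn't in the thread, so the gradient and chunking are my own illustration):

```rust
use rayon::prelude::*;

/// Split the pixel buffer into rows and fill them in parallel with rayon.
/// `pixels` is assumed to be tightly packed (stride == width) here.
fn parallel_fill(pixels: &mut [u32], width: usize) {
    pixels
        .par_chunks_mut(width)
        .enumerate()
        .for_each(|(y, row)| {
            for (x, pixel) in row.iter_mut().enumerate() {
                let red = (x % 256) as u32;
                let green = (y % 256) as u32;
                *pixel = (red << 16) | (green << 8); // 0RGB
            }
        });
}
```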
ideally these are also tested on apple silicon, to see how it behaves with the unified gpu memory
> ideally these are also tested on apple silicon, to see how it behaves with the unified gpu memory
Of course, I was assuming that @ids1024 (or someone else) would get back to me with comparisons on ASi to see if iosurface-wip really is the best choice for both, but it seems like that hasn't happened yet.
what's the easiest way to repro your test?
> what's the easiest way to repro your test?
Instrument the code with some `Instant::now()`s, then `eprintln!("took {}us", (b - a).as_micros());` at the end of the frame. I don't have an exact diff.
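Concretely, that kind of instrumentation amounts to something like this (a generic sketch, not the actual diff used for the numbers above):

```rust
use std::time::Instant;

fn main() {
    // One frame's worth of timing: take a timestamp, do the work,
    // then print the elapsed time in microseconds.
    let frame_start = Instant::now();
    // ... resize the surface, fill the buffer, present ...
    eprintln!("took {}us", frame_start.elapsed().as_micros());
}
```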
On the `master` branch, the `winit` example consumes a very large amount of memory when continuously resized. This issue seems to be fixed on the `iosurface-wip` branch. I tested this on an M1 Mac.