EmbarkStudios/puffin

QoL improvement for displaying GPU profiling information

DBouma opened this issue · 5 comments

The problem

Currently, if you want to display GPU timestamps in puffin, you are forced to open a stream and manually convert all your measured timings from the GPU timeline to the associated CPU timeline. Since GPU work always executes an arbitrary amount of time after a submit call, we would need to either:
A: Wait on a fence before proceeding to the next frame, to associate GPU timings with the currently measured frame
B: Read back the timings from n frames ago and display them in the current puffin frame
C: Modify a previously written frame with the measured GPU timings

In case A, puffin displays the total CPU frame plus the time elapsed from CPU work submission to GPU work execution, along with the total GPU frame. While technically correct, this results in a total frame time that is significantly higher than the app's actual frame time, which is not very useful for profiling since CPU and GPU work overlap in reality. It is also not an option in general, as it forces single buffering.

In case B, puffin displays the GPU frame measured n frames ago alongside puffin's currently measured CPU frame times. This is also technically correct, but suffers from the same issue as case A (the total frame time is not very useful). However, it does allow double/triple buffering without too much trouble.

Case C is not something we use puffin for, since we use it for realtime visualization of CPU and GPU workloads. It could be worked around by allowing a frame visualization delay, but that would be a separate issue altogether.

Suggestion

I believe puffin would greatly benefit from the ability to separate timelines. Currently, data is separated per thread but still shares the same timeline, and thus the same total execution time. For GPU timings it makes more sense to define a separate timeline with its own "frame time", while potentially still being able to associate CPU and GPU work.

Alternatives

We currently have a small local workaround that shifts the measured GPU timings to the start of the current puffin frame, essentially sidestepping the total-frame-time problem. However, this workaround is not ideal and loses the context of where on the CPU timeline the GPU workload was actually fired off.
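The shift workaround described above boils down to simple timestamp arithmetic. A minimal sketch (all types and names here are hypothetical, not puffin's actual API):

```rust
// Sketch of the local workaround described above: shift GPU scope
// timestamps so the earliest GPU scope starts at the current puffin
// frame start. `GpuScope` is a hypothetical stand-in type.

/// A measured GPU scope, already converted to nanoseconds.
#[derive(Debug, Clone, PartialEq)]
struct GpuScope {
    name: &'static str,
    start_ns: i64,
    end_ns: i64,
}

/// Shift all scopes so the earliest one begins at `frame_start_ns`.
/// This preserves relative GPU timings but loses the real submission
/// context, which is exactly the drawback mentioned above.
fn shift_to_frame_start(scopes: &mut [GpuScope], frame_start_ns: i64) {
    let Some(first) = scopes.iter().map(|s| s.start_ns).min() else {
        return; // nothing to shift
    };
    let offset = frame_start_ns - first;
    for s in scopes.iter_mut() {
        s.start_ns += offset;
        s.end_ns += offset;
    }
}
```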

VZout commented

I think this would be a great improvement! Possibly behind a feature toggle? If you're going to give this a shot and run into problems, feel free to reach out and we can look at solutions together.

One solution we discussed was to display a total duration "per thread" stream, i.e. right after puffin::ThreadInfo::name. This would still poison the total frame times used in the "recent" and "slowest" frame views, though, which appear to be much more complicated to untangle.

For recombining a CPU recording with the GPU timestamps, we envision some small refactors that make it possible to take the GlobalReporter's recording (the FrameData, with perhaps multiple streams) out at the end of a frame and substitute a new/empty one. The stream would be held on to until the GPU frame is complete, at which point everything is translated and sent to a sink at once.
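The hold-until-complete flow above can be sketched as a queue of pending frames keyed by a fence value, drained once the GPU signals that value. All types here (`PendingFrame`, `DelayedSink`, the `String` stand-in for FrameData) are hypothetical, not puffin's real API:

```rust
// Sketch: hold each frame's recorded streams until its GPU work
// completes, then release everything to the sink at once.
use std::collections::VecDeque;

struct PendingFrame {
    /// Fence value that signals this frame's GPU work is done.
    fence_value: u64,
    /// The CPU-side recording taken out of the global reporter
    /// (a String stand-in for puffin's FrameData).
    frame_data: String,
}

struct DelayedSink {
    pending: VecDeque<PendingFrame>,
    delivered: Vec<String>,
}

impl DelayedSink {
    fn new() -> Self {
        Self { pending: VecDeque::new(), delivered: Vec::new() }
    }

    /// Called at end-of-frame: swap out the recording and hold it.
    fn hold(&mut self, fence_value: u64, frame_data: String) {
        self.pending.push_back(PendingFrame { fence_value, frame_data });
    }

    /// Called once per frame with the fence's completed value: release
    /// every held frame whose GPU work has finished.
    fn drain_completed(&mut self, completed_fence: u64) {
        while let Some(front) = self.pending.front() {
            if front.fence_value > completed_fence {
                break; // GPU work for this frame is not done yet
            }
            let frame = self.pending.pop_front().unwrap();
            self.delivered.push(frame.frame_data);
        }
    }
}
```

With double/triple buffering this naturally delivers frames a few frames late, which matches case B from the issue description.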

Finally, for mapping the GPU timeline to the CPU timeline, we currently perform some improper tricks, such as calling puffin::now_ns() at around the same time we "calibrate" the clock through the respective graphics APIs (not that it matters much once the GPU and CPU timelines are shifted to overlap). A better solution would be to map the CLOCK_MONOTONIC or QueryPerformanceCounter result from those graphics APIs to puffin::now_ns() directly, but even if we get access to puffin's static value for std::time::UNIX_EPOCH.elapsed(), there is no way to take the other raw value out of Instant::now().
Fortunately, it seems we can easily replace ThreadProfiler::now_ns with a custom function that returns CLOCK_MONOTONIC/QueryPerformanceCounter directly 😄

VZout commented

We currently do the same trick with puffin::now_ns(). Would definitely be nice to replace that with something proper. A function that returns the clock seems like an easy solution.

I'm not too familiar with the code base yet, but what you're suggesting makes sense to me. @emilk Do you have any opinions on this?

I'd implement the syscalls that feed Instant by hand in now_ns() (as mentioned above, the underlying raw value is not accessible), so that we have direct access to the raw value that matches what the graphics APIs provide.

Next, we expose both the UNIX start timestamp and the raw clock timestamp publicly to the user. They can then do the calculation themselves: translate a GPU time to CLOCK_MONOTONIC/QueryPerformanceCounter first, using the clock calibration from the graphics API, then to puffin time by adding the UNIX start timestamp.
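The two-step translation above is plain offset arithmetic. A sketch under stated assumptions (all names hypothetical; the calibration pair would come from something like VK_EXT_calibrated_timestamps or ID3D12CommandQueue::GetClockCalibration, and the tick period from the device's timestampPeriod):

```rust
// Sketch of the translation: GPU tick -> raw CPU monotonic ns (via a
// calibration pair from the graphics API) -> puffin's UNIX-anchored
// timeline (by adding the UNIX start offset). Names are hypothetical.

/// One calibration sample: a GPU tick and the raw CPU monotonic
/// nanosecond value captured "at the same time" by the graphics API.
struct Calibration {
    gpu_tick: u64,
    cpu_monotonic_ns: i64,
    /// Nanoseconds per GPU tick (timestampPeriod in Vulkan terms).
    ns_per_gpu_tick: f64,
}

/// Step 1: GPU tick -> raw CPU monotonic nanoseconds.
fn gpu_tick_to_monotonic_ns(cal: &Calibration, tick: u64) -> i64 {
    let delta_ticks = tick as i64 - cal.gpu_tick as i64;
    cal.cpu_monotonic_ns + (delta_ticks as f64 * cal.ns_per_gpu_tick) as i64
}

/// Step 2: raw CPU monotonic nanoseconds -> puffin's UNIX-anchored
/// timeline, given the monotonic and UNIX values sampled together at
/// profiler start (the values this proposal would expose publicly).
fn monotonic_to_puffin_ns(
    monotonic_ns: i64,
    monotonic_start_ns: i64,
    unix_start_ns: i64,
) -> i64 {
    unix_start_ns + (monotonic_ns - monotonic_start_ns)
}
```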

I can PR this if everyone agrees on this solution :)

emilk commented

I believe puffin would greatly benefit from the ability to separate timelines. Currently, data is separated per thread but still shares the same timeline, and thus the same total execution time. For GPU timings it makes more sense to define a separate timeline with its own "frame time", while potentially still being able to associate CPU and GPU work.

If you want separate timelines, then it sounds like you should just report CPU times and GPU times to two different places! Use puffin::GlobalReporter for the CPU time (for convenience with the macros), and then report GPU times into a separate puffin::FrameView. When visualizing, use two separate puffin_egui::ProfilerUi 🤷‍♂️