emersion/libliftoff

Benchmarks

emersion opened this issue · 8 comments

This would allow us to figure out whether an optimization is worthwhile and to prevent big performance regressions.

mupuf commented

Can you describe in more detail what you have in mind here? Do you plan on counting how many atomic checks you do, or would you do full integration testing and check how long certain scenarios take?

Yeah, I was thinking about adding a test with a lot of layers and properties and making sure the time taken and the number of atomic commits stay low. Full integration testing could be done manually by running the examples with a lot of layers, but would otherwise require something like VKMS.

Side note: on my machine an atomic check takes around 20µs, so within a 16ms frame budget the hard limit would be about 800 atomic test-only commits. 400 would be more reasonable and would leave room for the compositor to do other things.

Possible routes to improve performance in case buffer format/modifier isn't an issue:

  • Give up on trying all combinations and stop backtracking once we manage to put a layer in a plane, just like Weston does. This potentially gives a less optimal solution, but makes plane allocation less ioctl-intensive.
  • Do not try all combinations in one go. Split the work across multiple frames' worth of time.
  • Somehow manage to check whether a buffer can be scanned out at all? Get the usage from GBM? We probably don't want to depend on GBM, though.
mupuf commented

I think the best solution would be to take a list of constraints (time or ioctl count) and let the compositor tell you what they want. This makes it a little harder for you, as you need to cache your intermediate results, but that shouldn't take too much memory, right?

> Somehow manage to check whether a buffer can be scanned out at all?

A linear search to first check which planes a buffer can land on might be a good idea in some cases... But I guess most of the problems will go away once we're able to query the formats supported by each plane.

> I think the best solution would be to take a list of constraints (time or ioctl count) and let the compositor tell you what they want. This makes it a little harder for you, as you need to cache your intermediate results, but that shouldn't take too much memory, right?

Yeah, the issue is mostly being able to suspend and resume the plane allocation algorithm at will.

> But I guess most of the problems will go away once we're able to query the formats supported by each plane.

We can already do that: this information is in the IN_FORMATS property of each plane. So we could pre-compute the set of planes a layer can end up on to avoid doing unnecessary ioctls.

I'm mostly worried about ARM GPUs where there are a shit ton of planes and all of them can do a lot of formats. For instance, here's the Raspberry Pi 3: https://drmdb.emersion.fr/devices/ab8561d9708e

One question is: what is the worst-case scenario for the plane allocation algorithm? If all layers are compatible with all planes, we find the best solution pretty early. The worst case is when the last step of each branch in the tree fails (with 3 planes, the worst case is managing to put any layer into the first two planes but always failing to use the last one). This could happen in practice, I guess. Is there a way to mitigate this case?

Time for a new brain dump.

Goal: minimize the number of atomic test-only commits. Optimize for the worst-case scenario, i.e. all layers are compatible with all planes except the last one, and layers don't intersect.

Number of atomic test-only commits with the current algorithm: let's say we have L=5 layers (l₁, l₂, l₃, l₄, l₅) and P=3 planes (p₁, p₂, p₃). The algorithm explores the tree of all possible allocations with a DFS.

At each node, the algorithm needs to try each remaining layer; the number of nodes is the number of atomic test-only commits. Assuming L > P, there are L · (L − 1) · … · (L − P + 1) = L!/(L − P)! nodes.

I first thought this wouldn't be that bad, but it turns out to be a lot of atomic commits. For just 5 planes and 10 layers, that's ~30K commits; if each commit takes 20µs, the whole plane allocation would take over half a second.

I've considered swapping planes and layers in the algorithm (at each step, take one layer and figure out which plane is compatible). However, this wouldn't cut it: when L > P, only the first P layers would be processed. It also wouldn't reduce the complexity enough, because the graph would have P! nodes (with the 10 planes the RPi has, that means ~3.6M commits). Lastly, some drivers fail atomic commits unless the primary plane is enabled, and enabling the primary plane first is difficult with this approach.

I've tried to figure out whether optimizing the current algorithm with dynamic programming would be possible. After all, the problem at hand looks a little bit like the knapsack problem: we have a bunch of layers (objects) with a priority based on the frequency of updates (value) and we want to pick as many as possible without exceeding the bandwidth limitations (weight).

We need a way to compute the solution for P planes from a solution for P−1 planes. However, I haven't found a way to do this without saving a list of paths, effectively turning our DFS into a BFS without improving the overall complexity. In other words, I don't think we can de-duplicate operations that would otherwise be done multiple times, so I don't think dynamic programming can help.

So I've been trying to figure out a way to try as many combinations as possible and give up after a deadline. For this to work, we'd need to order our tries: we want to place high-priority layers before the others. If some layers are above a high-priority layer, we need to try to put them into a plane first. I already have some ideas for this; I just need to check that they make sense in as many situations as possible.

mupuf commented

Yeah, the additional constraints (priority order and the fixed Z-index of planes) will likely massively reduce the number of possibilities.