Possible high CPU usage drawing lines?
Trying a test I had previously done with the nanovg library, I found that the CPU usage of fastuidraw is about 7 times higher in this particular test. Please see the attached screenshots. Both tests were compiled with gcc 7.4.0 in release mode (-O3) and I am using the fastuidraw release libraries.
The fastuidraw test code is here.
The nanovg test code is here.
It was curious that the fastuidraw CPU load started low, but after a few seconds it stabilized at about 29%. Both tests were executed with no major applications running on the machine (only system services).
Any ideas on how to modify the test to lower the fastuidraw CPU load?
The issue is that the test application is recreating the path -every- frame. This is a VERY heavy operation. To get optimal performance with FastUIDraw, create the fastuidraw::Path -once- and reuse it across many frames. Alternatively, one can create one's own fastuidraw::PainterAttributeWriter and use the methods of fastuidraw::StrokedPointPacking to pack attribute data without creating a path every frame.
A Path is internally a HEAVY object which does oodles of dynamic memory allocation up-front. Chances are the high CPU usage is from that (I suspect that you'd get a performance boost preloading Google's TCMalloc library to make memory allocation/freeing faster).
One last bit of advice: make just two paths. The first path, V, is a single vertical line from (0, 0) to (0, WindowHeight), and the second path, H, is a single horizontal line from (0, 0) to (WindowWidth, 0). Then, to draw the grid as in the example, do:
```cpp
float width = 1.0;
brush.color(0, 0, 0, 1);
painter->save();
for (float y = 0; y < win_dims.y(); y += 60.0) {
  painter->translate({0.0f, 60.0f});
  painter->stroke_path(brush, fd::PainterStrokeParams().width(width), H);
  width += 1.0f;
}
painter->restore();
```
painter->restore();
Admittedly, the above does not draw exactly the same horizontal lines as the example, but the graphical output is the same.
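To make the create-once, stroke-every-frame idea concrete, here is a rough sketch in the spirit of the snippet above. It is not code from the test: the operator<< style of building a Path follows the FastUIDraw examples but should be verified against the headers, fd is assumed to be the test's namespace alias for fastuidraw, and window_width/window_height are placeholders.

```cpp
// Built once at startup (and rebuilt only on window resize), never per frame.
fd::Path H, V;
H << fd::vec2(0.0f, 0.0f) << fd::vec2(window_width, 0.0f);   // horizontal line
V << fd::vec2(0.0f, 0.0f) << fd::vec2(0.0f, window_height);  // vertical line

// Per-frame drawing: only stroking, no Path construction and no allocation.
void draw_horizontal_grid(fd::Painter *painter, fd::PainterBrush &brush,
                          const fd::Path &H, const fd::vec2 &win_dims)
{
  float width = 1.0f;
  brush.color(0.0f, 0.0f, 0.0f, 1.0f);
  painter->save();
  for (float y = 0.0f; y < win_dims.y(); y += 60.0f) {
    painter->translate({0.0f, 60.0f});
    painter->stroke_path(brush, fd::PainterStrokeParams().width(width), H);
    width += 1.0f;
  }
  painter->restore();
}
```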
But the paths would need to be recreated when the window is resized, right?
Also, the line end points are not constant: since I want all the lines to have the same length, and thicker lines come out longer than thin lines when stroked between the same end points, I need to adjust the start and end points each time. So I would also need to scale the path in some way, right?
To make the lengths the same while stroking different widths, set the cap style of the passed StrokingStyle to flat_caps instead of the default square_caps.
Yes, the path will need to be cleared and made again on window resize. Resizing a window is already very memory intensive: not only is a new PainterSurface created, but the GL implementation also needs to allocate a new backing buffer for the window. And a window resize is a rare event compared to the number of frames an application draws (usually).
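A small, hedged sketch of the flat-caps point; whether StrokingStyle is nested in Painter, whether it has a chainable cap_style() setter, and whether stroke_path accepts it as an extra argument are all assumptions to check against the FastUIDraw headers (painter, brush, width and H are the same objects as in the earlier snippet).

```cpp
// Assumption: StrokingStyle and the flat_caps enumerator are spelled as below;
// verify against the FastUIDraw headers before relying on this.
fd::Painter::StrokingStyle style;
style.cap_style(fd::Painter::flat_caps); // flat caps: stroked length == distance between end points

// Hypothetical extra argument on the stroke_path call used earlier in this thread.
painter->stroke_path(brush, fd::PainterStrokeParams().width(width), H, style);
```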
The new test is here. Again, the CPU load starts below 10% and after approximately 20 seconds goes to ~24%.
The code calling FastUIDraw looks good to me; the high CPU usage seems insane. Just wondering: are the frame rates of the NanoVG and FastUIDraw tests the same?
At this point, if they have the same frame rates, I would use perf to see what is using all those CPU cycles.
After some work, I got the demos to run locally on my ancient 10+ year old laptop (one should include <stdexcept> when using std::runtime_error(), and I needed to tweak the cmake files to link correctly against FastUIDraw). I also see a much higher CPU usage from fastuidraw than nanovg. As to why: I do not know. This laptop does not have perf working, so I cannot pinpoint the issue, but I suspect that there is something awkward going on with how FastUIDraw is streaming the vertex/attribute data to the driver.
My only guess at this point is that, buried in src/fastuidraw/internal/private/gl_backend/painter_vao_pool.cpp, FastUIDraw creates a new VAO whenever one is requested instead of reusing them. In ideal circumstances, FastUIDraw would be written to reuse them, but it can't because of the wonkiness of OpenGL. Specifically, a VAO can only be used in the GL context that created it. No part of FastUIDraw requires that the same GL context is used all the time; indeed, one proof-of-concept I did had multiple GL contexts active at different times, and I changed from pooling the VAOs to recreating them. But this is just a hunch... and I am not completely convinced that this is the real cause.
Again, use perf to pinpoint what function or functions are using all that CPU time. Also run the tests with glfwSwapInterval(0) to see their raw performance (the nanovg test runs faster because its shader is SO simple).
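For reference, with GLFW that is a single call right after the context is made current:

```cpp
glfwMakeContextCurrent(window);
glfwSwapInterval(0); // 0 = do not wait for vblank, so the test runs uncapped
```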
Update:
- Changing FastUIDraw to reuse VAOs made no difference.
- Got perf top to work. It reports that the CPU usage is coming from kernel functions, the biggest one being clear_page().
I strongly suspect that is coming from the NVIDIA driver. I suggest running the tests on an Intel GPU; I strongly suspect that the CPU utilization will go down A LOT.
As to why: chances are that the way FastUIDraw streams attribute, index, and generic data via GL buffer objects is making the NVIDIA driver upset. How to fix it, I am not so sure, since as far as I can tell FastUIDraw is already giving GL all the hints it can (roughly the pattern sketched after this list):
- Create buffers with GL_STREAM_DRAW (I also experimented with GL_DYNAMIC_DRAW which might be more correct)
- map buffers with GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT | GL_MAP_FLUSH_EXPLICIT_BIT
- unmap buffers with glFlushMappedBufferRange and glUnmapBuffer
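A minimal sketch of that streaming pattern in raw GL; this is not FastUIDraw's actual internals, the buffer name, source pointer, and size are placeholders, and the GL entry points are assumed to come from whatever loader the project already uses.

```cpp
#include <cstring>

// 'bo' is assumed to have been created earlier with
// glBufferData(GL_ARRAY_BUFFER, capacity, nullptr, GL_STREAM_DRAW).
void stream_to_buffer(GLuint bo, const void *src, GLsizeiptr bytes)
{
  glBindBuffer(GL_ARRAY_BUFFER, bo);
  void *dst = glMapBufferRange(GL_ARRAY_BUFFER, 0, bytes,
                               GL_MAP_WRITE_BIT
                               | GL_MAP_INVALIDATE_BUFFER_BIT
                               | GL_MAP_FLUSH_EXPLICIT_BIT);
  std::memcpy(dst, src, bytes);
  glFlushMappedBufferRange(GL_ARRAY_BUFFER, 0, bytes);
  glUnmapBuffer(GL_ARRAY_BUFFER);
}
```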
At any rate, try the tests on an Intel GPU and see if the CPU utilization goes down.
Did some hacking today; doing buffer object orphaning (i.e. calling glBufferData to set the BO data) instead of mapping massively reduced the CPU load on my ancient laptop with an (ancient) NVIDIA GPU.
Sighs. Strictly speaking, this huge CPU utilization in the NVIDIA driver is, in my opinion, a driver bug, since it is supposed to be better to map a GL buffer object with invalidation and such instead of calling glBufferData(). I know that on Mesa with an Intel GPU, mapping is better. I'll post the branch with this NVIDIA fix in a separate repo.
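For comparison with the mapping sketch above, the orphaning approach looks roughly like this (again a placeholder sketch, not the branch's actual code):

```cpp
// Re-specify the buffer storage with glBufferData each time instead of mapping it;
// the driver orphans the old storage and hands back fresh memory.
void stream_by_orphaning(GLuint bo, const void *src, GLsizeiptr bytes)
{
  glBindBuffer(GL_ARRAY_BUFFER, bo);
  glBufferData(GL_ARRAY_BUFFER, bytes, src, GL_STREAM_DRAW);
}
```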
The branch at https://github.com/krogueintel/fastuidraw/tree/no-gl-buffer-mapping (i.e. branch no-gl-buffer-mapping from the repo at https://github.com/krogueintel/fastuidraw/) contains the rework of using buffer object orphaning instead of mapping buffer objects. Try this branch out, but be warned: it is the result of a few hours of hacking and I do not trust it fully.
Also, it is essentially a workaround for bad, semi-buggy behavior in the NVIDIA driver.
I was not able to run the test on a laptop with dual Intel/NVIDIA graphics with the Intel controller selected. The fastuidraw test did not recognize a recent version of OpenGL (see the initial log lines in the screenshots). But another application (our Go game engine) recognizes OpenGL 4.5 and runs OK (but with a low frame rate).
The issue of a laptop with both Intel and NVIDIA GPUs is a separate issue; please file a new issue for it. With that in mind, that your lines2 test using GLFW did not get a GL context is likely an issue of how the GL context was created: NOT a FastUIDraw issue but a GLFW issue (or an issue of using GLFW correctly). However, I see the problem easily. On Mesa with an Intel GPU, unless one requests a CORE context, the GL version maxes out at 3.0. To get a higher GL version, you need to ask for a CORE profile. To do that with GLFW, the function glfwWindowHint() together with (GLFW_OPENGL_PROFILE, GLFW_OPENGL_CORE_PROFILE), (GLFW_CONTEXT_VERSION_MAJOR, 3) and (GLFW_CONTEXT_VERSION_MINOR, 3) will give you what is necessary (see the sketch after the list below). You can also look at the FastUIDraw example code, https://intel.github.io/fastuidraw/docs/html/d4/da1/ex_framework.html, and pay attention to how it creates a GL context. One of the most irritating issues out there is that Mesa's GL context creation results are very different from those of all other GL implementations. Specifically:
- Mesa typically only returns a 3.0 GL context if the context is created the old-fashioned way or if a compatibility profile is requested. All other GL implementations for Linux (and MS-Windows) give the highest GL version possible with a compatibility profile.
- Mesa, however, will give the highest GL version possible if a core profile (with version at least 3.2) is requested. In contrast, most other GL implementations (Linux and MS-Windows) give just the version requested, but with oodles of extensions.
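For reference, the GLFW hints mentioned above look roughly like this (window size and title are placeholders):

```cpp
#include <GLFW/glfw3.h>

GLFWwindow *create_core_profile_window(void)
{
  glfwInit();
  glfwWindowHint(GLFW_CONTEXT_VERSION_MAJOR, 3);
  glfwWindowHint(GLFW_CONTEXT_VERSION_MINOR, 3);
  glfwWindowHint(GLFW_OPENGL_PROFILE, GLFW_OPENGL_CORE_PROFILE);
  return glfwCreateWindow(800, 600, "lines2", nullptr, nullptr);
}
```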
Returning to the issue of high CPU usage: did you try the fix on the branch https://github.com/krogueintel/fastuidraw/tree/no-gl-buffer-mapping on the device you were using previously? Did it also lower the CPU usage for you? If it did, I can look into merging the code into the repo's master branch and selecting the orphaning mode for NVIDIA GPUs.
- The test was changed to request an OpenGL 3.3 core profile.
- Testing the new branch on the NVIDIA desktop had this result: CPU load is stable at ~7%.
- Testing the current master on the laptop with the Intel GPU had this result: CPU load is stable at ~5%.
So it seemed to be an NVIDIA issue and it was solved by the new branch.