matplotlib/mplcairo

Optimal usage for better rendering speed for multiple (video) frames based on mplcairo?

s-m-e opened this issue · 21 comments

s-m-e commented

I am using matplotlib and mplcairo.base for rendering videos. Since I am using pillow for some post-processing of the individual video frames, I am (a) triggering figure.canvas.draw() and (b) converting the figure to a pillow image by invoking PIL.Image.fromarray(figure.canvas.renderer.buffer_rgba()). I am intentionally creating a new figure per video frame (though this could be simplified in one way or another). For a complete self-contained example, see below.

On a single core of a somewhat dated 3rd-gen i7, I am getting about 10 frames per second based on this approach. I was wondering whether there is any potential for further optimizations (with respect to the usage of matplotlib and mplcairo.base) that I may have overlooked.

import mplcairo.base # import before matplotlib (?)
import matplotlib
matplotlib.use("module://mplcairo.base", force = True) # use mplcairo.base as non-GUI backend
import matplotlib.pyplot as plt # import pyplot last

from PIL.Image import fromarray

from tqdm import tqdm

FRAMES = 120
DPI = 100
x = list(range(FRAMES))
y = [item ** 2 for item in x]

stream = open('/dev/null', 'wb') # target for images ... something like ffmpeg

for frame_number in tqdm(range(FRAMES)):

    fig = plt.figure(figsize = (1920 / DPI, 1080 / DPI), dpi = DPI)

    ax = fig.subplots()
    ax.plot(x[frame_number], y[frame_number]) # simple test plot

    fig.canvas.draw() # draw figure (appears to be required?)
    image = fromarray(fig.canvas.renderer.buffer_rgba()) # convert image to PIL object

    plt.close(fig) # destroy figure

    # some post-proc on "image" ...

    image.save(stream, 'bmp') # send image to stream
    stream.flush() # flush the stream's buffer
    image.close() # destroy image

stream.close()
  • You are creating a new figure and axes at every iteration of the loop. This is quite slow; you should instead reuse the same axes object (even better, reuse the same Line2D object and just update its data). Not triggering autoscaling at every iteration would also help.
  • On mplcairo's side, the slowest part is (by far!) the conversion from cairo's internal format (premultiplied ARGB32) to the format that pillow wants (non-premultiplied RGBA8888); see e.g. https://en.wikipedia.org/wiki/RGBA_color_model#Representation. You can access the raw renderer buffer with canvas.renderer._get_buffer(). (There's supposed to be a public API for that via mplcairo.get_raw_buffer(canvas), but it was broken until I read this report, so thanks :) - I just pushed a fix.) You can then likely stream the raw bytes to ffmpeg together with something like -pix_fmt bgra, although I don't know exactly what ffmpeg does wrt. alpha. More generally, better documenting how to achieve this last part (and its interplay with matplotlib.animation) would be nice.

From a quick profiling I guess you can probably gain a factor of 2 with these changes? (Consider using something like https://github.com/benfred/py-spy for this kind of investigation.)
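To illustrate the ffmpeg route from the second bullet: a minimal sketch of piping raw BGRA frames over stdin. The exact flag set is my assumption of a reasonable invocation, not something mplcairo documents, and `ffmpeg_rawvideo_args` is a hypothetical helper:

```python
import subprocess

def ffmpeg_rawvideo_args(width, height, fps, outfile):
    """Build an ffmpeg command that reads raw BGRA frames from stdin.

    -pix_fmt bgra matches cairo's premultiplied ARGB32 on little-endian
    machines; how ffmpeg treats the (premultiplied) alpha needs checking,
    as noted above.
    """
    return [
        "ffmpeg", "-y",
        "-f", "rawvideo", "-pix_fmt", "bgra",
        "-s", f"{width}x{height}", "-r", str(fps),
        "-i", "-",          # frames arrive on stdin
        outfile,
    ]

# Usage (requires ffmpeg on PATH):
# proc = subprocess.Popen(ffmpeg_rawvideo_args(1920, 1080, 30, "out.mp4"),
#                         stdin=subprocess.PIPE)
# ...per frame: proc.stdin.write(bytes(fig.canvas.renderer._get_buffer()))
# proc.stdin.close(); proc.wait()
```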

s-m-e commented

You are creating a new figure and axes at every iteration of the loop [...]

I know that this is less than ideal, but that's another playing field. I am trying to develop a library where the management of figures is left to the user (see here for an example). The user may or may not do what you suggest. I am "simply" trying to "extract" an image from the figure within my library.

[...] conversion from cairo's internal format (premultiplied ARGB32) to the format that pillow wants (non-premultiplied RGBA8888) [...]

Thanks a lot, this is very helpful. The calls to buffer_rgba account for about a third of the runtime of my example. I am actually relying on pillow for post-processing and compositing, so I guess I'd have to optimize this very part: the hand-off from mplcairo to pillow (?). Off the top of your head: do you see any (theoretical) room for improvement for this particular task within your code?

[...] but that was broken until I read this (so thanks for the report :))

You're welcome. I spent a few hours trying to get access to the raw buffer and thought it was me ... until I gave up ;)

Do you see any (theoretical) room for improvement for this particular task within your code?

That's some very boring numpy (see the various cairo_to_foo functions in _util.py). You could try rewriting it in C(++) as there's usually some non-negligible overhead from numpy, though I have no idea of how much. You could also try to PR support for ARGB32 into Pillow :)
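For reference, the conversion in question amounts to a byte swizzle plus an un-premultiply step. Here is a rough numpy sketch of the idea - this is not mplcairo's actual code, and it assumes a little-endian ARGB32 buffer (i.e. BGRA byte order in memory):

```python
import numpy as np

def cairo_argb32_to_rgba8888(buf):
    """Convert premultiplied little-endian ARGB32 (BGRA in memory, as cairo
    stores it) to non-premultiplied RGBA8888 (what Pillow expects).

    Illustrative sketch only, not mplcairo's implementation.
    """
    buf = np.asarray(buf, dtype=np.uint8)
    a = buf[..., 3].astype(np.uint32)
    safe_a = np.where(a == 0, 1, a)        # avoid division by zero
    out = np.zeros_like(buf)
    # BGRA -> RGBA, un-premultiplying each color channel by alpha
    # (with rounding: c = round(p * 255 / a)).
    for dst, src in enumerate((2, 1, 0)):  # R <- byte 2, G <- byte 1, B <- byte 0
        chan = buf[..., src].astype(np.uint32)
        out[..., dst] = np.where(
            a == 0, 0, (chan * 255 + safe_a // 2) // safe_a
        ).astype(np.uint8)
    out[..., 3] = buf[..., 3]
    return out
```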

I moved the converters to C++. Indeed they are much faster now; please give them a try.
I'll keep this open re: documentation of high-performance (conversionless) interaction with matplotlib.animation, though.

s-m-e commented

Wow, thanks a lot. It's the stuff in master, commit c3c1732, right?

Btw: I did not have time to poke around this library in depth (yet), but I noticed that you are using pycairo (rather than one of the many other Python bindings for cairo). Is there some place where you directly/publicly expose the pycairo context object and/or the respective image surface object?

Just for comparison, I'd like to test and benchmark this: I believe pillow can handle the conversion itself, if I understand the docs correctly. I just do not know about the speed, but pillow can handle mode = "RGBa" (pre-multiplied alpha) for importing images at least, which (I guess) results in pillow converting the image internally. I tested it with the frombuffer method on pycairo image surface objects and it appeared to work.

s-m-e commented

This shaves a lot of runtime off of my example. I turned the turbo of my CPU off for somewhat cleaner results ...

  • mplcairo 0.3 (official manylinux): 7.75 fps
  • mplcairo 0.3 (built locally, openSUSE Leap 15.1 - just for reference): 7.75 fps
  • mplcairo 0.3.post42+gc3c1732: 12.05 fps

That's a 1.5x improvement. (The version number of the git-based build is a bit odd.)

For comparison, I left the conversion to pillow (and validated the results against your implementation):

  • pillow variation 1 (split): 10.79 fps
    image = frombuffer(mode = "RGBa", size = (1920, 1080), data = fig.canvas.renderer._get_buffer())
    b, g, r, a = image.split()
    image = merge('RGBa', (r, g, b, a)).convert("RGBA")
  • pillow variation 2 (numpy slicing): 8.57 fps
    image = fig.canvas.renderer._get_buffer()
    image[..., :3] = image[..., 2::-1]
    image = frombuffer(mode = "RGBa", size = (1920, 1080), data = image).convert("RGBA")

pillow on its own is not too bad (though still a little slower than your implementation), but as soon as numpy gets "too involved", speed drops ...

Thanks a lot!


Btw, gcc (7.5.0) throws a few warnings when compiling your latest changes.

In file included from src/_unity_build.cpp:2:0:
src/_mplcairo.cpp: In member function ‘pybind11::array_t<unsigned char> mplcairo::Region::get_st_rgba8888_array()’:
src/_mplcairo.cpp:116:37: warning: unused variable ‘x0’ [-Wunused-variable]
   auto const& [x0, y0, width, height] = bbox;
                                     ^
src/_mplcairo.cpp:116:37: warning: unused variable ‘y0’ [-Wunused-variable]
src/_mplcairo.cpp: In member function ‘pybind11::bytes mplcairo::Region::get_st_argb32_bytes()’:
src/_mplcairo.cpp:138:37: warning: unused variable ‘x0’ [-Wunused-variable]
   auto const& [x0, y0, width, height] = bbox;
                                     ^
src/_mplcairo.cpp:138:37: warning: unused variable ‘y0’ [-Wunused-variable]
src/_mplcairo.cpp: In destructor ‘mplcairo::GraphicsContextRenderer::~GraphicsContextRenderer()’:
src/_mplcairo.cpp:294:36: warning: unused variable ‘pathspec’ [-Wunused-variable]
     for (auto& [pathspec, font_face]: detail::FONT_CACHE) {

Wow, thanks a lot. It's the stuff in master, commit c3c1732, right?

Yes.

Btw: I did not have time to poke around this library in depth (yet), but I noticed that you are using pycairo (instead of plenty of other Python bindings for cairo).

Only on Unix. This is mostly a way to make sure that libcairo.so is present on the system and can be loaded (I basically delegate that task to pycairo, and then reuse the library handle from pycairo). I could have used cairocffi instead, but pycairo additionally lets me support gtk_native rendering (see from_pycairo_ctx), which necessarily relies on pycairo, so that was the tiebreaker.

Is there some place where you directly/publicly expose the pycairo context object and/or the respective image surface object?

No, but that should again be easily implementable. Do you want to give it a try? :p Although if all you care about is benchmarking, see the answer below.

Just for comparison, I'd like to test and benchmark this: I believe pillow can handle the conversion itself, if I understand the docs correctly. I just do not know about the speed, but pillow can handle mode = "RGBa" (pre-multiplied alpha) for importing images at least, which (I guess) results in pillow converting the image internally. I tested it with the frombuffer method on pycairo image surface objects and it appeared to work.

I don't think you need to get a pycairo image surface for that; get_raw_buffer should really have minimal overhead as there's no copy involved whatsoever (I guess I could also return a memoryview instead of a numpy array for even less overhead, but hopefully that doesn't matter too much?...).

This shaves a lot of runtime off of my example.

I would still strongly suggest moving figure/axes instantiation out of the loop and time only the call to fig.canvas.draw(); even though I realize that you may not be able to control that, the only part I can help with in mplcairo is that one call. If you want to speed up anything else, you should report on the Matplotlib bug tracker.

Btw, gcc (7.5.0) throws a few warnings when compiling your latest changes.

Looks like these were fixed somewhere between gcc 7.5 and 10.2 but sure, I can improve that.

s-m-e commented

No, but that should again be easily implementable. Do you want to give it a try? :p Although if all you care about is benchmarking, see the answer below.

Benchmarking is just part of my problems, but this would probably deserve another issue. I was simply wondering whether I could allow a user to directly interact with a pycairo object as my library is already built around pycairo in other places. So if I find some time, sure, I'll give it a try :)

I would still strongly suggest moving figure/axes instantiation out of the loop and time only the call to fig.canvas.draw(); even though I realize that you may not be able to control that, the only part I can help with in mplcairo is that one call. If you want to speed up anything else, you should report on the Matplotlib bug tracker.

Thanks for the suggestion and all the help. Re-using a figure makes a lot of sense if you know what you are doing; from practical experience, a lot of "casual" matplotlib users really do not, I am afraid. I am essentially trying to build a wrapper that allows rendering video frames in parallel and out of order, which makes the entire re-use bit a little tricky for an average end-user. Therefore, from a design perspective, my best bad idea is to generate a new figure per video frame by default - unless a user explicitly disagrees with this approach. In comparison, matplotlib.animation follows fundamentally different design paradigms (targeting somewhat different use-cases). I am trying to build a more generic system in which I can combine / composite images from multiple plot systems, a task I come across rather often. I have been hacking stuff like this around matplotlib.animation, but it usually ends up rather annoying, inflexible and unmaintainable. Anyway, I am very open to suggestions for providing a robust and more sensible default approach to potential users.

get_raw_buffer should really have minimal overhead as there's no copy involved whatsoever (I guess I could also return a memoryview instead of a numpy array for even less overhead, but hopefully that doesn't matter too much?...).

I'll play a bit more and let you know what I find.

Looks like these were fixed somewhere between gcc 7.5 and 10.2 but sure, I can improve that.

No need to fix it from my perspective. I posted it just in case you missed something. SUSE Leap runs on a rather dated toolchain, I know.

re: figure.canvas.draw(): Certainly I understand that many users will just be recreating figures all the time. I think better docs can help there, but really, my only point was that you asked a question about performance, and if you want to benchmark mplcairo's performance then you should use fig.canvas.draw() because stuff outside of it is, well, outside of mplcairo's control.

re: get_raw_buffer performance: you could try changing _get_buffer and the functions down the chain to return py::memoryview, which should mostly be a matter of calling py::memoryview::from_buffer (https://pybind11.readthedocs.io/en/stable/advanced/pycpp/numpy.html#memory-view) at the lowest level. Do let me know if you see any speedup there.

Looks like

diff --git i/src/_mplcairo.cpp w/src/_mplcairo.cpp
index 40e68d4..287f9c9 100644
--- i/src/_mplcairo.cpp
+++ w/src/_mplcairo.cpp
@@ -564,6 +564,18 @@ void GraphicsContextRenderer::_show_page()
   cairo_show_page(cr_);
 }
 
+py::object GraphicsContextRenderer::_get_context()
+{
+#ifndef _WIN32
+  cairo_reference(cr_);
+  return py::reinterpret_steal<py::object>(
+    PycairoContext_FromContext(cr_, &PycairoContext_Type, nullptr));
+#else
+  throw std::runtime_error{"_get_context is not available on Windows"};
+#endif
+}
+
 py::array GraphicsContextRenderer::_get_buffer()
 {
   return image_surface_to_buffer(cairo_get_target(cr_));
@@ -2113,6 +2125,7 @@ Only intended for debugging purposes.
     .def("_set_metadata", &GraphicsContextRenderer::_set_metadata)
     .def("_set_size", &GraphicsContextRenderer::_set_size)
     .def("_show_page", &GraphicsContextRenderer::_show_page)
+    .def("_get_context", &GraphicsContextRenderer::_get_context)
     .def("_get_buffer", &GraphicsContextRenderer::_get_buffer)
     .def("_finish", &GraphicsContextRenderer::_finish)
 
diff --git i/src/_mplcairo.h w/src/_mplcairo.h
index 15756f0..9526acf 100644
--- i/src/_mplcairo.h
+++ w/src/_mplcairo.h
@@ -68,6 +68,7 @@ class GraphicsContextRenderer {
   void _set_metadata(std::optional<py::dict> metadata);
   void _set_size(double width, double height, double dpi);
   void _show_page();
+  py::object _get_context();
   py::array _get_buffer();
   void _finish();

is enough for you to get the context via renderer._get_context() (and then the surface via get_target()), although it's quite likely that I got some refcounting wrong somewhere.

I pushed mplcairo.get_context(canvas) to master.

Actually, now that I have exposed get_context and conversion to RGBA8888 has been greatly sped up, I doubt that switching between sending ARGB and BGRA to ffmpeg really matters wrt. speed, especially considering all the other slowdowns which are out of my control (inter-process communication and ffmpeg's own encoding), so I'll close this for now. Thanks for the report, and feel free to request a reopen if you can point out other things that can be improved on my side.

s-m-e commented

Thanks for all the help. I've tried to document the usage of matplotlib in the context of my library. I'd very much appreciate a brief review of those words :)

Just for my understanding: when do you plan on releasing a new version of this package including the above-mentioned changes? I'd add some info to my documentation saying that things "become faster" with mplcairo >= x.y.z.

Regarding a new release of mplcairo: looks like the Windows CI build just broke again, and I cannot repro the failure locally (nor can I fix the CI by reverting to a previously successful commit) :/ So I'll have to investigate that first and can't make any guarantees re: time.

Regarding your docs (which are super complete, that's an impressive tool!): clear() + plot() is not the best way to go; rather, you should create a Line2D object once (e.g. plot empty data with self._line, = self._ax.plot([], []) at the beginning) and then only update the data in the line (self._line.set_data(...)). Note that this will not re-autoscale (unless you call relim() and autoscale_view() after setting the data), but skipping that step (as well as the new Line2D instantiation) should help with performance.
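The suggestion above, spelled out as a minimal sketch (using the Agg backend so it runs anywhere; with mplcairo installed you would select module://mplcairo.base instead):

```python
import matplotlib
matplotlib.use("Agg")  # any non-GUI backend; mplcairo.base works the same way
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
line, = ax.plot([], [])   # create the Line2D once, with empty data
ax.set_xlim(0, 120)       # fixing the limits up front skips autoscaling
ax.set_ylim(0, 120 ** 2)

def render_frame(xs, ys):
    line.set_data(xs, ys)  # swap data only; no clear(), no new artists
    # If the limits must track the data, re-autoscale explicitly:
    # ax.relim(); ax.autoscale_view()
    fig.canvas.draw()

for frame in range(3):     # tiny demo loop
    xs = list(range(frame + 1))
    render_frame(xs, [x ** 2 for x in xs])
```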

s-m-e commented

Release: Thanks for the effort, noted.

Docs: Also thanks. I agree that this example is less than ideal. I was looking for some middle ground for a relatively easy "getting started" example, in that it is "faster" than generating a new figure every time (which it is). I know there is tons of documentation on how to do what you describe, i.e. swap data in lines etc. I'd rather link to a good explanation than add a ton of "unrelated" content into my docs. I can add your suggestion to my example - but could you recommend a good introduction to this topic that I could link to underneath?

I can't think of one right now. I guess you could always open an issue on the Matplotlib tracker requesting that this information be collected somewhere; I think that would be a useful addition, as I certainly agree that updating data on preexisting artists is an advanced concept (it's basically the next step after the OO approach, which is already (sadly) not so well-known).
Also, I just realized that along an orthogonal axis :-), another thing that can speed up Matplotlib is the tricks described in https://matplotlib.org/tutorials/introductory/usage.html?highlight=simplify_threshold#performance.
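The tricks in that tutorial boil down to a few rcParams. For example (the specific values here are illustrative ones to tune, not recommendations from this thread):

```python
import matplotlib

# Path simplification merges nearly-collinear vertices before rendering.
matplotlib.rcParams["path.simplify"] = True
matplotlib.rcParams["path.simplify_threshold"] = 1.0  # 0.0 (off) .. 1.0 (max)

# Split very long lines into chunks to speed up the Agg renderer.
matplotlib.rcParams["agg.path.chunksize"] = 10000
```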

I spent a bit of time looking at the Windows failure but I still can't even repro it. In particular, it does not appear to be caused by the well-known numpy 1.19.4 + Windows bug, as reverting to numpy 1.19.3 does not fix it.
Having some way to dump an actual backtrace on Windows fatal exceptions (similar to install_abrt_handler(), but for Windows) would likely be very helpful...

s-m-e commented

Would a VM help? There are free images with dev tools offered by Microsoft. They're quite useful, actually.

Thanks, but I am already testing locally on an actual Windows machine :)

s-m-e commented

Well, it's a pretty "clean" system - very close to what Azure's CI systems run. Maybe it enables you to at least reproduce the error. (Extracting this kind of information from Azure directly is an annoying process, mildly put.)

@s-m-e I ended up releasing 0.4.0 while deactivating most of the Windows CI, as I still cannot repro the underlying issue but a new release was needed anyways for matplotlib 3.4 compatibility.