r-lib/vdiffr

JFYI: matplotlib image differ tests

jankatins opened this issue · 4 comments

This is mainly JFYI because it came up on Twitter: matplotlib has a similar system in place for unit testing its images. It is also used in downstream packages like seaborn. The system is based on comparing raster images: it compares the rasterized output of the svg, tiff and ps backends to a baseline png which is included in the repo. Rasterization is done with ghostscript. I suspect that the rasterization step is there because SVGs can produce the same visual but have different internal representations (e.g. when plotting a point and a line, AFAIK the XML can contain point -> line or line -> point).

The workflow is:

  • write testcase with a name in a testfile
  • run once -> fails due to missing baseline images and produces a png image "result_images/testfile/name.png"
  • compare image with your expected image
  • If fine: copy the output to the baseline directory
  • run again -> baseline image is found and the plot is compared by drawing it on three backends, saving the results (png+ps+svg), rasterizing svg+ps and comparing each rasterized image to the baseline image.
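The first-run / accept / re-run cycle above can be sketched roughly as follows. This is only an illustration, not matplotlib's code: the helper names (`check_against_baseline`, `accept`, `demo`) and the byte-for-byte comparison are assumptions made to keep the example self-contained, and the "image" is just a stand-in byte string.

```python
import shutil
import tempfile
from pathlib import Path


def check_against_baseline(result: Path, baseline_dir: Path) -> str:
    """Compare a freshly rendered file with its stored baseline.

    Returns "missing" when no baseline exists yet, "match" when the
    bytes are identical, and "mismatch" otherwise.
    """
    baseline = baseline_dir / result.name
    if not baseline.exists():
        return "missing"
    same = baseline.read_bytes() == result.read_bytes()
    return "match" if same else "mismatch"


def accept(result: Path, baseline_dir: Path) -> None:
    """Promote a manually inspected result to be the new baseline."""
    baseline_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(result, baseline_dir / result.name)


def demo() -> list:
    """Walk through the first-run / accept / re-run cycle in a temp dir."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        result_dir = root / "result_images" / "testfile"
        baseline_dir = root / "baseline_images" / "testfile"
        result_dir.mkdir(parents=True)
        result = result_dir / "name.png"

        # Stand-in for a rendered plot (hypothetical content).
        result.write_bytes(b"fake image bytes v1")
        statuses = [check_against_baseline(result, baseline_dir)]  # first run
        accept(result, baseline_dir)  # done by hand after visual inspection
        statuses.append(check_against_baseline(result, baseline_dir))  # second run
        result.write_bytes(b"fake image bytes v2")  # the plot changed
        statuses.append(check_against_baseline(result, baseline_dir))
        return statuses
```

Calling `demo()` goes through a missing baseline, a matching run after acceptance, and a mismatch once the rendered output changes.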

From my experience with this:

  • The tests should try very hard to make the available installed fonts the same on all test systems (e.g. bitstream vera or something, which can be expected to be available on dev machines and on travis/...; remove any fallbacks in the config; matplotlib actually has a font embedded in the package to have a default)
  • The outputs are not always exactly the same across systems (e.g. different antialiasing strategies on linux/windows) -> matplotlib has a tolerance parameter for the comparison, but recently tried very hard to remove all non-zero values and was almost successful (though it got worse again when automatic windows tests were introduced).
  • mpl usually removes any text from a plot before it is drawn (a parameter to the comparison function), so different text rendering of axis labels on different systems is not what makes the tests fail...
  • If the tolerance is not zero, it's probably best to build plots which look ugly, e.g. by increasing the size of printed dots, because small dots can be at totally different positions than expected without this being registered, due to the tolerance...
  • To reproduce errors on travis/appveyor it's nice if the code spits out a directory which contains the images (+ baseline + diff + html with side-by-side placements of the images for visual inspection), so this can be uploaded (travis) or saved as an artifact (appveyor)
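To illustrate the point about small dots and non-zero tolerance, here is a minimal sketch. It assumes a root-mean-square pixel difference as the comparison metric (as far as I know, that is the kind of metric matplotlib's comparison uses, but the helpers below are hypothetical stand-ins and not its API). Images are plain lists of grayscale rows, and the tolerance of 5 is an arbitrary value chosen for the example.

```python
import math


def rms_difference(expected, actual):
    """Root-mean-square difference of two equally sized grayscale images.

    Images are lists of rows of integer pixel values (0-255).
    """
    if len(expected) != len(actual) or any(
        len(row_e) != len(row_a) for row_e, row_a in zip(expected, actual)
    ):
        raise ValueError("images must have the same dimensions")
    squared = [
        (e - a) ** 2
        for row_e, row_a in zip(expected, actual)
        for e, a in zip(row_e, row_a)
    ]
    return math.sqrt(sum(squared) / len(squared))


def with_dot(width, height, x, y, size):
    """White image with one black square 'dot' of the given size at (x, y)."""
    img = [[255] * width for _ in range(height)]
    for row in range(y, y + size):
        for col in range(x, x + size):
            img[row][col] = 0
    return img


TOLERANCE = 5  # arbitrary example value

# The same dot drawn in two completely different places.
small_moved = rms_difference(with_dot(100, 100, 10, 10, 1),
                             with_dot(100, 100, 80, 80, 1))
large_moved = rms_difference(with_dot(100, 100, 10, 10, 10),
                             with_dot(100, 100, 80, 80, 10))

small_passes = small_moved <= TOLERANCE  # misplaced 1px dot goes unnoticed
large_passes = large_moved <= TOLERANCE  # misplaced 10px dot is caught
```

With these numbers the misplaced one-pixel dot stays under the tolerance while the misplaced ten-pixel dot exceeds it, which is the reason for preferring "ugly" plots with large marks.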

A test looks like this:

import matplotlib.pyplot as plt
from matplotlib.testing.decorators import image_comparison


@image_comparison(baseline_images=['log_scales'], remove_text=True)
def test_log_scales():
    ax = plt.subplot(122, yscale='log', xscale='symlog')

    ax.axvline(24.1)
    ax.axhline(24.1)

-> This tests all three image formats (no extensions=['png'] given), uses a tolerance of 0 (no tol=x given) and removes the text. baseline_images is a list because you can have multiple plots in a test (which is IMO not a nice feature...).

The main part is here: https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/testing/compare.py#L268 (mpl is license="BSD")

CC: @hrbrmstr because twitter... :-)

nice. man, i wish there were some other way on both python and r to not use legacy linux font libs (i.e. a nice, modern, cross-platform font lib that support OTF wld be epic)

Thanks for your insights Jan.

My main goal with the initial release of vdiffr is to offer a convenient UI for writing visual tests with testthat and managing failed cases with a workflow based on a Shiny app:

[Image: vdiffr]

I chose to compare SVG files mainly for convenience. As good as svglite is, it does not offer a completely accurate rendition of R plots. But in most cases, complete accuracy is not necessary for the purpose of testing regressions. I wrote vdiffr with ggplot2 extensions in mind, which are more oriented towards data exploration than creating graphics for publication. The advantage of SVG is that I don't have to deal with tolerance.

It's certainly possible to add backends though. I like how you apply different testing strategies in one go.

@JanSchulz Winston's vtest uses ImageMagick's compare on raster images with a tolerance threshold, which seems to be closer to what you had in mind. See https://github.com/hadley/ggplot2/tree/master/visual_test for usage in ggplot2.

This is an old issue, but since it's still open I'll add my two cents: I have found the comparison of SVGs extremely valuable. The one thing I can do with SVGs that I can't do with raster images is diff the new image against the old one and hunt down exactly what has changed. I do this regularly, in particular when I don't see a difference visually but vdiffr tells me the images aren't the same. I find it helpful to understand why vdiffr thinks the images are different and what in the code changed to cause those differences. With raster images, you're mostly flying blind.

Example: this is a case where the visual tests failed because changes in the calculation of axis tick locations resulted in slightly different locations for the ticks and labels.
tidyverse/ggplot2@51c6d53#diff-c75903e4bd3c74e786f3b2825a1a804f
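For illustration, a text diff of two SVGs might look like the sketch below. The SVG snippets are made up for this example (a tick mark and its label shifting from x=30 to x=32) and are not taken from the linked commit; `changed_lines` is a hypothetical helper built on Python's standard `difflib`.

```python
import difflib

# Hypothetical "old" rendering: an axis line, one tick and its label.
OLD_SVG = """<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">
  <line x1="10" y1="90" x2="90" y2="90" />
  <line x1="30" y1="90" x2="30" y2="94" />
  <text x="30" y="99">2.5</text>
</svg>
"""

# Hypothetical "new" rendering: the tick and its label moved by two units.
NEW_SVG = (
    OLD_SVG
    .replace('x1="30" y1="90" x2="30"', 'x1="32" y1="90" x2="32"')
    .replace('<text x="30"', '<text x="32"')
)


def changed_lines(old, new):
    """Return only the added/removed lines of a unified diff, no headers."""
    diff = difflib.unified_diff(
        old.splitlines(), new.splitlines(), lineterm="", n=0
    )
    return [
        line
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


for line in changed_lines(OLD_SVG, NEW_SVG):
    print(line)
```

The diff points directly at the tick and label elements whose coordinates moved, which is the kind of information a raster comparison cannot give you.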