elixir-image/image

Test stability

Closed this issue · 5 comments

a8t commented

We have already had some discussion about tests, so I wanted to centralize discussion here.

If I were to summarize the major issue with tests for this codebase, I'd say that assertions are hard.

Right now, tests compare a calculated image against a pre-composed reference image. The current implementation compares the calculated image to the pre-comp pixel by pixel using Image.Math.==, which produces a third image with white pixels where the two inputs match and black pixels where they don't. Then we check how many are black vs white.
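To make the mechanics concrete, here is a minimal sketch of that assertion strategy, not the actual test helper. It assumes Image.Math.==/2 returns the white/black difference image described above; pixel_counts/1 is a hypothetical stand-in for whatever tallies matching vs. non-matching pixels.

```elixir
defmodule ImageAssertions do
  import ExUnit.Assertions

  # Sketch only: assumes Image.Math.==/2 returns a difference image in which
  # white marks matching pixels and black marks mismatching pixels.
  def assert_images_equal(calculated, precomp) do
    diff = Image.Math.==(calculated, precomp)
    %{white: matching, black: mismatching} = pixel_counts(diff)

    assert mismatching == 0,
           "images differ in #{mismatching} of #{matching + mismatching} pixels"
  end

  # Hypothetical helper: count white vs. black pixels in the difference image.
  defp pixel_counts(_diff_image), do: %{white: 0, black: 0}
end
```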

The problem is that our image similarity assertions sometimes give false failures on images that visually match (see Kip's comment below).

kipcole9 commented

Yes, this is tricky.

Thanks to your great work, image matching in tests is now more tolerant of the small differences that occur on systems that have different builds of libvips. Small differences are to be expected given the different potential underlying image container libraries.

However, the tests that use Image.thumbnail/2 seem to show much greater variance, even though visually the images match.

Perhaps one pragmatic approach is to allow a configurable tolerance threshold to be passed to assert_images_equal/2?
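For illustration only, that option might take a shape like the sketch below, revisiting the hypothetical helper from the earlier sketch. The :tolerance key and the pixel-counting helpers are assumptions, not the library's actual API.

```elixir
defmodule ImageAssertions do
  import ExUnit.Assertions

  # Hypothetical: accept a tolerance so that a small fraction of differing
  # pixels does not fail the assertion.
  def assert_images_equal(calculated, precomp, opts \\ []) do
    tolerance = Keyword.get(opts, :tolerance, 0.0)

    diff = Image.Math.==(calculated, precomp)
    mismatch_ratio = mismatching_pixels(diff) / total_pixels(diff)

    assert mismatch_ratio <= tolerance,
           "#{Float.round(mismatch_ratio * 100, 2)}% of pixels differ " <>
             "(allowed tolerance: #{tolerance * 100}%)"
  end

  # Hypothetical helpers standing in for the pixel-counting details.
  defp mismatching_pixels(_diff_image), do: 0
  defp total_pixels(_diff_image), do: 1
end
```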

a8t commented

The need to have a threshold indicates to me that the approach to testing is... not wrong, just not always right. "If all you have is a hammer, everything looks like a nail." If we allow configuring a lower and lower threshold, we are simply asking for a bigger and bigger hammer, instead of maybe reaching for a screwdriver.

In this metaphor that I'm drawing, I think that Image.Math.== is our hammer. The problem is that it has a very strict definition of equality, which is reasonable (i.e. we shouldn't change that definition just for our tests, of course). But if we are limited to using it for assertions, then we run into the trouble you described: the images are visually identical (to the human eye), but different enough by Image.Math.== that they fail the tests, and that violates our expectations.

So we have a few questions ahead of us. What's our screwdriver? What's an alternative way of testing image similarity that doesn't require pixel-by-pixel equality? I saw ideas about Gaussian blurring and then histogramming (see the sketch after this list); we can look into that. That's one question (A).

My other questions are:
(B) Do we actually need a more tolerant approach to image similarity? Is there some way to ensure that the test pre-comps are definitely built with the same container libraries that Image will use?

(C) Is image similarity the only way to test the correctness of the code? Is there some way to assert against, say, the Vips objects that are created, and offload correctness to their test suite?

It's hard to know which question to tackle first. The three are all different ways of solving the false failures problem, and there may be more.
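As a rough illustration of the blur-then-histogram idea from question (A): blur both images to wash out pixel-level noise, then compare coarse statistics rather than individual pixels. This is only a sketch under assumptions; Image.blur!/2 and its :sigma option are assumed to wrap libvips' Gaussian blur, and the histogram comparison is left as a hypothetical helper.

```elixir
defmodule SimilarityCheck do
  @sigma 5.0
  @threshold 0.98

  # Sketch of "Gaussian blur, then compare histograms". Assumes
  # Image.blur!/2 with a :sigma option; histogram_correlation/2 is
  # a hypothetical placeholder.
  def similar?(image_a, image_b) do
    # Blurring washes out per-pixel differences caused by different
    # libvips builds while preserving the overall structure of the image.
    blurred_a = Image.blur!(image_a, sigma: @sigma)
    blurred_b = Image.blur!(image_b, sigma: @sigma)

    histogram_correlation(blurred_a, blurred_b) >= @threshold
  end

  # Hypothetical: build per-channel histograms (e.g. via libvips' hist_find)
  # and return a correlation score between 0.0 and 1.0.
  defp histogram_correlation(_a, _b), do: 1.0
end
```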

a8t commented

I think part of the research (which I will dig into today) is to understand why, and in what way, the thumbnail tests score so low on image similarity.

kipcole9 commented

One way to ensure container compatibility is to follow the approach of Nx and OpenCV and build binaries using known and specific dependencies for libvips. This would also potentially improve security management and CVE mitigation. I've created an issue for this (and I'm definitely looking for help!).

I have adjusted the great work you did a little bit to use the sum of squares of differences between the two images, as documented in the libvips discussion. This resolves the matching for all but two tests. I will close this for now, and we can revisit when we have a better understanding of why those two tests show significant variance despite being visually similar.
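For reference, a squared-differences comparison can be expressed directly in terms of libvips operations. The sketch below is not the actual test helper: it uses the Vix wrappers for libvips' subtract, multiply and avg operations and computes the mean (rather than the raw sum) of squared differences, which serves the same purpose; the images_match?/3 helper and its threshold are illustrative assumptions.

```elixir
defmodule ImageDistance do
  alias Vix.Vips.Operation

  # Sketch: mean of squared per-pixel differences between two images,
  # built from libvips' subtract, multiply and avg operations via Vix.
  def mean_squared_difference(image_a, image_b) do
    with {:ok, diff} <- Operation.subtract(image_a, image_b),
         {:ok, squared} <- Operation.multiply(diff, diff),
         {:ok, mean} <- Operation.avg(squared) do
      {:ok, mean}
    end
  end

  # Hypothetical usage in an assertion: treat images as matching when the
  # mean squared difference is below a small, configurable threshold.
  def images_match?(image_a, image_b, threshold \\ 1.0) do
    case mean_squared_difference(image_a, image_b) do
      {:ok, mean} -> mean < threshold
      _other -> false
    end
  end
end
```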

Thanks again for all your support on this.