Performance
Ravenslofty opened this issue · 15 comments
While I can tell that this code focuses on accuracy, or something like it, it uses `float64` throughout, which is very slow.
I wrote a micro-benchmark to demonstrate:
```
$ go test -bench=.
BenchmarkColourDiffRgb-4               2000000000   0.84 ns/op
BenchmarkColourDiffLab-4                  1000000   1349 ns/op
BenchmarkColourDiffLabNoConversion-4      1000000   1318 ns/op
```
where ColourDiffRgb is this, ColourDiffLab is this and ColourDiffLabNoConversion is calling DistanceLab() between two colorful.Colors without converting from color.NRGBA.
Perhaps either use some approximations, or do math in either float32, or possibly a fixed-point format.
Thanks for sharing your findings and code! I agree that I have not designed the library with speed in mind, but correctness and readability.
To be honest, doing all the math, and even storage, in `float32` should still be precise enough to be invisible to the eye, especially given that usually either the input or the output has only 256 distinct values! But I decided to go with `float64` anyway because, except for vectorized expressions, most CPUs internally do computation in high precision regardless. I'd be curious to see the result of changing the whole code-base to `float32`, but don't have the time to do so myself.
Going from RGB to Lab passes through XYZ, which passes through LinearRGB. There is an implementation of FastLinearRGB in go-colorful, but one has to route through it explicitly oneself, via something like `XyzToLab(LinearRgbToXyz(col.LinearRgb()))` or so. I'm curious whether that is actually much faster at all, or not.
Doing everything in `uint32` space would definitely be faster, but I think that can get very hairy for complex colorspaces.
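One integer-space angle worth noting: since each input channel has only 256 distinct values, the expensive transfer function can be tabulated once up front. A hedged sketch in 16.16 fixed point (the table name and the scaling factor are my own choices, not anything from the library):

```go
package main

import (
	"fmt"
	"math"
)

// linearTab maps an 8-bit sRGB channel to linear light in 16.16 fixed point.
// With only 256 possible inputs, the Pow is paid once at startup, and the
// hot path becomes a single table lookup.
var linearTab [256]uint32

func init() {
	for i := range linearTab {
		v := float64(i) / 255.0
		var lin float64
		if v <= 0.04045 {
			lin = v / 12.92
		} else {
			lin = math.Pow((v+0.055)/1.055, 2.4)
		}
		linearTab[i] = uint32(lin*65536 + 0.5) // round to 16.16 fixed point
	}
}

func main() {
	fmt.Println(linearTab[0], linearTab[128], linearTab[255])
}
```

The rest of the pipeline (XYZ, Lab) is harder to tabulate because it mixes channels, which is where the "very hairy" part comes in.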
What would it take to close this issue, besides a full rewrite of the library? I guess we can also leave it open forever. Of course, if you find a low-hanging optimization opportunity, I'd happily accept a PR.
`float64` was probably the right call here, since all the functions in `math` use `float64`. I tried to convert everything to use `float32`, but got lost in a myriad of casts to and from functions in the `math` library.
Here are some more benchmarks. Note that RgbInt (same as Rgb last time) is slower here due to defeating a compiler optimisation, but it's still much faster than Lab.
```
BenchmarkColourDiffRgbInt-4             1000000000   2.46 ns/op
BenchmarkColourDiffRgbFloat-4            100000000   11.3 ns/op
BenchmarkColourDiffLab-4                   1000000   1140 ns/op
BenchmarkColourDiffLabNoConversion-4       1000000   1095 ns/op
BenchmarkColourDiffLabLinearRgb-4          1000000   1076 ns/op
BenchmarkColourDiffLabFastLinearRgb-4      1000000   1130 ns/op
```
RgbFloat is DistanceRgb().
LinearRgb is a bit faster, but not much. FastLinearRgb() turns out to be slower(!).
I think for my use case, I do need a rewrite of the library. The question is how to do it in integer space.
Uhm, `FastLinearRgb` being slower than `LinearRgb` is surprising/fishy. `FastLinearRgb` has only a `Pow`, while `LinearRgb` has a `Pow` but also an additional `if`, `/` and `+`. Unless most calls to `LinearRgb` actually take the `if` branch, in which case they avoid the `Pow`, but that should happen for a small minority of the colors only.
In integer space, either use a fixed-point library, or use `uint32` for high precision and good speed. I believe most conversions in the library only use simple `+ - * /` with just a very few `Pow`, `Sqrt` and `Cbrt` calls.
Another opportunity for speed-up would be vectorization and use of SIMD instructions. Especially if you're interested in mainly one use-case in a tight loop, I feel like it should be possible to implement it efficiently in SIMD. I'm curious what you will come up with!
So as an experiment, I changed `FastLinearRgb` to a further approximation of just squaring the colours. This should be equivalent to a gamma of 2.0, which is not ideal, but it does have its benefits:
```
BenchmarkColourDiffRgbInt-4               500000000   2.93 ns/op
BenchmarkColourDiffRgbFloat-4             100000000   13.5 ns/op
BenchmarkColourDiffLab-4                    1000000   1375 ns/op
BenchmarkColourDiffLabNoConversion-4        1000000   1328 ns/op
BenchmarkColourDiffLabLinearRgb-4           1000000   1277 ns/op
BenchmarkColourDiffLabFastLinearRgb-4       1000000   1380 ns/op
BenchmarkColourDiffLabFasterLinearRgb-4     3000000    439 ns/op
```
So I think the question is "How inaccurate is a gamma of 2.0?", because that's about a 3x speedup from avoiding three calls to `math.Pow`.
Edit: I actually plotted the 2.0 gamma curve. It's ~15% off in mean-absolute error compared to your 10%, but actually more accurate than your approximation in mean-squared error. Go figure.
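The MAE/MSE comparison can be reproduced by evaluating both curves on a grid; a sketch (the metric definitions are the usual ones, and the grid resolution is an arbitrary choice):

```go
package main

import (
	"fmt"
	"math"
)

// srgbLinear is the exact piecewise sRGB linearization used as ground truth.
func srgbLinear(v float64) float64 {
	if v <= 0.04045 {
		return v / 12.92
	}
	return math.Pow((v+0.055)/1.055, 2.4)
}

// maeMSE evaluates mean-absolute and mean-squared error of an approximation
// against the exact sRGB curve on an even grid over [0,1].
func maeMSE(approx func(float64) float64, n int) (mae, mse float64) {
	for i := 0; i <= n; i++ {
		v := float64(i) / float64(n)
		d := approx(v) - srgbLinear(v)
		mae += math.Abs(d)
		mse += d * d
	}
	k := float64(n + 1)
	return mae / k, mse / k
}

func main() {
	mae, mse := maeMSE(func(v float64) float64 { return v * v }, 1000)
	fmt.Printf("gamma-2.0 vs exact: MAE=%.5f MSE=%.6f\n", mae, mse)
}
```

MAE penalizes all deviations equally, while MSE weights the extremes more heavily, which is why the two metrics can rank approximations differently, as observed above.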
As a full disclaimer, I didn't come up with the `FastLinearRgb` approximation, but I also don't remember where I got it from. It's quite unusual for me not to cite the source in the comments.
Thanks to you sharing your benchmarking code, I just played around with this a little. I computed a 4th-order Taylor expansion of `LinearRgb`, which is more accurate than `FastLinearRgb` (4x more in MAE and 2.6x more in MSE; the gap between MAE and MSE, in your case too, comes from the extremes) and it is about 4x faster.
Here are my benchmark results (Faster = squaring, Fastest = 4th-order Taylor):
```
BenchmarkColourDiffRgbInt-8                2000000000   1.24 ns/op
BenchmarkColourDiffRgbFloat-8               300000000   4.21 ns/op
BenchmarkColourDiffLab-8                      2000000    691 ns/op
BenchmarkColourDiffLabNoConversion-8          2000000    678 ns/op
BenchmarkColourDiffLabLinearRgb-8             2000000    612 ns/op
BenchmarkColourDiffLabFastLinearRgb-8         2000000    692 ns/op
BenchmarkColourDiffLabFasterLinearRgb-8      10000000    145 ns/op
BenchmarkColourDiffLabFastestLinearRgb-8     10000000    166 ns/op
```
With this, I will now also compute the Taylor approximation of the inverse, and completely replace the `Fast` implementation in the library, since as-is, it's pretty useless. I will include the notebook for computing the approximation constants and the error in the repo too.
In the meantime, here's my current implementation:
```go
func linearize_fast(v float64) float64 {
	v1 := v - 0.5
	v2 := v1 * v1
	v3 := v2 * v1
	return -0.248750514614486 + 0.925583310193438*v + 1.16740237321695*v2 + 0.280457026598666*v3
}

func (col Color) FastestLinearRgb() (r, g, b float64) {
	r = linearize_fast(col.R)
	g = linearize_fast(col.G)
	b = linearize_fast(col.B)
	return
}
```
And here are the curves and errors:
A zoom into the middle (the Taylor expansions and the original (red) line fit perfectly):
An order-6 Taylor is actually not much slower (10 ns/op more) but much more precise, so I might implement that one instead.
See the related pull request. I'm going to bed now and will finish documentation/derivation during the weekend before merging it. Please have a look and let me know what you think.
Merged! I'm closing this issue now, as it's quite an improvement without the need to dramatically change the library, but if there's more, feel free to re-open!
Conversion to L*a*b* is still very slow, but it's feasible to render something now.
Feasible enough to render this, in fact:
Sure, I think RGB wins here by being prettier and 30 times faster:
I feel there is hope, however. Taylor approximation has opened my eyes here, and I think I might be able to get what I want by directly computing the RGB to L*a*b* assuming a D65 colour space.
Yes, for special cases, I think approximating the whole transform is the way to go. I think you could use the notebook relatively easily for that too.
If you come up with a much faster, but still pretty good, approximation, I will happily accept a PR.
Both of those pictures are beautiful, thank you for sharing them here!
Hi guys, just chipping in. What proportion of the time is taken by converting Xyz to Lab before doing the actual difference between two Lab colors?
The way the stdlib implements its colors is that there is a struct per color space. Here everything is stored in Xyz and then converted back to the proper color space every time.
If we had a struct `colorful.Lab` storing `l`, `a` and `b`, we could calculate the distance more easily between two Lab colors without doing any conversion first. But it involves a rewrite of the lib and some thinking...
@ZirconiumX If you're looking into absolute performance, I suggest you write some custom code for your needs.
I see where you're going with different types for the color spaces. It is an interesting suggestion and the resulting library could be quite interesting actually! Though that's too much work for me now, and even more work to do in a backwards-compatible way.
Currently, it's up to you not to convert the colors every single time if you can avoid it. You can use the library to convert your colors to Lab, store the `l`, `a` and `b` values wherever, and do the comparison later. But I agree with you that top performance for one specific conversion is a different beast and requires custom code.
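The convert-once pattern can be sketched without any new library types; here the Euclidean Lab distance (the standard CIE76 delta-E) is computed by hand on stored components (the `lab` struct is my own illustration, not a library type):

```go
package main

import (
	"fmt"
	"math"
)

// lab holds a colour already converted to L*a*b*, so repeated
// distance computations skip the expensive RGB→XYZ→Lab conversion.
type lab struct{ L, A, B float64 }

// distance is the plain Euclidean (CIE76) delta-E between two Lab colours.
func distance(p, q lab) float64 {
	dl, da, db := p.L-q.L, p.A-q.A, p.B-q.B
	return math.Sqrt(dl*dl + da*da + db*db)
}

func main() {
	// In practice these values would come from a one-time library conversion.
	white := lab{L: 1, A: 0, B: 0}
	black := lab{L: 0, A: 0, B: 0}
	fmt.Println(distance(white, black)) // 1
}
```

Amortizing the conversion this way leaves only a few multiplies and a square root per comparison in the hot loop.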
Well, the issue with said custom code is it's a massive eyesore.
Here's my code for a 2nd order Taylor series from linear RGB to L*a*b*. I didn't feel capable of making the 4th order Taylor series readable.
```go
// Convert from linear RGB to CIE L*a*b* colour space via a 2nd order Taylor series.
// It looks horrific.
func (c Color) fastLab() (l, a, b float64) {
	r1, g1, b1 := c.R-0.5, c.G-0.5, c.B-0.5
	r2, g2, b2 := r1*r1, g1*g1, b1*b1
	l = 0.08353953840061758*b2 - (0.1488331284221014*c.B-0.0744165642110507)*g1 +
		0.1648123392330386*g2 + (-0.0442601911000214*c.B-0.438555913233124*c.G+0.24140805216657268)*r1 +
		0.2030808998232748*r2 +
		0.0721895235504632*c.B + 0.7152961078498866*c.G + 0.2127156955053038*c.R +
		0.03378900708797039
	a = 0.764133475692812*b2 - (0.7899279687584632*c.B-0.3949639843792316)*g1 +
		1.064211606797602*g2 + (-1.02100612767776*c.B-1.720893065989955*c.G+1.3709495968338574)*r1 +
		1.055288756843222*r2 +
		0.7844013275154139*c.B + 1.061153953700172*c.G + 1.751917661948025*c.R +
		0.6970103827379495
	b = 0.139473739305493*b2 + (0.1921736638486771*c.B-0.09608683192433855)*g1 +
		0.08734031911725595*g2 + (-0.003514546282512035*c.B-0.7470004222339538*c.G+0.3752574842582329)*r1 +
		0.3143891636818394*r2 -
		1.297608304391595*c.B + 1.054905483418373*c.G + 0.3378191791989842*c.R +
		0.01842879440624068
	// If you still have eyes at this point, congratulations!
	return
}
```
This fails accuracy tests, so I would not recommend using it.
Attached is a Sage Jupyter notebook (SymPy is far too slow for this, but Sage is lightning fast).
Thanks for sharing, also your notebook! This is what most numerical code looks like, by the way.
Like, bits are close friends of mine (see something like Dorpsgek). But this math? Ugh.
I think the 6th order Taylor series might be efficient enough, or you could investigate Pade approximants.
In any case, this is very much math beyond my understanding.
Yes there are definitely faster/better approximations around, I just went with Taylor because I'm used to it and it seemed good enough for this use-case :)