Performance
Ravenslofty opened this issue · 15 comments
While I can tell that this code focuses on accuracy, or something like it, it uses `float64` throughout, which is very slow.
I wrote a micro-benchmark to demonstrate:
```
$ go test -bench=.
BenchmarkColourDiffRgb-4               2000000000   0.84 ns/op
BenchmarkColourDiffLab-4                  1000000   1349 ns/op
BenchmarkColourDiffLabNoConversion-4      1000000   1318 ns/op
```
where ColourDiffRgb is this, ColourDiffLab is this and ColourDiffLabNoConversion is calling DistanceLab() between two colorful.Colors without converting from color.NRGBA.
Perhaps either use some approximations, or do math in either float32, or possibly a fixed-point format.
Thanks for sharing your findings and code! I agree that I have not designed the library with speed in mind, but correctness and readability.
To be honest, doing all the math, and even storage, in `float32` should still be precise enough to be invisible to the eye, especially given that usually either the input or the output has only 256 distinct values! But I decided to go with `float64` anyway because, except for vectorized expressions, most CPUs internally do computation in high precision regardless. I'd be curious to see the result of changing the whole code-base to `float32`, but don't have the time to do so myself.
Going from RGB to Lab passes through XYZ, which passes through LinearRGB. There is an implementation of FastLinearRGB in go-colorful, but one has to route through it explicitly oneself, via something like `XyzToLab(LinearRgbToXyz(col.LinearRgb()))` or so. I'm curious whether that is actually much faster at all, or not.
Doing everything in `uint32` space would definitely be faster, but I think that can get very hairy for complex colorspaces.
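One integer-space angle worth noting: since each input channel has only 256 distinct values, the expensive transfer function can be tabulated once up front. A hedged sketch in 16.16 fixed point (the table name and the scaling factor are my own choices, not anything from the library):

```go
package main

import (
	"fmt"
	"math"
)

// linearTab maps an 8-bit sRGB channel to linear light in 16.16 fixed point.
// With only 256 possible inputs, the Pow is paid once at startup, and the
// hot path becomes a single table lookup.
var linearTab [256]uint32

func init() {
	for i := range linearTab {
		v := float64(i) / 255.0
		var lin float64
		if v <= 0.04045 {
			lin = v / 12.92
		} else {
			lin = math.Pow((v+0.055)/1.055, 2.4)
		}
		linearTab[i] = uint32(lin*65536 + 0.5) // round to 16.16 fixed point
	}
}

func main() {
	fmt.Println(linearTab[0], linearTab[128], linearTab[255])
}
```

The rest of the pipeline (XYZ, Lab) is harder to tabulate because it mixes channels, which is where the "very hairy" part comes in.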
What would it take to close this issue, besides a full rewrite of the library? I guess we can also leave it open forever. Of course, if you find a low-hanging optimization opportunity, I'd happily accept a PR.
`float64` was probably the right call here, since all the functions in `math` use `float64`. I tried to convert everything to use `float32`, but got lost in a myriad of casts to and from functions in the `math` library.
Here are some more benchmarks. Note that RgbInt (same as Rgb last time) is slower here due to defeating a compiler optimisation, but it's still much faster than Lab.
```
BenchmarkColourDiffRgbInt-4             1000000000   2.46 ns/op
BenchmarkColourDiffRgbFloat-4            100000000   11.3 ns/op
BenchmarkColourDiffLab-4                   1000000   1140 ns/op
BenchmarkColourDiffLabNoConversion-4       1000000   1095 ns/op
BenchmarkColourDiffLabLinearRgb-4          1000000   1076 ns/op
BenchmarkColourDiffLabFastLinearRgb-4      1000000   1130 ns/op
```
RgbFloat is DistanceRgb().
LinearRgb is a bit faster, but not much. FastLinearRgb() turns out to be slower(!).
I think for my use case, I do need a rewrite of the library. The question is how to do it in integer space.
Uhm, `FastLinearRgb` being slower than `LinearRgb` is surprising/fishy. `FastLinearRgb` has only a `Pow`, while `LinearRgb` has a `Pow` but also an additional `if`, `/` and `+`. Unless most calls to `LinearRgb` actually take the `if` branch, in which case they avoid the `Pow`, but that should happen for a small minority of the colors only.
In integer space, either use a fixed-point library, or use `uint32` for high precision and good speed. I believe most conversions in the library only use simple `+ - * /` with just a very few `Pow`, `Sqrt` and `Cbrt` calls.
Another opportunity for speed-up would be vectorization and use of SIMD instructions. Especially if you're interested in mainly one use-case in a tight loop, I feel like it should be possible to implement it efficiently in SIMD. I'm curious what you will come up with!
So as an experiment, I changed `FastLinearRgb` to a further approximation of just squaring the colours. This should be equivalent to a gamma of 2.0, which is not ideal, but it does have its benefits:
```
BenchmarkColourDiffRgbInt-4               500000000   2.93 ns/op
BenchmarkColourDiffRgbFloat-4             100000000   13.5 ns/op
BenchmarkColourDiffLab-4                    1000000   1375 ns/op
BenchmarkColourDiffLabNoConversion-4        1000000   1328 ns/op
BenchmarkColourDiffLabLinearRgb-4           1000000   1277 ns/op
BenchmarkColourDiffLabFastLinearRgb-4       1000000   1380 ns/op
BenchmarkColourDiffLabFasterLinearRgb-4     3000000    439 ns/op
```
So I think the question is "How inaccurate is a gamma of 2.0?", because that's about a 3x speedup from avoiding three calls to `math.Pow`.
Edit: I actually plotted the 2.0 gamma curve. It's ~15% off in mean-absolute error compared to your 10%, but actually more accurate than your approximation in mean-squared error. Go figure.
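The MAE/MSE comparison can be reproduced by evaluating both curves on a grid; a sketch (the metric definitions are the usual ones, and the grid resolution is an arbitrary choice):

```go
package main

import (
	"fmt"
	"math"
)

// srgbLinear is the exact piecewise sRGB linearization used as ground truth.
func srgbLinear(v float64) float64 {
	if v <= 0.04045 {
		return v / 12.92
	}
	return math.Pow((v+0.055)/1.055, 2.4)
}

// maeMSE evaluates mean-absolute and mean-squared error of an approximation
// against the exact sRGB curve on an even grid over [0,1].
func maeMSE(approx func(float64) float64, n int) (mae, mse float64) {
	for i := 0; i <= n; i++ {
		v := float64(i) / float64(n)
		d := approx(v) - srgbLinear(v)
		mae += math.Abs(d)
		mse += d * d
	}
	k := float64(n + 1)
	return mae / k, mse / k
}

func main() {
	mae, mse := maeMSE(func(v float64) float64 { return v * v }, 1000)
	fmt.Printf("gamma-2.0 vs exact: MAE=%.5f MSE=%.6f\n", mae, mse)
}
```

MAE penalizes all deviations equally, while MSE weights the extremes more heavily, which is why the two metrics can rank approximations differently, as observed above.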
As a full disclaimer, I didn't come up with the `FastLinearRgb` approximation, but I also don't remember where I got it from. It's quite unusual for me not to cite the source in the comments.
Thanks to you sharing your benchmarking code, I just played around with this a little. I computed a 4th-order Taylor expansion of `LinearRgb`, which is more accurate than `FastLinearRgb` (4x more in MAE and 2.6x more in MSE; the gap between MAE and MSE, in your case too, comes from the extremes) and it is about 4x faster.
Here are my benchmark results (Faster = squaring, Fastest = 4th-order Taylor):
```
BenchmarkColourDiffRgbInt-8                2000000000   1.24 ns/op
BenchmarkColourDiffRgbFloat-8               300000000   4.21 ns/op
BenchmarkColourDiffLab-8                      2000000    691 ns/op
BenchmarkColourDiffLabNoConversion-8          2000000    678 ns/op
BenchmarkColourDiffLabLinearRgb-8             2000000    612 ns/op
BenchmarkColourDiffLabFastLinearRgb-8         2000000    692 ns/op
BenchmarkColourDiffLabFasterLinearRgb-8      10000000    145 ns/op
BenchmarkColourDiffLabFastestLinearRgb-8     10000000    166 ns/op
```
With this, I will now also compute the Taylor approximation of the inverse, and completely replace the `Fast` implementation in the library, since as-is, it's pretty useless. I will include the notebook for computing the approximation constants and the error in the repo too.
In the meantime, here's my current implementation:
```go
func linearize_fast(v float64) float64 {
	v1 := v - 0.5
	v2 := v1 * v1
	v3 := v2 * v1
	return -0.248750514614486 + 0.925583310193438*v + 1.16740237321695*v2 + 0.280457026598666*v3
}

func (col Color) FastestLinearRgb() (r, g, b float64) {
	r = linearize_fast(col.R)
	g = linearize_fast(col.G)
	b = linearize_fast(col.B)
	return
}
```
And here are the curves and errors:
A zoom into the middle (the Taylor expansions and the original (red) line fit perfectly):
An order-6 Taylor is actually not much slower (10 ns/op more) but much more precise, so I might implement that one instead.
See the related pull request. I'm going to bed now and will finish documentation/derivation during the weekend before merging it. Please have a look and let me know what you think.
Merged! I'm closing this issue now, as it's quite an improvement without the need to dramatically change the library, but if there's more, feel free to re-open!
Conversion to L*a*b* is still very slow, but it's feasible to render something now.
Feasible enough to render this, in fact:
Sure, I think RGB wins here by being prettier and 30 times faster:
I feel there is hope, however. Taylor approximation has opened my eyes here, and I think I might be able to get what I want by directly computing the RGB to L*a*b* assuming a D65 colour space.
Yes, for special cases, I think approximating the whole transform is the way to go. I think you could use the notebook relatively easily for that too.
If you come up with a much faster, but still pretty good, approximation, I will happily accept a PR.
Both of those pictures are beautiful, thank you for sharing them here!
Hi guys, just chipping in. What proportion of the time is taken by converting Xyz to Lab before doing the actual difference between two Lab colors?
The way the stdlib implements its colors is that there is a struct per color space. Here everything is stored in Xyz and then converted back to the proper color space every time.
If we had a struct `colorful.Lab` storing `l`, `a` and `b`, we could calculate the distance more easily between two Lab colors without doing any conversion first. But it involves a rewrite of the lib and some thinking...
@ZirconiumX If you're looking into absolute performance, I suggest you write some custom code for your needs.
I see where you're going with different types for the color spaces. It is an interesting suggestion and the resulting library could be quite interesting actually! Though that's too much work for me now, and even more work to do in a backwards-compatible way.
Currently, it's up to you not to convert the colors every single time if you can avoid it. You can use the library to convert your colors to Lab, store the `l`, `a` and `b` values wherever, and do the comparison later. But I agree with you that top performance for one specific conversion is a different beast and requires custom code.
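The convert-once pattern can be sketched without any new library types; here the Euclidean Lab distance (the standard CIE76 delta-E) is computed by hand on stored components (the `lab` struct is my own illustration, not a library type):

```go
package main

import (
	"fmt"
	"math"
)

// lab holds a colour already converted to L*a*b*, so repeated
// distance computations skip the expensive RGB→XYZ→Lab conversion.
type lab struct{ L, A, B float64 }

// distance is the plain Euclidean (CIE76) delta-E between two Lab colours.
func distance(p, q lab) float64 {
	dl, da, db := p.L-q.L, p.A-q.A, p.B-q.B
	return math.Sqrt(dl*dl + da*da + db*db)
}

func main() {
	// In practice these values would come from a one-time library conversion.
	white := lab{L: 1, A: 0, B: 0}
	black := lab{L: 0, A: 0, B: 0}
	fmt.Println(distance(white, black)) // 1
}
```

Amortizing the conversion this way leaves only a few multiplies and a square root per comparison in the hot loop.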
Well, the issue with said custom code is it's a massive eyesore.
Here's my code for a 2nd order Taylor series from linear RGB to L*a*b*. I didn't feel capable of making the 4th order Taylor series readable.
```go
// Convert from linear RGB to CIE L*a*b* colour space via a 2nd order Taylor series.
// It looks horrific.
func (c Color) fastLab() (l, a, b float64) {
	r1, g1, b1 := c.R-0.5, c.G-0.5, c.B-0.5
	r2, g2, b2 := r1*r1, g1*g1, b1*b1
	l = 0.08353953840061758*b2 - (0.1488331284221014*c.B-0.0744165642110507)*g1 +
		0.1648123392330386*g2 + (-0.0442601911000214*c.B-0.438555913233124*c.G+0.24140805216657268)*r1 +
		0.2030808998232748*r2 +
		0.0721895235504632*c.B + 0.7152961078498866*c.G + 0.2127156955053038*c.R +
		0.03378900708797039
	a = 0.764133475692812*b2 - (0.7899279687584632*c.B-0.3949639843792316)*g1 +
		1.064211606797602*g2 + (-1.02100612767776*c.B-1.720893065989955*c.G+1.3709495968338574)*r1 +
		1.055288756843222*r2 +
		0.7844013275154139*c.B + 1.061153953700172*c.G + 1.751917661948025*c.R +
		0.6970103827379495
	b = 0.139473739305493*b2 + (0.1921736638486771*c.B-0.09608683192433855)*g1 +
		0.08734031911725595*g2 + (-0.003514546282512035*c.B-0.7470004222339538*c.G+0.3752574842582329)*r1 +
		0.3143891636818394*r2 -
		1.297608304391595*c.B + 1.054905483418373*c.G + 0.3378191791989842*c.R +
		0.01842879440624068
	// If you still have eyes at this point, congratulations!
	return
}
```
This fails accuracy tests, so I would not recommend using it.
Attached is a Sage Jupyter notebook (SymPy is far too slow for this, but Sage is lightning fast).
Thanks for sharing, also your notebook! This is what most numerical code looks like, by the way.
Like, bits are close friends of mine (see something like Dorpsgek). But this math? Ugh.
I think the 6th order Taylor series might be efficient enough, or you could investigate Pade approximants.
In any case, this is very much math beyond my understanding.
Yes there are definitely faster/better approximations around, I just went with Taylor because I'm used to it and it seemed good enough for this use-case :)