image/jpeg: Decode is slow
Mac OS Sierra
go version go1.10 darwin/amd64
CPU 3,5 GHz Intel Core i7
I noticed that decoding JPEG images with the standard library is very slow.
Test case: decoding a 1920x1080 JPEG image.
I compared github.com/pixiv/go-libjpeg/jpeg against the native image/jpeg package:
go 1.10 jpeg.decode ≈ 30 ms cpu ≈ 15 %
libjpeg jpeg.decode ≈ 7 ms cpu ≈ 4 %
Will it ever be as fast as other libraries?
Is it possible that the native implementation will become faster in future versions?
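For anyone who wants to reproduce the native number, here is a minimal timing sketch. It is not the code used for the figures above: the synthetic gradient image and the helper names `syntheticJPEG` and `timeDecode` are assumptions made for illustration, and a real photo will likely decode at a different speed.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	"image/color"
	"image/jpeg"
	"log"
	"time"
)

// syntheticJPEG encodes a w x h gradient as JPEG and returns the bytes.
// The gradient is a stand-in for a real photo (an assumption for this sketch).
func syntheticJPEG(w, h int) ([]byte, error) {
	img := image.NewRGBA(image.Rect(0, 0, w, h))
	for y := 0; y < h; y++ {
		for x := 0; x < w; x++ {
			img.SetRGBA(x, y, color.RGBA{R: uint8(x), G: uint8(y), B: uint8(x ^ y), A: 255})
		}
	}
	var buf bytes.Buffer
	if err := jpeg.Encode(&buf, img, &jpeg.Options{Quality: 90}); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// timeDecode decodes data n times with image/jpeg and returns the mean
// wall-clock time per decode plus the bounds of the decoded image.
func timeDecode(data []byte, n int) (time.Duration, image.Rectangle, error) {
	var bounds image.Rectangle
	start := time.Now()
	for i := 0; i < n; i++ {
		img, err := jpeg.Decode(bytes.NewReader(data))
		if err != nil {
			return 0, bounds, err
		}
		bounds = img.Bounds()
	}
	return time.Since(start) / time.Duration(n), bounds, nil
}

func main() {
	data, err := syntheticJPEG(1920, 1080)
	if err != nil {
		log.Fatal(err)
	}
	mean, bounds, err := timeDecode(data, 20)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("decoded %dx%d, mean %v per decode\n", bounds.Dx(), bounds.Dy(), mean)
}
```

A plain loop like this is only a rough check; a proper `testing.B` benchmark gives more trustworthy numbers.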
/cc @horgh @ericlagergren
BenchmarkJPEG-8 1 56942870038 ns/op 12443537192 B/op 1558 allocs/op
BenchmarkPixiv-8 1 19192858155 ns/op 4058312304 B/op 141851 allocs/op
But the difference in allocations is crazy. I am not sure which is worse.
Change https://golang.org/cl/125138 mentions this issue: image/jpeg: decomposes scan loops and pre-computes values
Performance does seem quite bad. Using stb_image (via cgo) seems to be much faster in a simple test I've done:
2018/11/08 14:58:33 Loaded ../resources/Free_Spring_Blossoms_on_Blue_Creative_Commons_(3457362713).jpg via image: 234.715104ms
2018/11/08 14:58:34 Loaded ../resources/Free_Unedited_Happy_Little_Yellow_Stars_in_Pink_Flower_Creative_Commons_(2898759838).jpg via image: 307.379436ms
vs:
2018/11/08 15:00:07 Loaded ../resources/Free_Spring_Blossoms_on_Blue_Creative_Commons_(3457362713).jpg via stb_image: 93.289255ms
2018/11/08 15:00:07 Loaded ../resources/Free_Unedited_Happy_Little_Yellow_Stars_in_Pink_Flower_Creative_Commons_(2898759838).jpg via stb_image: 167.929126ms
This wasn't a scientific test, but the results were markedly different between the two approaches, and consistently so (+/- 10ms here or there with a few repeats).
I cut what I have into a standalone benchmark, and pushed it up at https://github.com/rburchell/imgtestcase. In my case, the format conversion is costly, but loading the image itself is also far from fast.
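To make the "format conversion" cost concrete, here is a small sketch of the two steps involved. It is not taken from the imgtestcase repository; the helper names `sampleJPEG` and `decodeToRGBA` are made up for illustration. For an ordinary color JPEG, `jpeg.Decode` hands back an `*image.YCbCr`, and getting RGBA pixels means a second pass with `draw.Draw`, which as far as I can tell is where `DrawYCbCr` in the profile comes from.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	"image/color"
	"image/draw"
	"image/jpeg"
	"log"
)

// sampleJPEG builds a uniformly colored w x h image and encodes it as JPEG.
// It only exists so the sketch does not depend on a file on disk.
func sampleJPEG(w, h int) ([]byte, error) {
	img := image.NewRGBA(image.Rect(0, 0, w, h))
	fill := image.NewUniform(color.RGBA{R: 200, G: 80, B: 40, A: 255})
	draw.Draw(img, img.Bounds(), fill, image.Point{}, draw.Src)
	var buf bytes.Buffer
	if err := jpeg.Encode(&buf, img, nil); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decodeToRGBA decodes JPEG data, then converts the result to *image.RGBA.
// It also returns the concrete type that jpeg.Decode produced.
func decodeToRGBA(data []byte) (*image.RGBA, string, error) {
	src, err := jpeg.Decode(bytes.NewReader(data)) // step 1: decode
	if err != nil {
		return nil, "", err
	}
	dst := image.NewRGBA(src.Bounds())
	draw.Draw(dst, dst.Bounds(), src, src.Bounds().Min, draw.Src) // step 2: convert
	return dst, fmt.Sprintf("%T", src), nil
}

func main() {
	data, err := sampleJPEG(256, 256)
	if err != nil {
		log.Fatal(err)
	}
	rgba, typ, err := decodeToRGBA(data)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("jpeg.Decode returned %s; converted to %dx%d RGBA\n",
		typ, rgba.Bounds().Dx(), rgba.Bounds().Dy())
}
```

If the consumer can work with YCbCr data directly, the second step can be skipped entirely.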
A CPU profile of repeatedly loading images gives me this:
flat flat% sum% cum cum%
5340ms 19.51% 19.51% 5340ms 19.51% image/internal/imageutil.DrawYCbCr
5260ms 19.22% 38.73% 9290ms 33.94% image/jpeg.(*decoder).reconstructBlock
4030ms 14.72% 53.45% 4030ms 14.72% image/jpeg.idct
3310ms 12.09% 65.55% 20310ms 74.21% image/jpeg.(*decoder).processSOS
3180ms 11.62% 77.16% 4790ms 17.50% image/jpeg.(*decoder).decodeHuffman
2510ms 9.17% 86.34% 2640ms 9.65% image/jpeg.(*decoder).receiveExtend
920ms 3.36% 89.70% 1740ms 6.36% image/jpeg.(*decoder).ensureNBits
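A profile like the one above can be collected with `runtime/pprof`. The following is a sketch under assumptions, not the harness that produced those numbers: the helper names `sampleJPEG` and `profileDecode`, the blank test image, and the output file name `cpu.prof` are all chosen for illustration.

```go
package main

import (
	"bytes"
	"image"
	"image/jpeg"
	"log"
	"os"
	"runtime/pprof"
)

// sampleJPEG encodes a blank w x h image so the sketch needs no input file.
// A real photo would exercise the decoder much harder.
func sampleJPEG(w, h int) ([]byte, error) {
	var buf bytes.Buffer
	img := image.NewRGBA(image.Rect(0, 0, w, h))
	if err := jpeg.Encode(&buf, img, nil); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// profileDecode decodes data n times while a CPU profile is written to path.
func profileDecode(path string, data []byte, n int) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	if err := pprof.StartCPUProfile(f); err != nil {
		return err
	}
	defer pprof.StopCPUProfile() // runs before f.Close, flushing the profile
	for i := 0; i < n; i++ {
		if _, err := jpeg.Decode(bytes.NewReader(data)); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	data, err := sampleJPEG(1920, 1080)
	if err != nil {
		log.Fatal(err)
	}
	if err := profileDecode("cpu.prof", data, 200); err != nil {
		log.Fatal(err)
	}
	log.Println("wrote cpu.prof; inspect it with: go tool pprof -top cpu.prof")
}
```

The `flat` column is time spent in the function itself and `cum` includes its callees, which is why `processSOS` has a small `flat` share but the largest `cum` share.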
Thanks for creating the benchmarks!
This wasn't a scientific test,
We use benchstat for that. Could you please run each benchmark on a quiet machine, with browsers shut down, using -count=10, and then compare the runs with benchstat and post the results?
Sorry, no - I don't have any more time to spend on this. I posted the results on the README, but they are quite easy to run yourself. The benchmarks show the same trend as my original (application) numbers - loading using stb_image via cgo takes around half the time.
I took a look into it. The majority of the improvements in the C version come from SIMD (SSE2) support. If I disable SIMD and then run the benchmarks, the C version runs much slower, although the Go version is still slower than that.
With SIMD:
14:52:02-agniva-~/play/go/src/github.com/rburchell/imgtestcase$go test -run=xxx -bench=.
goos: linux
goarch: amd64
pkg: github.com/rburchell/imgtestcase
BenchmarkLoadImageOneSTB-4 20 82806154 ns/op
BenchmarkLoadImageOneGo-4 10 195602342 ns/op 235.85 MB/s
BenchmarkLoadImageTwoSTB-4 10 131362504 ns/op
BenchmarkLoadImageTwoGo-4 3 342820337 ns/op 117.59 MB/s
PASS
ok github.com/rburchell/imgtestcase 8.409s
Without SIMD:
14:52:31-agniva-~/play/go/src/github.com/rburchell/imgtestcase$go test -run=xxx -bench=.
goos: linux
goarch: amd64
pkg: github.com/rburchell/imgtestcase
BenchmarkLoadImageOneSTB-4 10 162488211 ns/op
BenchmarkLoadImageOneGo-4 10 195514068 ns/op 235.96 MB/s
BenchmarkLoadImageTwoSTB-4 5 204837289 ns/op
BenchmarkLoadImageTwoGo-4 3 343425094 ns/op 117.38 MB/s
PASS
ok github.com/rburchell/imgtestcase 7.246s
Other than that, some BCE (bounds check elimination) improvements can be made in idct and fdct, but they give an improvement of barely 40-50ns.
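For readers unfamiliar with BCE, here is a generic sketch of the idea. It is not the actual idct/fdct code and not what any CL does; the function names `rowSumNaive` and `rowSumHinted` and the flat 64-element layout are assumptions for illustration. Reslicing a row once up front may let the compiler replace eight per-access bounds checks with a single one.

```go
package main

import "fmt"

// rowSumNaive adds the eight coefficients of row y in an 8x8 block stored
// as a flat slice. Each blk[y*8+x] access may carry its own bounds check,
// because the compiler cannot prove that y*8+x is within len(blk).
func rowSumNaive(blk []int32, y int) int32 {
	var s int32
	for x := 0; x < 8; x++ {
		s += blk[y*8+x]
	}
	return s
}

// rowSumHinted reslices the row first. The slice expression is checked
// once; after that the compiler has a chance to prove that the constant
// indexes 0..7 are in range and drop the remaining checks.
func rowSumHinted(blk []int32, y int) int32 {
	row := blk[y*8 : y*8+8 : y*8+8]
	return row[0] + row[1] + row[2] + row[3] + row[4] + row[5] + row[6] + row[7]
}

func main() {
	blk := make([]int32, 64)
	for i := range blk {
		blk[i] = int32(i)
	}
	// Both variants must agree; only the generated code differs.
	fmt.Println(rowSumNaive(blk, 1), rowSumHinted(blk, 1))
}
```

Whether the checks are really gone depends on the compiler version; building with `-gcflags=-d=ssa/check_bce/debug=1` reports the bounds checks that remain.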
I don't see any major algorithmic improvements that can be applied. The math-heavy operations, such as Huffman decoding and the IDCT, already have fast paths, the same as in the C version.
There may be some logic tweaks that can be made in the functions that take the most time, such as:
1070ms 13.39% 44.56% 7000ms 87.61% image/jpeg.(*decoder).processSOS
|------>1100ms 13.77% 31.16% 1360ms 17.02% image/jpeg.(*decoder).refineNonZeroes
|------>870ms 10.89% 55.44% 1720ms 21.53% image/jpeg.(*decoder).reconstructBlock
|------>860ms 10.76% 66.21% 900ms 11.26% image/jpeg.(*decoder).receiveExtend
|------> 26.66% image/jpeg.(*decoder).decodeHuffman
Anybody is welcome to investigate and send CLs.
Change https://golang.org/cl/167417 mentions this issue: image/jpeg: reduce bound checks from idct and fdct
What is the assembly policy for the image/** packages? I'm working on an SSE2 implementation of the IDCT, but I'm wondering if there's a reason there are no .s files in image/**.
We usually stick to pure Go in high level packages like image/. In general, assembly is limited to crypto, math and highly specialized bytes and strings functions. Also see https://github.com/golang/go/wiki/AssemblyPolicy.
@nigeltao can say whether adding SIMD assembly is apt or not in image/.
First, the AssemblyPolicy (for Go packages overall) applies equally well to image/**.
Amongst other things, standard library code is often read by people learning Go, so for that code, we favor simplicity and readability over raw performance more than other Go packages might. Neither position is wrong, just a different trade-off.
If adding a small amount of SIMD assembly to the standard library gets you a 1.5x benchmark improvement, then I'd probably take it. If adding a large amount of SIMD assembly gets you a 1.05x benchmark improvement, then I'd probably reject it.
What "small" and "large" means is subjective. It's hard to say without a specific SIMD code listing.
Note also that, when I say benchmarks, I'm primarily concerned about overall decode / encode benchmarks, not just FDCT / IDCT benchmarks. Users want to decode a JPEG image; they don't want to run IDCTs directly.
For example, https://go-review.googlesource.com/c/go/+/167417/2//COMMIT_MSG shows a 1.03x benchmark improvement to IDCTs, but no significant change to the overall decode benchmarks. If that change was a SIMD assembly change, I'd reject it as being of too small a benefit, compared to the costs of supporting assembly.