golang-design/research

pointer-params test case behaves differently on Windows

Wuvist opened this issue · 7 comments

In pointer-params.md, it says

func (v *vec2) add(u *vec2) *vec2 {
	v.x += u.x
	v.y += u.y
	v.z += u.z
	v.w += u.w
	return v
}

is slower than vec1 due to "inlining". But no, the main reason it is slow is that it has more lines of instructions; just change the function to:

func (v *vec2) add(u *vec2) *vec2 {
	v.x, v.y, v.z, v.w = v.x+u.x, v.y+u.y, v.z+u.z, v.w+u.w
	return v
}

And vec2 outperforms vec1 in both the inline and noinline cases.

Hi, thanks for the question.

Unfortunately, I am not able to reproduce your test case. Your suggested vec2.add implementation (see vec3 below) does not show any difference from the existing vec2, even with a different number of cores:

package main

import "testing"

type vec1 struct {
        x, y, z, w float64
}

func (v vec1) add(u vec1) vec1 {
        return vec1{v.x + u.x, v.y + u.y, v.z + u.z, v.w + u.w}
}

type vec2 struct {
        x, y, z, w float64
}

func (v *vec2) add(u *vec2) *vec2 {
        v.x += u.x
        v.y += u.y
        v.z += u.z
        v.w += u.w
        return v
}

type vec3 struct {
        x, y, z, w float64
}

func (v *vec3) add(u *vec3) *vec3 {
        v.x, v.y, v.z, v.w = v.x+u.x, v.y+u.y, v.z+u.z, v.w+u.w
        return v
}

func BenchmarkVec(b *testing.B) {
        b.ReportAllocs()
        b.Run("vec1", func(b *testing.B) {
                v1 := vec1{1, 2, 3, 4}
                v2 := vec1{4, 5, 6, 7}
                b.ResetTimer()
                for i := 0; i < b.N; i++ {
                        if i%2 == 0 {
                                v1 = v1.add(v2)
                        } else {
                                v2 = v2.add(v1)
                        }
                }
        })
        b.Run("vec2", func(b *testing.B) {
                v1 := vec2{1, 2, 3, 4}
                v2 := vec2{4, 5, 6, 7}
                b.ResetTimer()
                for i := 0; i < b.N; i++ {
                        if i%2 == 0 {
                                v1.add(&v2)
                        } else {
                                v2.add(&v1)
                        }
                }
        })
        b.Run("vec3", func(b *testing.B) {
                v1 := vec3{1, 2, 3, 4}
                v2 := vec3{4, 5, 6, 7}
                b.ResetTimer()
                for i := 0; i < b.N; i++ {
                        if i%2 == 0 {
                                v1.add(&v2)
                        } else {
                                v2.add(&v1)
                        }
                }
        })
}
$ benchstat b.txt
name          time/op
Vec/vec1      0.49ns ± 1%
Vec/vec1-2    0.49ns ± 1%
Vec/vec1-4    0.50ns ± 1%
Vec/vec1-8    0.50ns ± 2%
Vec/vec1-16   0.49ns ± 0%
Vec/vec1-32   0.49ns ± 1%
Vec/vec1-64   0.50ns ± 1%
Vec/vec1-128  0.49ns ± 0%
Vec/vec2      2.20ns ± 0%
Vec/vec2-2    2.20ns ± 0%
Vec/vec2-4    2.20ns ± 0%
Vec/vec2-8    2.20ns ± 0%
Vec/vec2-16   2.20ns ± 0%
Vec/vec2-32   2.20ns ± 0%
Vec/vec2-64   2.20ns ± 0%
Vec/vec2-128  2.20ns ± 0%
Vec/vec3      2.20ns ± 0%
Vec/vec3-2    2.20ns ± 0%
Vec/vec3-4    2.20ns ± 0%
Vec/vec3-8    2.20ns ± 0%
Vec/vec3-16   2.20ns ± 0%
Vec/vec3-32   2.20ns ± 0%
Vec/vec3-64   2.20ns ± 0%
Vec/vec3-128  2.20ns ± 0%

$ go version
go version go1.15.3 linux/amd64

$ inxi -C
CPU:       Topology: 8-Core model: Intel Core i9-9900K bits: 64 type: MT MCP L2 cache: 16.0 MiB
           Speed: 800 MHz min/max: 800/5000 MHz Core speeds (MHz): 1: 842 2: 844 3: 1906 4: 1254 5: 1239
           6: 1569 7: 1197 8: 3851 9: 4314 10: 4724 11: 4666 12: 1578 13: 4639 14: 4416 15: 4519 16: 1484

I don't know how you did your benchmarks, but be careful with these micro-benchmarks: run them in a controlled environment, lock your CPU clock frequency, and so on. See https://golang.design/s/gobench for more details.

If you still produce the same result, it would be fascinating to me and all of us, and I would appreciate it if you could attach a detailed procedure describing how you ran your benchmark.

Your suggested change indeed produces fewer instructions, but the core reason is that the pointer version uses a different assembly addressing mode, which is described in https://github.com/golang-design/research/blob/master/pointer-params.md#unoptimized-move-semantics.

Maybe the section was not well written (it gives the impression that the additional cost comes from the additional MOVQ instructions, but the actual reason is described in the second-to-last paragraph); I will think about it and do some rewording.

My testing code is at: https://gist.github.com/Wuvist/214691d698d0bdb850b1888016a3cabe

Originally I ran the tests using go1.14.2 windows/amd64,

where the one-line assignment made a significant difference and vec2 outperformed vec1:
[screenshot of benchmark results]

BUT, after I upgraded to go1.15.3 windows/amd64, the one-line vs. four-line assignment no longer makes a difference, and vec1 performs better:
[screenshot of benchmark results]

I guess some compiler optimizations have been added in recent Go versions.

Another thing is that moving the variable declaration also has a significant performance impact on vec2: compare
https://gist.github.com/Wuvist/214691d698d0bdb850b1888016a3cabe#file-vec_test-go-L41 with your version.

I guess this is due to pointer escape analysis.

Hi, thanks for the further tests.

Unfortunately, I still cannot reproduce your result. I benchmarked go1.14.10 and go1.15.3, and the latter performs even worse than the 1.14 release:

$ benchstat bench114.txt bench115.txt
name         old time/op    new time/op    delta
Vec/add1-16    0.25ns ± 1%    0.49ns ± 1%  +99.55%  (p=0.000 n=10+10)
Vec/add2-16    2.20ns ± 0%    2.20ns ± 0%     ~     (all equal)
Vec/add3-16    2.20ns ± 0%    2.20ns ± 0%     ~     (all equal)

name         old alloc/op   new alloc/op   delta
Vec/add1-16     0.00B          0.00B          ~     (all equal)
Vec/add2-16     0.00B          0.00B          ~     (all equal)
Vec/add3-16     0.00B          0.00B          ~     (all equal)

name         old allocs/op  new allocs/op  delta
Vec/add1-16      0.00           0.00          ~     (all equal)
Vec/add2-16      0.00           0.00          ~     (all equal)
Vec/add3-16      0.00           0.00          ~     (all equal)

Here is my benchmark:

package main

import (
	"runtime"
	"testing"
)

type vec struct {
	x, y, z, w float64
}

func (v vec) add1(u vec) vec {
	return vec{v.x + u.x, v.y + u.y, v.z + u.z, v.w + u.w}
}

func (v *vec) add2(u *vec) *vec {
	v.x += u.x
	v.y += u.y
	v.z += u.z
	v.w += u.w
	return v
}

func (v *vec) add3(u *vec) *vec {
	v.x, v.y, v.z, v.w = v.x+u.x, v.y+u.y, v.z+u.z, v.w+u.w
	return v
}

func BenchmarkVec(b *testing.B) {
	b.Log("go version: ", runtime.Version())
	b.Run("add1", func(b *testing.B) {
		v1 := vec{1, 2, 3, 4}
		v2 := vec{4, 5, 6, 7}
		b.ReportAllocs()
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			if i%2 == 0 {
				v1 = v1.add1(v2)
			} else {
				v2 = v2.add1(v1)
			}
		}
	})
	b.Run("add2", func(b *testing.B) {
		v1 := &vec{1, 2, 3, 4}
		v2 := &vec{4, 5, 6, 7}
		b.ReportAllocs()
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			if i%2 == 0 {
				v1 = v1.add2(v2)
			} else {
				v2 = v2.add2(v1)
			}
		}
	})
	b.Run("add3", func(b *testing.B) {
		v1 := &vec{1, 2, 3, 4}
		v2 := &vec{4, 5, 6, 7}
		b.ReportAllocs()
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			if i%2 == 0 {
				v1 = v1.add3(v2)
			} else {
				v2 = v2.add3(v1)
			}
		}
	})
}

From what I saw in your figures, you only ran a single go test -bench command, which is not a valid benchmark procedure for these micro tests. Apart from that, you also ran your tests on Windows, and I am not aware of any way to conduct benchmarks reliably on Windows. It would be great if you could explain why your test result is valid and provide more information: what is your CPU model and OS version?

CPU: i7-7700K @ 4.20GHz
OS: Windows 10 Pro 2004 OS build 19041.572

One more interesting test result for reference:

func BenchmarkVec(b *testing.B) {
	v1 := vec1{1, 2, 3, 4}
	v2 := vec1{4, 5, 6, 7}

	b.Run("vec1", func(b *testing.B) {
		b.ReportAllocs()
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			if i%2 == 0 {
				v1 = v1.add(v2)
			} else {
				v2 = v2.add(v1)
			}
		}
	})

	b.Run("vec2", func(b *testing.B) {
		v3 := vec1{1, 2, 3, 4}
		v4 := vec1{4, 5, 6, 7}

		b.ReportAllocs()
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			if i%2 == 0 {
				v3 = v3.add(v4)
			} else {
				v4 = v4.add(v3)
			}
		}
	})
}
go version go1.15.3 windows/amd64
$ go test -bench "."

goos: windows
goarch: amd64
pkg: vec
BenchmarkVec/vec1-8             535715241                2.21 ns/op            0 B/op          0 allocs/op
BenchmarkVec/vec2-8             1000000000               0.247 ns/op           0 B/op          0 allocs/op
PASS
ok      vec     1.727s

Great, thanks for sharing your additional results. I just got time to run your code; here are the test results:

$ benchstat go114.txt go115.txt tip.txt
name \ time/op       go114.txt    go115.txt    tip.txt
Value-16             0.49ns ± 2%  0.25ns ± 1%  0.25ns ± 0%
Pointer-16           0.37ns ± 1%  0.37ns ± 1%  0.37ns ± 0%
ValueNoinline-16     0.49ns ± 1%  0.25ns ± 2%  0.25ns ± 1%
PointerNoinline-16   0.37ns ± 1%  0.37ns ± 1%  0.37ns ± 1%
Vec/vec1-16          2.20ns ± 0%  2.20ns ± 1%  2.20ns ± 0%
Vec/vec2-16          2.37ns ± 0%  2.38ns ± 0%  2.37ns ± 1%
VecNoinline/vec1-16  2.21ns ± 1%  2.20ns ± 0%  2.22ns ± 1%
VecNoinline/vec2-16  2.38ns ± 1%  2.38ns ± 2%  2.37ns ± 1%
$ inxi -C
CPU:       Topology: 8-Core model: Intel Core i9-9900K bits: 64 type: MT MCP L2 cache: 16.0 MiB
           Speed: 800 MHz min/max: 800/5000 MHz Core speeds (MHz): 1: 800 2: 800 3: 800 4: 800 5: 800 6: 800 7: 800 8: 800
           9: 800 10: 800 11: 800 12: 800 13: 800 14: 800 15: 800 16: 800
$ uname -a
Linux changkun-perflock 5.4.0-52-generic #57-Ubuntu SMP Thu Oct 15 10:57:00 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

From what I skimmed on your benchmark:

  • BenchmarkValue
  • BenchmarkPointer
  • BenchmarkValueNoinline
  • BenchmarkPointerNoinline

These four are really just invalid measurements of vec.add, because you also measured the allocation of the vector, regardless of whether it is allocated on the stack or the heap. One would have to subtract the allocation time from the measurement. We could argue that the allocation time is constant and the results show some relative ordering, but still, it is not a solid measurement.

Regarding BenchmarkVec and BenchmarkVecNoinline, they don't show any significant difference between 1.14, 1.15, and tip. So I assume this is an OS-related issue. As I said, there are few known practices for conducting reliable microbenchmarks on Windows (it would be great if you could share your experience and practices on TalkGo @yangwenmai).

Based on the above discussion, I would conclude that this is an OS-specific issue, though I don't have a Windows test machine to verify it. I will leave the issue open for 2 more weeks for further discussion; otherwise, it will be closed.

No change in consensus. Close.