golang/go

cmd/compile: SSA performance regression due to array zeroing

go version devel +bbd1dcd Wed Jun 1 21:32:46 2016 +0000 darwin/amd64

https://github.com/hydroflame/go1.7bench
This very simple program takes two 4-element vectors and performs component-wise addition.
If you run the benchmark with and without SSA, you will see that it is much faster without SSA.
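
For reference, a minimal sketch of the kind of code being benchmarked (the real code is in the linked repo; the names, the signature, and the tmp sink here are assumptions based on the assembly and discussion below):

package vec

import "testing"

type vec4 [4]float32

// tmp is a package-level sink so the result is not optimized away.
var tmp vec4

func add(v0, v1 *vec4) vec4 {
	return vec4{v0[0] + v1[0], v0[1] + v1[1], v0[2] + v1[2], v0[3] + v1[3]}
}

func BenchmarkAdd(b *testing.B) {
	x, y := vec4{1, 2, 3, 4}, vec4{5, 6, 7, 8}
	for i := 0; i < b.N; i++ {
		tmp = add(&x, &y)
	}
}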

$ go test -run NONE -bench . -gcflags -ssa=0 > nossa.txt
testing: warning: no tests to run
$ go test -run NONE -bench . -gcflags -ssa=1 > ssa.txt
testing: warning: no tests to run
$ benchcmp nossa.txt ssa.txt 
benchmark          old ns/op     new ns/op     delta
BenchmarkAdd-4     2.45          11.4          +365.31%

I expect the binary generated with SSA to be at least as fast as the one from the old backend.

This particular example is extracted from one of my bigger libraries: github.com/luxengine/glm. For more examples of poor SSA performance, run the benchmarks in

  • github.com/luxengine/glm
  • github.com/luxengine/glm/geo

The SSA backend doesn't handle arrays terribly well, as they are not SSAable. This leads to some extra copies that the old backend was able to avoid.

add is inlined in this test. The inner loop of BenchmarkAdd is (with SSA):

    0x0047 00071 (vec_test.go:12)   MOVQ    $0, "".~r1(SP)
    0x004f 00079 (vec_test.go:12)   MOVQ    $0, "".~r1+8(SP)
    0x0058 00088 (vec_test.go:12)   MOVQ    $0, "".autotmp_5+48(SP)
    0x0061 00097 (vec_test.go:12)   MOVQ    $0, "".autotmp_5+56(SP)
    0x006a 00106 (vec_test.go:12)   MOVSS   "".v0+32(SP), X0
    0x0070 00112 (vec_test.go:12)   MOVSS   "".v1+16(SP), X1
    0x0076 00118 (vec_test.go:12)   ADDSS   X1, X0
    0x007a 00122 (vec_test.go:12)   MOVSS   X0, "".autotmp_5+48(SP)
    0x0080 00128 (vec_test.go:12)   MOVSS   "".v0+36(SP), X0
    0x0086 00134 (vec_test.go:12)   MOVSS   "".v1+20(SP), X1
    0x008c 00140 (vec_test.go:12)   ADDSS   X1, X0
    0x0090 00144 (vec_test.go:12)   MOVSS   X0, "".autotmp_5+52(SP)
    0x0096 00150 (vec_test.go:12)   MOVSS   "".v0+40(SP), X0
    0x009c 00156 (vec_test.go:12)   MOVSS   "".v1+24(SP), X1
    0x00a2 00162 (vec_test.go:12)   ADDSS   X1, X0
    0x00a6 00166 (vec_test.go:12)   MOVSS   X0, "".autotmp_5+56(SP)
    0x00ac 00172 (vec_test.go:12)   MOVSS   "".v0+44(SP), X0
    0x00b2 00178 (vec_test.go:12)   MOVSS   "".v1+28(SP), X1
    0x00b8 00184 (vec_test.go:12)   ADDSS   X1, X0
    0x00bc 00188 (vec_test.go:12)   MOVSS   X0, "".autotmp_5+60(SP)
    0x00c2 00194 (vec_test.go:12)   MOVUPS  "".autotmp_5+48(SP), X0
    0x00c7 00199 (vec_test.go:12)   MOVUPS  X0, "".~r1(SP)
    0x00cb 00203 (vec_test.go:12)   MOVUPS  "".~r1(SP), X0
    0x00cf 00207 (vec_test.go:12)   MOVUPS  X0, "".tmp(SB)
    0x00d6 00214 (vec_test.go:11)   INCQ    CX
    0x00d9 00217 (vec_test.go:11)   MOVQ    184(AX), DX
    0x00e0 00224 (vec_test.go:11)   CMPQ    CX, DX
    0x00e3 00227 (vec_test.go:11)   JLT $0, 71

Note the unnecessary zeroing of ~r1 and autotmp_5, and the copies autotmp_5 -> ~r1 -> tmp.

For comparison, the same inner loop without SSA is:

    0x0090 00144 (vec_test.go:12)   LEAQ    "".v0+16(SP), BX
    0x0095 00149 (vec_test.go:12)   MOVQ    BX, CX
    0x0098 00152 (vec_test.go:12)   LEAQ    "".v1(SP), BX
    0x009c 00156 (vec_test.go:12)   MOVQ    BX, AX
    0x009f 00159 (vec_test.go:12)   MOVQ    $0, BX
    0x00a1 00161 (vec_test.go:12)   XORPS   X0, X0
    0x00a4 00164 (vec_test.go:12)   MOVQ    $0, BX
    0x00a6 00166 (vec_test.go:12)   MOVSS   (CX), X4
    0x00aa 00170 (vec_test.go:12)   MOVSS   (AX), X1
    0x00ae 00174 (vec_test.go:12)   ADDSS   X1, X4
    0x00b2 00178 (vec_test.go:12)   MOVSS   4(CX), X3
    0x00b7 00183 (vec_test.go:12)   MOVSS   4(AX), X1
    0x00bc 00188 (vec_test.go:12)   ADDSS   X1, X3
    0x00c0 00192 (vec_test.go:12)   MOVSS   8(CX), X2
    0x00c5 00197 (vec_test.go:12)   MOVSS   8(AX), X1
    0x00ca 00202 (vec_test.go:12)   ADDSS   X1, X2
    0x00ce 00206 (vec_test.go:12)   MOVSS   12(CX), X0
    0x00d3 00211 (vec_test.go:12)   MOVSS   12(AX), X1
    0x00d8 00216 (vec_test.go:12)   ADDSS   X1, X0
    0x00dc 00220 (vec_test.go:12)   MOVSS   X4, "".tmp(SB)
    0x00e4 00228 (vec_test.go:12)   MOVSS   X3, "".tmp+4(SB)
    0x00ec 00236 (vec_test.go:12)   MOVSS   X2, "".tmp+8(SB)
    0x00f4 00244 (vec_test.go:12)   MOVSS   X0, "".tmp+12(SB)
    0x00fc 00252 (vec_test.go:11)   INCQ    DX
    0x00ff 00255 (vec_test.go:11)   MOVQ    184(SI), BX
    0x0106 00262 (vec_test.go:11)   CMPQ    BX, DX
    0x0109 00265 (vec_test.go:11)   JGT $0, 144

The old compiler writes directly to the destination.

This is on my radar to fix, but it isn't easy. Hopefully for 1.8.
See also #14762, it applies to the uninlined version of add.

If I changed to vec4{x, y, z, w float32}, would that solve the problem?

It might. Another workaround is to not pass vectors around by value: use []float32 or *[4]float32 everywhere. (It is a bit odd that your add takes *vec4 as args but returns a vec4. It might be better anyway to do func (r *vec4) add(v0, v1 *vec4).)
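
A minimal sketch of the suggested workarounds, assuming a struct-based vec4 with float32 fields (hypothetical names, not the library's actual API):

type vec4 struct{ x, y, z, w float32 }

// add writes v0 + v1 into r, so no temporary array or struct is copied around.
func (r *vec4) add(v0, v1 *vec4) {
	r.x = v0.x + v1.x
	r.y = v0.y + v1.y
	r.z = v0.z + v1.z
	r.w = v0.w + v1.w
}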

I tested it; it's definitely faster with structs.

rsc commented

Why are structs and small arrays treated differently? Is blasting apart small arrays (the same way small structs are blasted apart) an easy optimization to add? I believe the old back ends did this for arrays of size <= 4 or something like that.

/cc @dr2chase @randall77 @cherrymui

I believe it does not decompose arrays because it doesn't know how to handle non-constant indexing. There is a TODO in gc/ssa.go:

    case TARRAY:
        // We can't do arrays because dynamic indexing is
        // not supported on SSA variables.
        // TODO: maybe allow if length is <=1?  All indexes
        // are constant?  Might be good for the arrays
        // introduced by the compiler for variadic functions.
rsc commented

Interesting. So the old back end, because it just worked on stack words, saw that these words were being used, never had their address taken, and completely registerized them. And SSA isn't smart enough to do that because it decides about variable tracking early? The implication here, though, is that all the array refs are constant indexes. Maybe if you know the array is only ever accessed by constant index (and address not taken, but that's needed for structs already) then it can be split apart?

Yes, if we knew that all accesses were constant indexes we could registerize it.
Or if it is [1] we can registerize it, because the only index that can get past the bounds check is 0.
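
To illustrate the distinction (a hypothetical example, not compiler code): with only constant indexes, each element could in principle become an independent SSA value and live in a register, while a single dynamic index forces the whole array to stay addressable in memory.

func constIdx(a [4]float32) float32 {
	// Every index is a constant; each element could be decomposed
	// into its own scalar value.
	return a[0] + a[1] + a[2] + a[3]
}

func varIdx(a [4]float32, i int) float32 {
	// Dynamic index: the array as a whole must be kept in memory
	// so a[i] can be loaded after the bounds check.
	return a[i]
}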

I was measuring different approaches to implementing Vector.Add. Almost made a new issue, but eventually found this.

One specific case stood out: https://play.golang.org/p/vzC344lntX. There is a 20x performance difference.

type Struct struct{ X, Y, Z float32 }
func (a Struct) Add(b Struct) Struct { return Struct{a.X + b.X, a.Y + b.Y, a.Z + b.Z} }

type Array [3]float32
func (a Array) Add(b Array) Array { return Array{a[0] + b[0], a[1] + b[1], a[2] + b[2]} }

BenchmarkStructAdd-8               2000000000               0.98 ns/op
BenchmarkArrayAdd-8                100000000               20.7 ns/op

Exhaustive benchmark of different approaches: https://play.golang.org/p/6diXIPOWu5

        Struct         Array
PP      2.91 ns/op     3.00 ns/op
PN      2.93 ns/op     2.94 ns/op
PPP     4.89 ns/op     9.41 ns/op
NNN     0.98 ns/op    20.70 ns/op
PPN     2.91 ns/op    17.70 ns/op
PNP     4.89 ns/op     8.94 ns/op
NPP     2.93 ns/op    15.20 ns/op
NNP     2.93 ns/op    14.30 ns/op
NPN     1.00 ns/op    21.00 ns/op
PNN     2.93 ns/op    15.00 ns/op
PPZ     2.92 ns/op     9.18 ns/op
NPZ     2.91 ns/op    14.90 ns/op
PNZ     2.92 ns/op     8.85 ns/op
NNZ     2.93 ns/op    14.30 ns/op
PPY     2.96 ns/op     2.94 ns/op
NPY     2.92 ns/op     8.74 ns/op
PNY     2.92 ns/op     2.95 ns/op
NNY     2.93 ns/op     9.05 ns/op

(go version devel +11eaf42 Sat Apr 29 04:15:49 2017 +0000 windows/amd64)

I'm working on a change that allows small arrays with constant indexes to be SSA'd. I'm not 100% happy with the approach (storing a flag on the array type, instead of on the node, to indicate that it was variably indexed), so I'm taking another look, but it does work and will solve this:

goos: linux
goarch: amd64
BenchmarkStruct-8   	2000000000	         0.78 ns/op
BenchmarkArray-8    	2000000000	         0.78 ns/op
PASS
ok  	_/tmp/foo	3.277s

@tzneal wouldn't it be possible to rewrite variably indexed statements into switches? Just as a proof of concept: https://play.golang.org/p/m0n1E7Q2kg (I'm sure there is a way to avoid the performance hit introduced by the panic when you have access to codegen.)

BenchmarkStruct-8               2000000000               0.98 ns/op
BenchmarkArray-8                50000000                29.7 ns/op
BenchmarkArray2-8               100000000               22.0 ns/op
BenchmarkArray3-8               100000000               19.9 ns/op
BenchmarkArrayStruct-8          100000000               19.5 ns/op
BenchmarkArrayStruct3-8         200000000                6.91 ns/op
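
For reference, the rewrite suggested above amounts to something like this (a hypothetical sketch, not the actual playground code): a dynamic index into a small array becomes a switch over the constant indexes, so every array access is constant-indexed.

// indexArray3 reads a[i] without a dynamic array index.
func indexArray3(a [3]float32, i int) float32 {
	switch i {
	case 0:
		return a[0]
	case 1:
		return a[1]
	case 2:
		return a[2]
	default:
		panic("index out of range")
	}
}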

I tested a scenario using var acc [4]float64 vs. var acc0, acc1, acc2, acc3 float64, and the 4-element array version does not keep the accumulator values in registers.

In the following version, the accumulator values are always written back to the stack inside the loop:

func SumFloat64Unroll_array(buf []float64) float64 {
	var (
		acc [4]float64
	)
	for i := 0; i < len(buf); i += 4 {
		bb := (*[4]float64)(unsafe.Pointer(&buf[i]))
		acc[0] += bb[0]
		acc[1] += bb[1]
		acc[2] += bb[2]
		acc[3] += bb[3]
	}
	return acc[0] + acc[1] + acc[2] + acc[3]
}

which compiles to:

  movsd (CX)(DX*8), X0
  addsd "".acc(SP), X0
  movsd X0, "".acc(SP)
  movsd 8(CX)(DX*8), X0
  addsd "".acc+8(SP), X0
  movsd X0, "".acc+8(SP)
  movsd 16(CX)(DX*8), X0
  addsd "".acc+16(SP), X0
  movsd X0, "".acc+16(SP)
  movsd 24(CX)(DX*8), X0
  addsd "".acc+24(SP), X0
  movsd X0, "".acc+24(SP)

vs. this version, where each accumulator remains in a register for the duration of the loop:

func SumFloat64Unroll_vars(buf []float64) float64 {
	var (
		acc0, acc1, acc2, acc3 float64
	)
	for i := 0; i < len(buf); i += 4 {
		bb := (*[4]float64)(unsafe.Pointer(&buf[i]))
		acc0 += bb[0]
		acc1 += bb[1]
		acc2 += bb[2]
		acc3 += bb[3]
	}
	return acc0 + acc1 + acc2 + acc3
}

which compiles to:

  movsd (CX)(DX*8), X4
  addsd X4, X0
  movsd 8(CX)(DX*8), X4
  addsd X4, X1
  movsd 16(CX)(DX*8), X4
  addsd X4, X2
  movsd 24(CX)(DX*8), X4
  addsd X4, X3

Change https://golang.org/cl/106495 mentions this issue: cmd/compile: add some generic composite type optimizations