cmd/pprof: pprof provides wrong address/line for CPU profiles
mpx opened this issue · 3 comments
CPU profiling with go tool pprof attributes CPU time to the instruction following the one that actually consumed it. This can cause CPU time to be attributed to an incorrect line number. For comparison, the Linux perf command attributes the time correctly. In the example below, pprof reports the time as spent in the for loop, when it is actually spent on expensive memory accesses.
AFAICT this issue has existed for a long time (at least since Go 1.4). Surprisingly, I can't find it logged anywhere. It occurs with recent toolchains as well:
go version devel +d277a36123 Fri Sep 11 02:58:36 2020 +0000 linux/amd64
go version go1.15.2 linux/amd64
Tested on: Linux localhost.localdomain 5.8.4-200.fc32.x86_64 #1 SMP Wed Aug 26 22:28:08 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Here is a script which demonstrates the fault:
#!/bin/sh
cat > prof_test.go <<EOF
// Demonstrate that CPU profiles refer to the wrong instruction.
package prof

import "testing"

var buf [1024 * 1024 * 1024]byte

func Benchmark(b *testing.B) {
	for i := 0; i < b.N; i++ {
		// Expensive memory accesses.
		buf[(2097169*i)%len(buf)] = 42
	}
}
EOF
go mod init prof
go test -c -o prof.test
perf record -- ./prof.test -test.bench . -test.benchtime 3s -test.cpuprofile cpu.prof
echo
echo '=== pprof list shows the "for" loop is expensive, not the memory access. ==='
go tool pprof -list Benchmark prof.test cpu.prof
echo
echo '=== pprof disasm shows the loop "INC" instruction is most expensive (not the "MOV"). ==='
go tool pprof -disasm Benchmark prof.test cpu.prof
echo
echo '=== Linux perf correctly shows the "MOV" instruction is expensive. ==='
perf annotate --stdio -s prof.Benchmark | cat
And the output:
$ ./run.sh
goos: linux
goarch: amd64
pkg: prof
Benchmark-8 314821470 11.2 ns/op
PASS
[ perf record: Woken up 4 times to write data ]
[ perf record: Captured and wrote 0.995 MB perf.data (25644 samples) ]
=== pprof list shows the "for" loop is expensive, not the memory access. ===
Total: 6.34s
ROUTINE ======================== prof.Benchmark in /home/mark/Prog/golang/src/junk/memperf/prof_test.go
6.34s 6.34s (flat, cum) 100% of Total
. . 4:import "testing"
. . 5:
. . 6:var buf [1024 * 1024 * 1024]byte
. . 7:
. . 8:func Benchmark(b *testing.B) {
5.52s 5.52s 9: for i := 0; i < b.N; i++ {
. . 10: // Expensive memory accesses.
820ms 820ms 11: buf[(2097169*i)%len(buf)] = 42
. . 12: }
. . 13:}
=== pprof disasm shows the loop "INC" instruction is most expensive (not the "MOV"). ===
Total: 6.34s
ROUTINE ======================== prof.Benchmark
6.34s 6.34s (flat, cum) 100% of Total
. . 509500: SUBQ $0x18, SP ;prof_test.go:8
. . 509504: MOVQ BP, 0x10(SP)
. . 509509: LEAQ 0x10(SP), BP
. . 50950e: MOVQ 0x20(SP), DX ;prof_test.go:9
. . 509513: XORL AX, AX
. . 509515: JMP 0x509525
30ms 30ms 509517: LEAQ prof.buf(SB), BX ;prof.Benchmark prof_test.go:11
540ms 540ms 50951e: MOVB $0x2a, 0(BX)(SI*1)
5.49s 5.49s 509522: INCQ AX ;prof.Benchmark prof_test.go:9
30ms 30ms 509525: CMPQ AX, 0x170(DX)
. . 50952c: JLE 0x509559 ;prof_test.go:9
10ms 10ms 50952e: IMULQ $0x200011, AX, BX ;prof.Benchmark prof_test.go:11
20ms 20ms 509535: MOVQ BX, SI
80ms 80ms 509538: SARQ $0x3f, BX
. . 50953c: SHRQ $0x22, BX ;prof_test.go:11
. . 509540: ADDQ SI, BX
30ms 30ms 509543: SARQ $0x1e, BX ;prof.Benchmark prof_test.go:11
90ms 90ms 509547: SHLQ $0x1e, BX
. . 50954b: SUBQ BX, SI ;prof_test.go:11
20ms 20ms 50954e: CMPQ $prof.buf+1067102432(SB), SI ;prof.Benchmark prof_test.go:11
. . 509555: JB 0x509517 ;prof_test.go:11
. . 509557: JMP 0x509563
. . 509559: MOVQ 0x10(SP), BP ;prof_test.go:9
. . 50955e: ADDQ $0x18, SP
. . 509562: RET
. . 509563: MOVQ SI, AX ;prof_test.go:11
. . 509566: MOVL $prof.buf+1067102432(SB), CX
. . 50956b: CALL runtime.panicIndex(SB)
=== Linux perf correctly shows the "MOV" instruction is expensive. ===
Percent | Source code & Disassembly of prof.test for cycles:u (24257 samples, percent: local period)
----------------------------------------------------------------------------------------------------------
:
:
:
: Disassembly of section .text:
:
: 0000000000509500 <prof.Benchmark>:
: prof.Benchmark():
:
: import "testing"
:
: var buf [1024 * 1024 * 1024]byte
:
: func Benchmark(b *testing.B) {
0.00 : 509500: sub $0x18,%rsp
0.00 : 509504: mov %rbp,0x10(%rsp)
0.00 : 509509: lea 0x10(%rsp),%rbp
: for i := 0; i < b.N; i++ {
0.00 : 50950e: mov 0x20(%rsp),%rdx
0.00 : 509513: xor %eax,%eax
0.00 : 509515: jmp 509525 <prof.Benchmark+0x25>
: // Expensive memory accesses.
: buf[(2097169*i)%len(buf)] = 42
1.82 : 509517: lea 0x14ba02(%rip),%rbx # 654f20 <prof.buf>
92.32 : 50951e: movb $0x2a,(%rbx,%rsi,1)
: for i := 0; i < b.N; i++ {
0.08 : 509522: inc %rax
0.25 : 509525: cmp %rax,0x170(%rdx)
0.00 : 50952c: jle 509559 <prof.Benchmark+0x59>
: buf[(2097169*i)%len(buf)] = 42
0.53 : 50952e: imul $0x200011,%rax,%rbx
1.74 : 509535: mov %rbx,%rsi
0.02 : 509538: sar $0x3f,%rbx
0.11 : 50953c: shr $0x22,%rbx
0.46 : 509540: add %rsi,%rbx
1.93 : 509543: sar $0x1e,%rbx
0.03 : 509547: shl $0x1e,%rbx
0.12 : 50954b: sub %rbx,%rsi
0.00 : 50954e: cmp $0x40000000,%rsi
0.61 : 509555: jb 509517 <prof.Benchmark+0x17>
0.00 : 509557: jmp 509563 <prof.Benchmark+0x63>
: for i := 0; i < b.N; i++ {
0.00 : 509559: mov 0x10(%rsp),%rbp
0.00 : 50955e: add $0x18,%rsp
0.00 : 509562: retq
: buf[(2097169*i)%len(buf)] = 42
0.00 : 509563: mov %rsi,%rax
0.00 : 509566: mov $0x40000000,%ecx
0.00 : 50956b: callq 46aae0 <runtime.panicIndex>
: ./<autogenerated>:1
0.00 : 509570: nop
This sounds like it is likely related to the proposal in #36821. Go's pprof profiling uses ITIMER_PROF, which can have skid of several instructions from the actual expensive instruction.
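A quick way to see the raw skid is to dump the leaf addresses recorded in cpu.prof. Here is a minimal sketch using the github.com/google/pprof/profile package; the assumptions about the sample layout are noted in the comments:

// pcs.go: sum CPU time by the leaf PC recorded in a Go CPU profile.
package main

import (
	"fmt"
	"log"
	"os"
	"sort"

	"github.com/google/pprof/profile"
)

func main() {
	f, err := os.Open("cpu.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// profile.Parse understands the gzipped protobuf that Go writes.
	p, err := profile.Parse(f)
	if err != nil {
		log.Fatal(err)
	}

	// Assumptions: Location[0] is the interrupted (leaf) frame, and
	// Value[1] is "cpu/nanoseconds" (the default layout of Go CPU profiles).
	total := map[uint64]int64{}
	for _, s := range p.Sample {
		if len(s.Location) > 0 {
			total[s.Location[0].Address] += s.Value[1]
		}
	}

	// Print PCs sorted by cumulative CPU time, hottest first.
	pcs := make([]uint64, 0, len(total))
	for pc := range total {
		pcs = append(pcs, pc)
	}
	sort.Slice(pcs, func(i, j int) bool { return total[pcs[i]] > total[pcs[j]] })
	for _, pc := range pcs {
		fmt.Printf("%#x\t%dms\n", pc, total[pc]/1000000)
	}
}

Run against the profile above, the hottest address should come out as 0x509522 (the INCQ), one instruction past the MOVB at 0x50951e that actually stalls, matching the pprof disasm listing.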
Linux's perf, on the other hand, can use Intel PEBS (or the AMD equivalent) for precise sampling. I don't think perf enables PEBS by default (-e cycles:pp is required), but even the standard hardware counter likely has less skid than ITIMER_PROF.
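For reference, requesting precise sampling explicitly is a small change to the perf steps in the script above (a sketch; whether cycles:pp is honored depends on the CPU and kernel):

perf record -e cycles:pp -- ./prof.test -test.bench . -test.benchtime 3s
perf annotate --stdio -s prof.Benchmark | cat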