golang/go

cmd/pprof: pprof provides wrong address/line for CPU profiles

mpx opened this issue · 3 comments

mpx commented

Using CPU profiling and go tool pprof attributes CPU to the instruction following the one that consumed the time. This can cause CPU time to be attributed to an incorrect line number. For comparison, the Linux perf command works correctly. The example below shows the time is spent in the for loop, when it is actually spent during expensive memory accesses.

AFAICT this issue has existed for a long time (at least since 1.4). I can't see this issue has been logged yet which is surprising. It occurs with recent toolchains as well:

go version devel +d277a36123 Fri Sep 11 02:58:36 2020 +0000 linux/amd64
go version go1.15.2 linux/amd64

Tested on: Linux localhost.localdomain 5.8.4-200.fc32.x86_64 #1 SMP Wed Aug 26 22:28:08 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Here is a script which demonstrates the fault:

#!/bin/sh

cat > prof_test.go <<EOF
// Demonstrate that CPU profiles refer to the wrong instruction.
package prof

import "testing"

var buf [1024 * 1024 * 1024]byte

func Benchmark(b *testing.B) {
    for i := 0; i < b.N; i++ {
        // Expensive memory accesses.
        buf[(2097169*i)%len(buf)] = 42
    }
}
EOF

go mod init prof
go test -c -o prof.test
perf record -- ./prof.test -test.bench . -test.benchtime 3s -test.cpuprofile cpu.prof
echo
echo '=== pprof list shows the "for" loop is expensive, not the memory access. ==='
go tool pprof -list Benchmark prof.test cpu.prof
echo
echo '=== pprof disasm shows the loop "INC" instruction is most expensive (not the "MOV"). ==='
go tool pprof -disasm Benchmark prof.test cpu.prof
echo
echo '=== Linux perf correctly shows the "MOV" instruction is expensive. ==='
perf annotate --stdio -s prof.Benchmark | cat

And the output:

$ ./run.sh 
goos: linux
goarch: amd64
pkg: prof
Benchmark-8   	314821470	        11.2 ns/op
PASS
[ perf record: Woken up 4 times to write data ]
[ perf record: Captured and wrote 0.995 MB perf.data (25644 samples) ]

=== pprof list shows the "for" loop is expensive, not the memory access. ===
Total: 6.34s
ROUTINE ======================== prof.Benchmark in /home/mark/Prog/golang/src/junk/memperf/prof_test.go
     6.34s      6.34s (flat, cum)   100% of Total
         .          .      4:import "testing"
         .          .      5:
         .          .      6:var buf [1024 * 1024 * 1024]byte
         .          .      7:
         .          .      8:func Benchmark(b *testing.B) {
     5.52s      5.52s      9:    for i := 0; i < b.N; i++ {
         .          .     10:        // Expensive memory accesses.
     820ms      820ms     11:        buf[(2097169*i)%len(buf)] = 42
         .          .     12:    }
         .          .     13:}

=== pprof disasm shows the loop "INC" instruction is most expensive (not the "MOV"). ===
Total: 6.34s
ROUTINE ======================== prof.Benchmark
     6.34s      6.34s (flat, cum)   100% of Total
         .          .     509500: SUBQ $0x18, SP                          ;prof_test.go:8
         .          .     509504: MOVQ BP, 0x10(SP)
         .          .     509509: LEAQ 0x10(SP), BP
         .          .     50950e: MOVQ 0x20(SP), DX                       ;prof_test.go:9
         .          .     509513: XORL AX, AX
         .          .     509515: JMP 0x509525
      30ms       30ms     509517: LEAQ prof.buf(SB), BX                   ;prof.Benchmark prof_test.go:11
     540ms      540ms     50951e: MOVB $0x2a, 0(BX)(SI*1)
     5.49s      5.49s     509522: INCQ AX                                 ;prof.Benchmark prof_test.go:9
      30ms       30ms     509525: CMPQ AX, 0x170(DX)
         .          .     50952c: JLE 0x509559                            ;prof_test.go:9
      10ms       10ms     50952e: IMULQ $0x200011, AX, BX                 ;prof.Benchmark prof_test.go:11
      20ms       20ms     509535: MOVQ BX, SI
      80ms       80ms     509538: SARQ $0x3f, BX
         .          .     50953c: SHRQ $0x22, BX                          ;prof_test.go:11
         .          .     509540: ADDQ SI, BX
      30ms       30ms     509543: SARQ $0x1e, BX                          ;prof.Benchmark prof_test.go:11
      90ms       90ms     509547: SHLQ $0x1e, BX
         .          .     50954b: SUBQ BX, SI                             ;prof_test.go:11
      20ms       20ms     50954e: CMPQ $prof.buf+1067102432(SB), SI       ;prof.Benchmark prof_test.go:11
         .          .     509555: JB 0x509517                             ;prof_test.go:11
         .          .     509557: JMP 0x509563
         .          .     509559: MOVQ 0x10(SP), BP                       ;prof_test.go:9
         .          .     50955e: ADDQ $0x18, SP
         .          .     509562: RET
         .          .     509563: MOVQ SI, AX                             ;prof_test.go:11
         .          .     509566: MOVL $prof.buf+1067102432(SB), CX
         .          .     50956b: CALL runtime.panicIndex(SB)

=== Linux perf correctly shows the "MOV" instruction is expensive. ===
 Percent |	Source code & Disassembly of prof.test for cycles:u (24257 samples, percent: local period)
----------------------------------------------------------------------------------------------------------
         :
         :
         :
         :            Disassembly of section .text:
         :
         :            0000000000509500 <prof.Benchmark>:
         :            prof.Benchmark():
         :
         :            import "testing"
         :
         :            var buf [1024 * 1024 * 1024]byte
         :
         :            func Benchmark(b *testing.B) {
    0.00 :   509500: sub    $0x18,%rsp
    0.00 :   509504: mov    %rbp,0x10(%rsp)
    0.00 :   509509: lea    0x10(%rsp),%rbp
         :            for i := 0; i < b.N; i++ {
    0.00 :   50950e: mov    0x20(%rsp),%rdx
    0.00 :   509513: xor    %eax,%eax
    0.00 :   509515: jmp    509525 <prof.Benchmark+0x25>
         :            // Expensive memory accesses.
         :            buf[(2097169*i)%len(buf)] = 42
    1.82 :   509517: lea    0x14ba02(%rip),%rbx        # 654f20 <prof.buf>
   92.32 :   50951e: movb   $0x2a,(%rbx,%rsi,1)
         :            for i := 0; i < b.N; i++ {
    0.08 :   509522: inc    %rax
    0.25 :   509525: cmp    %rax,0x170(%rdx)
    0.00 :   50952c: jle    509559 <prof.Benchmark+0x59>
         :            buf[(2097169*i)%len(buf)] = 42
    0.53 :   50952e: imul   $0x200011,%rax,%rbx
    1.74 :   509535: mov    %rbx,%rsi
    0.02 :   509538: sar    $0x3f,%rbx
    0.11 :   50953c: shr    $0x22,%rbx
    0.46 :   509540: add    %rsi,%rbx
    1.93 :   509543: sar    $0x1e,%rbx
    0.03 :   509547: shl    $0x1e,%rbx
    0.12 :   50954b: sub    %rbx,%rsi
    0.00 :   50954e: cmp    $0x40000000,%rsi
    0.61 :   509555: jb     509517 <prof.Benchmark+0x17>
    0.00 :   509557: jmp    509563 <prof.Benchmark+0x63>
         :            for i := 0; i < b.N; i++ {
    0.00 :   509559: mov    0x10(%rsp),%rbp
    0.00 :   50955e: add    $0x18,%rsp
    0.00 :   509562: retq
         :            buf[(2097169*i)%len(buf)] = 42
    0.00 :   509563: mov    %rsi,%rax
    0.00 :   509566: mov    $0x40000000,%ecx
    0.00 :   50956b: callq  46aae0 <runtime.panicIndex>
         :            ./<autogenerated>:1
    0.00 :   509570: nop

Cc @hyangah @randall77

This sounds like it is likely due to the proposal in #36821. Go's pprof profiling uses ITIMER_PROF, which can have skid of several instructions from the actual expensive instruction.

Linux's perf, on the other hand, can use Intel PEBS (/AMD equivalent) for precise sampling.

I don't think perf enables PEBS by default (-e cycles:pp is required), but even the standard hardware counter likely has less skid than ITIMER_PROF.

/cc @rhysh too.

As I noted in #30708, we cannot correct the skid in Go except to use precise hardware counters, which #36821 would enable.

Regardless of the outcome of that proposal (not all systems will have precise counters), we could use better documentation about this limitation.