runtime: GoroutineProfile causes latency spikes in high-goroutine-count application
dpolansky opened this issue · 5 comments
What version of Go are you using (go version)?
$ go version
go version devel +60f14fddfe Wed Jun 26 16:35:14 2019 +0000 darwin/amd64
Does this issue reproduce with the latest release?
The issue reproduces on go1.12.7 and go1.13beta1.
What operating system and processor architecture are you using (go env)?
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/dpolans/Library/Caches/go-build"
GOENV="/Users/dpolans/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/dpolans/dev"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/usr/local/go/src/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/cm/rp_0xvp15fn5mp4d672t2qsxkl2s83/T/go-build643112776=/tmp/go-build -gno-record-gcc-switches -fno-common"
What did you do?
I have a (production, latency-sensitive) Go service that has a median goroutine count of about 300,000 during steady state operations. The program periodically collects profiling data, including goroutine profiles.
What did you expect to see?
I expected the application’s performance behavior to be only slightly affected by collecting the goroutine profile (as is the case for collecting a heap profile, mutex contention profile, or CPU profile). I expected the magnitude of the latency impact to not scale with the number of goroutines (as the performance impact of heap profiling doesn’t change significantly with the size of the heap).
I expect some latency impact from a stop-the-world pause to prepare for the profile. I do not expect the duration of that pause to scale in proportion to the number of goroutines.
What did you see instead?
When goroutine profiling is enabled, the 99th-percentile latency of certain operations increases significantly. I have other programs with far fewer goroutines that collect goroutine profiles without incurring the same latency spikes.
The benchmark attempts to replicate the latency spike by emulating the work done by the real application. The real application has one goroutine that periodically requests a goroutine profile while the other goroutines do some operation and report how long it took. The benchmark sets up a scenario with one goroutine that continuously collects goroutine profiles, n idle goroutines, and one goroutine that reports the duration of sleeping for 1 millisecond. The benchmark runs with several n values to demonstrate the latency impact as n grows. The results here show that when a goroutine profile is being collected, the p90, p99, and p99.9 wall-clock time the reproducer takes to sleep for 1ms increases with the number of live goroutines.
$ go test -bench=. -benchtime 1000x
goos: darwin
goarch: amd64
pkg: playground/runtime-slow-goroutineprofile
BenchmarkGoroutineProfile/1e5_other-8 1000 1369474 ns/op 4.97 ms-max 0.477 ms-p90 0.507 ms-p99 0.532 ms-p99.9 1001 work-count
BenchmarkGoroutineProfile/2e5_other-8 1000 1376884 ns/op 0.538 ms-max 0.477 ms-p90 0.516 ms-p99 0.536 ms-p99.9 999 work-count
BenchmarkGoroutineProfile/5e5_other-8 1000 1365024 ns/op 4.89 ms-max 0.476 ms-p90 0.507 ms-p99 0.544 ms-p99.9 1001 work-count
BenchmarkGoroutineProfile/1e6_other-8 1000 1377884 ns/op 3.91 ms-max 0.478 ms-p90 0.522 ms-p99 0.900 ms-p99.9 1002 work-count
BenchmarkGoroutineProfile/1e5_profile-8 1000 83836454 ns/op 344 ms-max 148 ms-p90 284 ms-p99 338 ms-p99.9 579 work-count
BenchmarkGoroutineProfile/2e5_profile-8 1000 111897245 ns/op 558 ms-max 261 ms-p90 342 ms-p99 514 ms-p99.9 430 work-count
BenchmarkGoroutineProfile/5e5_profile-8 1000 238983513 ns/op 1383 ms-max 629 ms-p90 1205 ms-p99 1318 ms-p99.9 379 work-count
BenchmarkGoroutineProfile/1e6_profile-8 1000 129128664 ns/op 2409 ms-max 1086 ms-p90 1234 ms-p99 2405 ms-p99.9 106 work-count
PASS
ok playground/runtime-slow-goroutineprofile 607.172s
Reproducer:
package issue

import (
	"context"
	"runtime"
	"sort"
	"sync"
	"sync/atomic"
	"testing"
	"time"
)

func BenchmarkGoroutineProfile(b *testing.B) {
	testcase := func(n int, profile bool) func(b *testing.B) {
		return func(b *testing.B) {
			var wg sync.WaitGroup
			ctx, cancel := context.WithCancel(context.Background())
			sr := make([]runtime.StackRecord, 2e6)

			// create bystanders
			for i := 0; i < n; i++ {
				wg.Add(1)
				go func() {
					<-ctx.Done()
					wg.Done()
				}()
			}

			// give the bystanders a moment to start up
			time.Sleep(1000 * time.Millisecond)

			// our villain
			var workCount uint64
			wg.Add(1)
			go func() {
				defer wg.Done()
				for ctx.Err() == nil {
					atomic.AddUint64(&workCount, 1)
					if profile {
						runtime.GoroutineProfile(sr) // this function iterates over runtime.allgs while all other work is stopped
					}
					time.Sleep(1 * time.Millisecond)
				}
			}()

			// record how long it takes to sleep for 1 millisecond b.N times
			times := make([]float64, b.N)
			sleep := 1 * time.Millisecond
			b.ResetTimer()
			prev := time.Now()
			for i := 0; i < b.N; i++ {
				time.Sleep(sleep)
				now := time.Now()
				times[i] = (now.Sub(prev) - sleep).Seconds() / time.Millisecond.Seconds()
				prev = now
			}
			b.StopTimer()

			cancel()
			wg.Wait()

			sort.Sort(sort.Reverse(sort.Float64Slice(times)))
			b.ReportMetric(times[len(times)/10], "ms-p90")
			b.ReportMetric(times[len(times)/100], "ms-p99")
			b.ReportMetric(times[len(times)/1000], "ms-p99.9")
			b.ReportMetric(times[0], "ms-max")
			b.ReportMetric(float64(workCount), "work-count")
		}
	}
	b.Run("1e5 other", testcase(1e5, false))
	b.Run("2e5 other", testcase(2e5, false))
	b.Run("5e5 other", testcase(5e5, false))
	b.Run("1e6 other", testcase(1e6, false))
	b.Run("1e5 profile", testcase(1e5, true))
	b.Run("2e5 profile", testcase(2e5, true))
	b.Run("5e5 profile", testcase(5e5, true))
	b.Run("1e6 profile", testcase(1e6, true))
}
CC @aclements @rhysh
@aclements what do you think is the right approach here?
Today, runtime.GoroutineProfile stops the world and keeps it stopped while iterating over runtime.allgs. Could it instead inspect only the _Grunning goroutines (via allp) while the world is stopped, set a flag that's checked on the transition into _Grunning, and iterate over the rest of allgs while the program executes?
What level of consistency is required in the goroutine profile? Currently it's completely consistent, since the world is stopped. The function's docs also promise that the result slice won't be modified unless it writes out the full profile. The gcount function usually gives an estimate; is it safe to call while the world is stopped, and will that result in an exact count?
The GC has a couple of mechanisms for visiting all goroutines: stack scanning and stack shrinking. Extending those seems complex, but might lead to low execution overhead. The simpler option I can think of involves adding a field to runtime.g to track whether it has been included in the current profile, and adding a check to runtime.execute (the transition into the _Grunning state / the update of runtime.(*m).curg) to profile goroutines before they change their call stacks to be inconsistent with the rest of the profile. What level of complexity, and what level of overhead, would be acceptable in support of this fix?
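To make the shape of that simpler option concrete, here is a pseudocode sketch (not actual runtime code; the names profileActive and gp.profiled are invented for illustration):

```
// Hypothetical sketch of the proposed approach.
stopTheWorld()
for each _Grunning goroutine gp (found via allp):
    record gp's stack; gp.profiled = true
profileActive = true
startTheWorld()

// Concurrently, in runtime.execute (transition into _Grunning):
if profileActive && !gp.profiled:
    record gp's stack; gp.profiled = true

// Meanwhile, the profiler walks the remaining goroutines:
for each gp in allgs:
    if !gp.profiled:
        record gp's stack; gp.profiled = true
profileActive = false
```

The pause would then scale with the number of running goroutines (bounded by GOMAXPROCS) rather than with len(allgs), while the execute hook keeps each recorded stack consistent with the moment the world was stopped.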
I also want to see GoroutineProfile scale. On the other hand, we often use GoroutineProfile when investigating issues involving locks and the relationships among producer/consumer goroutines. It would be nice if we could still perform that kind of analysis with the goroutine profile after this issue is addressed.
Change https://go.dev/cl/387414 mentions this issue: runtime/pprof: stress test goroutine profiler
Change https://go.dev/cl/387415 mentions this issue: runtime: decrease STW pause for goroutine profile
Change https://go.dev/cl/387416 mentions this issue: runtime: check consistency of goroutine profile