runtime: GoroutineProfile causes latency spikes in high-goroutine-count application
dpolansky opened this issue · 5 comments
What version of Go are you using (go version)?
$ go version
go version devel +60f14fddfe Wed Jun 26 16:35:14 2019 +0000 darwin/amd64
Does this issue reproduce with the latest release?
The issue reproduces on go1.12.7 and go1.13beta1.
What operating system and processor architecture are you using (go env)?
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/dpolans/Library/Caches/go-build"
GOENV="/Users/dpolans/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/dpolans/dev"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/usr/local/go/src/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/cm/rp_0xvp15fn5mp4d672t2qsxkl2s83/T/go-build643112776=/tmp/go-build -gno-record-gcc-switches -fno-common"
What did you do?
I have a (production, latency-sensitive) Go service that has a median goroutine count of about 300,000 during steady state operations. The program periodically collects profiling data, including goroutine profiles.
What did you expect to see?
I expected the application’s performance behavior to be only slightly affected by collecting the goroutine profile (as is the case for collecting a heap profile, mutex contention profile, or CPU profile). I expected the magnitude of the latency impact to not scale with the number of goroutines (as the performance impact of heap profiling doesn’t change significantly with the size of the heap).
I expect some latency impact from a stop-the-world pause to prepare for the profile. I do not expect the duration of that pause to scale in proportion to the number of goroutines.
What did you see instead?
When goroutine profiling is enabled, the 99th-percentile latency of certain operations increases significantly. I have other programs with far fewer goroutines that collect goroutine profiles without incurring the same latency spikes.
The benchmark attempts to replicate the latency spike by emulating the work done by the real application. The real application has one goroutine that periodically requests a goroutine profile while the other goroutines do some operation and report how long it took. The benchmark sets up a scenario with one goroutine that continuously collects goroutine profiles, n idle goroutines, and one goroutine that reports the duration of sleeping for 1 millisecond. The benchmark runs with several n values to demonstrate the latency impact as n grows. The results here show that when a goroutine profile is being collected, the p90, p99, and p99.9 wall-clock time the reproducer takes to sleep for 1ms increases with the number of live goroutines.
$ go test -bench=. -benchtime 1000x
goos: darwin
goarch: amd64
pkg: playground/runtime-slow-goroutineprofile
BenchmarkGoroutineProfile/1e5_other-8 1000 1369474 ns/op 4.97 ms-max 0.477 ms-p90 0.507 ms-p99 0.532 ms-p99.9 1001 work-count
BenchmarkGoroutineProfile/2e5_other-8 1000 1376884 ns/op 0.538 ms-max 0.477 ms-p90 0.516 ms-p99 0.536 ms-p99.9 999 work-count
BenchmarkGoroutineProfile/5e5_other-8 1000 1365024 ns/op 4.89 ms-max 0.476 ms-p90 0.507 ms-p99 0.544 ms-p99.9 1001 work-count
BenchmarkGoroutineProfile/1e6_other-8 1000 1377884 ns/op 3.91 ms-max 0.478 ms-p90 0.522 ms-p99 0.900 ms-p99.9 1002 work-count
BenchmarkGoroutineProfile/1e5_profile-8 1000 83836454 ns/op 344 ms-max 148 ms-p90 284 ms-p99 338 ms-p99.9 579 work-count
BenchmarkGoroutineProfile/2e5_profile-8 1000 111897245 ns/op 558 ms-max 261 ms-p90 342 ms-p99 514 ms-p99.9 430 work-count
BenchmarkGoroutineProfile/5e5_profile-8 1000 238983513 ns/op 1383 ms-max 629 ms-p90 1205 ms-p99 1318 ms-p99.9 379 work-count
BenchmarkGoroutineProfile/1e6_profile-8 1000 129128664 ns/op 2409 ms-max 1086 ms-p90 1234 ms-p99 2405 ms-p99.9 106 work-count
PASS
ok playground/runtime-slow-goroutineprofile 607.172s
Reproducer:
package issue

import (
	"context"
	"runtime"
	"sort"
	"sync"
	"sync/atomic"
	"testing"
	"time"
)

func BenchmarkGoroutineProfile(b *testing.B) {
	testcase := func(n int, profile bool) func(b *testing.B) {
		return func(b *testing.B) {
			var wg sync.WaitGroup
			ctx, cancel := context.WithCancel(context.Background())
			sr := make([]runtime.StackRecord, 2e6)

			// create bystanders
			for i := 0; i < n; i++ {
				wg.Add(1)
				go func() {
					<-ctx.Done()
					wg.Done()
				}()
			}

			// give the bystanders a moment to start up
			time.Sleep(1000 * time.Millisecond)

			// our villain
			var workCount uint64
			wg.Add(1)
			go func() {
				defer wg.Done()
				for ctx.Err() == nil {
					atomic.AddUint64(&workCount, 1)
					if profile {
						runtime.GoroutineProfile(sr) // this function iterates over runtime.allgs while all other work is stopped
					}
					time.Sleep(1 * time.Millisecond)
				}
			}()

			// record how long it takes to sleep for 1 millisecond b.N times
			times := make([]float64, b.N)
			sleep := 1 * time.Millisecond
			b.ResetTimer()
			prev := time.Now()
			for i := 0; i < b.N; i++ {
				time.Sleep(sleep)
				now := time.Now()
				times[i] = (now.Sub(prev) - sleep).Seconds() / time.Millisecond.Seconds()
				prev = now
			}
			b.StopTimer()

			cancel()
			wg.Wait()

			sort.Sort(sort.Reverse(sort.Float64Slice(times)))
			b.ReportMetric(times[len(times)/10], "ms-p90")
			b.ReportMetric(times[len(times)/100], "ms-p99")
			b.ReportMetric(times[len(times)/1000], "ms-p99.9")
			b.ReportMetric(times[0], "ms-max")
			b.ReportMetric(float64(workCount), "work-count")
		}
	}
	b.Run("1e5 other", testcase(1e5, false))
	b.Run("2e5 other", testcase(2e5, false))
	b.Run("5e5 other", testcase(5e5, false))
	b.Run("1e6 other", testcase(1e6, false))
	b.Run("1e5 profile", testcase(1e5, true))
	b.Run("2e5 profile", testcase(2e5, true))
	b.Run("5e5 profile", testcase(5e5, true))
	b.Run("1e6 profile", testcase(1e6, true))
}
CC @aclements @rhysh
@aclements what do you think is the right approach here?
Today, runtime.GoroutineProfile stops the world and keeps it stopped while iterating over runtime.allgs. Could it instead inspect only the _Grunning goroutines (via allp) while the world is stopped, set a flag that's checked on the transition into _Grunning, and iterate over the rest of allgs while the program executes?
What level of consistency is required in the goroutine profile? Currently it's completely consistent, since the world is stopped. The function's docs also promise that the result slice won't be modified unless it writes out the full profile. The gcount function usually gives an estimate; is it safe to call while the world is stopped, and will that result in an exact count?
The GC has a couple of mechanisms for visiting all goroutines: stack scanning and stack shrinking. Extending those seems complex, but might lead to low execution overhead. The simpler option I can think of involves adding a field to runtime.g to track whether it has been included in the current profile, and adding a check to runtime.execute (the transition into the _Grunning state / the update of runtime.(*m).curg) to profile goroutines before they change their call stacks to be inconsistent with the rest of the profile. What level of complexity, and what level of overhead, would be acceptable in support of this fix?
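To make the shape of that simpler option concrete, here is a pseudocode sketch (not actual runtime code; the names profileActive and gp.profiled are invented for illustration):

```
// Hypothetical sketch of the proposed approach.
stopTheWorld()
for each _Grunning goroutine gp (found via allp):
    record gp's stack; gp.profiled = true
profileActive = true
startTheWorld()

// Concurrently, in runtime.execute (transition into _Grunning):
if profileActive && !gp.profiled:
    record gp's stack; gp.profiled = true

// Meanwhile, the profiler walks the remaining goroutines:
for each gp in allgs:
    if !gp.profiled:
        record gp's stack; gp.profiled = true
profileActive = false
```

The pause would then scale with the number of running goroutines (bounded by GOMAXPROCS) rather than with len(allgs), while the execute hook keeps each recorded stack consistent with the moment the world was stopped.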
I also want to see GoroutineProfile scale. On the other hand, we often use GoroutineProfile when investigating issues involving locks and the relationships among producer/consumer goroutines. It would be nice if we could still perform that kind of analysis with the goroutine profile after this issue is addressed.
Change https://go.dev/cl/387414 mentions this issue: runtime/pprof: stress test goroutine profiler
Change https://go.dev/cl/387415 mentions this issue: runtime: decrease STW pause for goroutine profile
Change https://go.dev/cl/387416 mentions this issue: runtime: check consistency of goroutine profile