golang/go

runtime: defer is slow

minux opened this issue · 42 comments

minux commented

package p

import "testing"

//go:noinline
func defers() (r int) {
        defer func() {
                r = 42
        }()
        return 0
}

func BenchmarkDefer(b *testing.B) {
        for i := 0; i < b.N; i++ {
                defers()
        }
}

On my system, BenchmarkDefer takes 77.7ns/op. This issue arises from the investigation of #9704: if I remove the "defer endcgo(mp)" and place the call at the end of func cgocall, the benchmark in #9704 improves from 144ns/op to 63.7ns/op. (Note: we can't eliminate the defer in func cgocall, though, as that would break defer/recover in the Go->C->Go scenario.)

minux commented

One way to improve defer for such simple cases is to have the compiler allocate the _defer record on the stack. But since a function can defer an unbounded number of functions, I'm not sure whether maintaining two different _defer allocation mechanisms is acceptable.

Some prior discussion at #6980.
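
For illustration, the distinction looks like this: in the first function below the number of defers is statically known, so the compiler could reserve the _defer record in the frame, while in the second it depends on runtime data (a sketch of the distinction only; boundedDefers and unboundedDefers are invented names):

package p

import "sync"

func boundedDefers(mu *sync.Mutex) {
        mu.Lock()
        defer mu.Unlock() // exactly one _defer record, known at compile time
}

func unboundedDefers(mus []*sync.Mutex) {
        for i := range mus {
                mus[i].Lock()
                defer mus[i].Unlock() // a new _defer on every iteration; count unknown statically
        }
}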

I also had a CL recently (CL 20512) where a single defer would have been the preferred solution, but it was not usable because of a 65ns performance hit.

So improving simple cases, like a single defer in some branch (which might call no other methods, or only a select few), would have helped there.

name old time/op new time/op delta 
SprintfInt-2 137ns ± 7% 202ns ± 6% +47.72% (p=0.000 n=20+20)

Given the current cost of defers, I think it's acceptable to have two defer allocation mechanisms if that addresses the problem. And I actually don't think the runtime side of this is very complicated.

This is on the list for 1.8. I'm planning to either do it myself or get someone else to do it. :)

If it turns out we need to simplify things, we could limit this to defers with at most one (possibly implicit) argument, which would handle the cgo case as well as the mutex unlock case. Another possible simplification would be to only stack-allocate defers in functions where all defers can be stack allocated, which would probably simplify creating an efficient prologue.

Separate from stack allocation, we should also consider special-casing defers with no arguments, as that is a fairly common case (about half of the defer calls in the standard library). Because the no-argument case doesn't have to worry about the arguments on the stack, it can use a simpler version of deferproc, one that doesn't need to call systemstack.
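
For example (an illustration only; the function names here are invented):

package p

import "sync"

var mu sync.Mutex

func oneImplicitArg() {
        mu.Lock()
        defer mu.Unlock() // one implicit argument: the receiver &mu, saved when the defer is queued
}

func noArgs() {
        defer cleanup() // no arguments, so nothing needs to be copied for the deferred call
}

func cleanup() {}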

@bmizerany sent https://golang.org/cl/29379 to replace a defer with explicit mutex unlocks in x/time/rate. Consider reverting that optimization if this more general optimization goes in.

/cc @dr2chase, who I've been talking with about compiler changes to support this.

@dr2chase and I have been discussing an alternate approach to this that's somewhat less general than stack-allocated defers but should have essentially zero overhead versus a function call when it applies.

The idea is to take a page out of the C++ exception handling book, open code the defer execution, and use a PC value table to figure out where the defer code is when handling a panic.

Specifically, if the set of defers can be statically determined at every PC in a function, then the compiler would turn those defers into closures built on the stack and generate code at every exit point to directly call the appropriate defer closures for that exit point. In the common case, then, there would be no deferreturn logic at all and the defer execution would look essentially like hand-expanding the defer (like what CL 29379 did).
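
For instance, with invented names mu and step, the open coding would conceptually rewrite

func update() error {
        mu.Lock()
        defer mu.Unlock()
        if err := step(); err != nil {
                return err
        }
        return nil
}

into

func update() error {
        mu.Lock()
        if err := step(); err != nil {
                mu.Unlock() // deferred call executed directly at this exit point
                return err
        }
        mu.Unlock() // and at this one
        return nil
}

with the panic path covered by the PC value table described below.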

To keep this working with panics, the compiler would generate a PC value table for every function where this optimization applies that logically records, for every PC, where in the stack frame to find the set of defer closures to run for that PC. The actual encoding of this table could be quite compact in general, since large runs of PCs will have the same set of defer closures, and we could encode the tree-like structure of the set of defers to run directly in this table, so each entry would contain at most one defer closure offset and the PC to use to look up the next defer closure offset.
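
A hypothetical shape for one entry of that table (the type and field names here are invented):

// Sketch of a per-function defer table entry: at most one closure offset,
// plus the PC at which to look up the next closure in the chain.
type deferTableEntry struct {
        closureOffset int32  // frame offset of the defer closure to run
        nextPC        uint32 // PC used to find the next entry in the chain
}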

When panic walks the stack, it would keep an eye on both this PC value table and the defer stack. A given frame could have either defers on the stack or a defer offset PC value table, but not both. If a frame has a defer offset PC value table, panic would use the table to find the defer closures and call them.

minux commented

I don't quite see how your proposal would correctly handle the case of a panic that occurs while executing a deferred function. At that point some of the deferred functions have been executed and some have not, but what is the PC value you will use to determine which remain to be executed?

A similar approach that might work would be to record a linked list of stack allocated deferred closures in the stack frame. You could even choose to always start the list from the same place in the stack frame, just below the return address/frame pointer, so you would only need a single bit in the traceback information. Then each time you defer a function, you update the linked list to point to the new stack allocated closure. At the end of the function, you remove each closure from the list as you execute it.

@minux - no. If there's any point in the generated code (which is not the same as the input program) that two different defer sets can reach, then the PC-range technique won't work. Unless, say, we record which defers need running by storing into a bitmask of defers-to-(not-)run; this would also handle Ian's problem. Store a zero byte on entry, store a different value on the branches to the common area that don't enable all the defers. The PC range has to mention the location of the which-defers-not-to-run byte, of course.
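
A sketch of that idea, with the compiler-maintained bookkeeping written as comments (the mask handling would be generated by the compiler, not hand-written):

func f(x bool) {
        // On entry the compiler stores a zero byte in the frame: deferBits = 0.
        defer a() // generated code sets deferBits |= 1
        if x {
                defer b() // generated code sets deferBits |= 2
        }
        // On panic, the PC range names the frame slot holding deferBits,
        // and only the defers whose bits are set are run.
}

func a() {}
func b() {}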

I don't quite see how your proposal would correctly handle the case of a panic that occurs while executing a deferred function. At that point some of the deferred functions have been executed and some have not, but what is the PC value you will use to determine which remain to be executed?

That's a good question. If you're simply in the function epilogue executing local defers and one of them panics, then the PC in the epilogue tells you what's left. However, if you're running defers because of a panic and one of them panics, then you're right that things get more complicated. One possibility would be that when a panic happens, before running any defers, you walk the stack to find any frames that have PC value-based defers and weave those into the linked list of defers at the right points. Then panic just works from the linked list. If a second panic happens while walking this linked list, we know that we're already handling a panic, so it wouldn't rebuild the list, it would just keep going from the current list.

A similar approach that might work would be to record a linked list of stack allocated deferred closures in the stack frame.

Possibly. I was hoping to avoid the overhead of even dealing with a linked list in the regular return case. If I understand your proposal, it seems like the cost at function exit wouldn't be substantially different from the cost of deferreturn today, which alone is about 36% of the time in @minux's benchmark.

I am imagining that given

defer f()
...
defer g()

the defer statements would be compiled as

deferredFunctions = &deferredFunction{next: deferredFunctions, run: f}
...
deferredFunctions = &deferredFunction{next: deferredFunctions, run: g}

and the function exit sequence would be compiled as

deferredFunctions = deferredFunctions.next
g()
deferredFunctions = deferredFunctions.next
f()

So I don't think the performance would be like that of deferreturn. The compiler would of course have to ensure that deferredFunctions was stored on the stack and that it was always accurate before any function call. And this approach would take extra execution space, as there would in effect be two copies of each deferred function; I suspect that is true of any efficient implementation.
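
Written out as ordinary Go, the shape described above would look roughly like this (a hand-written analog; the compiler would generate the list maintenance):

type deferredFunction struct {
        next *deferredFunction
        run  func()
}

func example(f, g func()) {
        var deferredFunctions *deferredFunction
        deferredFunctions = &deferredFunction{next: deferredFunctions, run: f}
        // ...
        deferredFunctions = &deferredFunction{next: deferredFunctions, run: g}

        // Function exit: each closure is removed from the list before it is
        // called, so a panic inside g() will not cause g() to run a second time.
        deferredFunctions = deferredFunctions.next
        g()
        deferredFunctions = deferredFunctions.next
        f()
}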

I see. I thought you were saying the function return would use the linked list (rather than just unwinding it for the benefit of a panic). I imagine that would perform well.

I analyzed the performance of the following benchmark to understand better where the costs really are:

package main

import (
    "sync"
    "testing"
)

var mu sync.Mutex

//go:noinline
func deferLock() {
    mu.Lock()
    defer mu.Unlock()
}

//go:noinline
func noDeferLock() {
    mu.Lock()
    mu.Unlock()
}

func BenchmarkDeferLock(b *testing.B) {
    for i := 0; i < b.N; i++ {
        deferLock()
    }
}

func BenchmarkNoDeferLock(b *testing.B) {
    for i := 0; i < b.N; i++ {
        noDeferLock()
    }
}

This is similar to @minux's benchmark, but slightly more realistic. The results on my laptop are:

BenchmarkDeferLock-4        200000000           83.7 ns/op
BenchmarkNoDeferLock-4      1000000000          16.1 ns/op

That is, the defer-based benchmark is 5.2x as expensive. The 83.7 ns in BenchmarkDeferLock breaks down as follows:

deferproc 23ns
  allocate defer 10ns

deferreturn 22ns
  free defer 9ns
  jump defer 4.8ns

Surprisingly, relatively little time is spent actually allocating or freeing defers. Most of the time is spread out across bookkeeping and shuffling frames around. I can't tell where the remaining 23ns of difference are going.

CL https://golang.org/cl/29655 mentions this issue.

CL https://golang.org/cl/29656 mentions this issue.

We've doubled the performance of defer for Go 1.8, but it could still be faster, so I'm moving this to Go 1.9.

Notably, defer + lambda + method call is faster than defer + method call.
I wrote a simple program that compares the performance of

defer mutex.Unlock()

and

defer func() { mutex.Unlock() }()
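
The comparison can be reproduced with a benchmark pair along these lines (a sketch; the original program from this comment is not shown):

package p

import (
        "sync"
        "testing"
)

var mu sync.Mutex

//go:noinline
func withLambda() {
        mu.Lock()
        defer func() { mu.Unlock() }()
}

//go:noinline
func withoutLambda() {
        mu.Lock()
        defer mu.Unlock()
}

func BenchmarkWithLambda(b *testing.B) {
        for i := 0; i < b.N; i++ {
                withLambda()
        }
}

func BenchmarkWithoutLambda(b *testing.B) {
        for i := 0; i < b.N; i++ {
                withoutLambda()
        }
}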

The results are as follows for the master branch (ee392ac):

With lambda: 46.691357 ns
Without lambda: 50.014426 ns

My explanation: defer mutex.Unlock() is probably slower because it has to put the argument (mutex) on the stack. In the lambda case, mutex is not an argument; it is a captured variable, so defer doesn't need to put it on the stack.

The speedup from the lambda version is 1.07x. Can the Go compiler detect cases where a defer's arguments do not change (like mutex in my code) and introduce a lambda function to get that speedup?

P.S. I also ran the program under Go 1.8.3 and got the following output:

With lambda: 45.962607 ns
Without lambda: 48.770406 ns

It demonstrates almost the same speedup from lambdas (1.06x), but the more important part is that master runs slower than 1.8.3. Shouldn't it have become faster after https://go-review.googlesource.com/c/29656/ ?

@starius, interesting case.

The speedup from the lambda version is 1.07x. Can the Go compiler detect cases where a defer's arguments do not change (like mutex in my code) and introduce a lambda function to get that speedup?

As you point out, it's important that mutex doesn't change, or else this transformation isn't actually sound. Unfortunately, while that sort of thing is easy to detect in SSA, defer is mostly implemented in the front-end, where it's much harder to tell.

For the defer optimization I'd like to ultimately do, though, I suspect this would fall out naturally, since the defer call would simply be transformed into a function call at the end of the function (plus some panic handling metadata), so SSA would recognize that the argument saved at the defer point is unchanged at the epilogue and CSE them.

It demonstrates almost the same speedup from lambdas (1.06x), but the more important part is that master runs slower than 1.8.3. Shouldn't it have become faster after https://go-review.googlesource.com/c/29656/ ?

I suspect this is noise. CL 29656 was released in Go 1.8, and the defer implementation hasn't changed significantly (perhaps at all) between 1.8 and master.

CAFxX commented

As you point out, it's important that mutex doesn't change, or else this transformation isn't actually sound.

Wouldn't it be sound if the transformation yielded the following instead?

{
  receiver := mutex
  defer func() { receiver.Unlock() }()
}
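
Assuming mutex is a pointer (e.g. a *sync.Mutex), the snapshot restores the original semantics: defer mutex.Unlock() evaluates its receiver at the point of the defer statement, and copying mutex into receiver at that same point means a later reassignment of mutex no longer affects which lock is released. (For a value-typed sync.Mutex the receiver is the address &mutex, so the snapshot would have to capture the address rather than the value.)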

With check and handle on the way, and with handle overlapping with defer in semantics but likely not merged by the time check is merged, defer will be used more and more, for example like this (2).

defer is also very handy with things like mutex.Unlock() and, even if there are multiple defers in one function, in practice they are seldom used outside of the main function control flow. That means that the case where all defers in a function are set up unconditionally is common. In turn, that means that a PC-table-based approach, to know which defers to invoke on panic, could in practice be used for most functions.

This issue has been with us for some years already. Given that, in many cases, using defer turns out to be prohibitively expensive, please give it priority in Go 1.13.

Thanks in advance,
Kirill

FWIW, at this point the only time I avoid defer is when I’m writing hot, tiny routines, usually involving a mutex. Open coding defers in “easy” cases only would handle that.

@aclements, thanks for the feedback. A common case where defer is expensive is short and medium-sized functions that need to lock/unlock a mutex. See, e.g., 0b9c1ad2 (#22401) and the many mentions in this issue of people having to avoid defer for performance reasons. Among developers it is now somewhat commonplace to avoid defer, because it has a reputation for being slow. That is a pity, because defer is a good language construct and makes code clearer.

I understand everyone is busy and no one is "resting on laurels". All I'm asking is that you consider raising the priority of fixing defer's performance during the Go 1.13 cycle.

The case to fix is when all defers in a function are invoked unconditionally, so they could be appended to the function tail as regular calls by the code generator, plus a PC→(which defers to call) table for panic.

Thanks again,
Kirill

(@josharian thanks for feedback too)

rhysh commented

If we had more data on this, perhaps we could focus on those cases.

I'm working on a high-throughput HTTPS server which shows the cost of defer in crypto/tls and internal/poll fairly clearly in its CPU profiles. The following five defer sites account for several percent of the application's CPU usage, split fairly evenly between each.

These particular defers are in place unconditionally, so a minimal PC-range-based compiler change as @dr2chase and @aclements discussed in September 2016 might be enough.

Change https://golang.org/cl/171758 mentions this issue: cmd/compile,runtime: allocate defer records on the stack

@randall77, thanks a lot for fff4f59.

Change https://golang.org/cl/190098 mentions this issue: cmd/compile, cmd/link, runtime: make defers low-cost through inline code and extra funcdata

@rhysh Go 1.13 already improved defer performance by allocating defer records on the stack. And we have just checked into the main tree (for release in Go 1.14) a bigger change, https://golang.org/cl/190098, that makes deferred calls directly (inline) at normal exits. This should reduce overhead even more significantly in many cases.

So, if you are still seeing defer overheads for your server, it would be great to see whether the overheads have gone down with Go 1.13 (if you haven't already upgraded) or with the changes in the main tree (or with the beta release of Go 1.14 in early November).

Change https://golang.org/cl/202340 mentions this issue: cmd/compile, cmd/link, runtime: make defers low-cost through inline code and extra funcdata

rhysh commented

@danscales , the results for https://golang.org/cl/202340 look great: it eliminates about 80% of defer's direct CPU cost in the application I described. A small change to crypto/tls would fix the remaining case.

For go1.12.12, go1.13.3, and be64a19, I counted profile samples that have crypto/tls.(*Conn).Write on the stack (with -focus=tls...Conn..Write) and profile samples that additionally have defer-related functions on the stack (saving the prior results and adding on -focus='^runtime\.(deferproc|deferreturn|jmpdefer)$'). With go1.12.12, about 2.5% of time in crypto/tls.(*Conn).Write is spent on defer. With go1.13.3, it's about 1.5%. With the development branch at be64a19, it's down to 0.5%.

Zooming out to the application's total CPU spend on defer-related functions, more than 90% of the samples are caused by a single use of defer "in" a loop in crypto/tls.(*Conn).Write. If that call were outside of the loop—or if the compiler could prove that it's effectively outside the loop already—then CL 202340 would all but eliminate the CPU cost of defer for the application (down to roughly 0.01%).

Thank you!

Change https://golang.org/cl/203481 mentions this issue: crypto/tls: move a defer out of a loop
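
The general shape of such a fix, hoisting the loop body into its own function so the defer is no longer inside a loop, looks like this (a generic sketch with invented names, not the actual crypto/tls change):

package p

import "os"

// With the defer inside the loop, one Close is queued per iteration, none of
// them run until the function returns, and the defer records cannot be
// stack-allocated or open-coded.
func readAll(names []string) error {
        for _, name := range names {
                f, err := os.Open(name)
                if err != nil {
                        return err
                }
                defer f.Close()
                // ... use f ...
        }
        return nil
}

// Moving the body into a function runs each Close at the end of its own
// iteration and lets the compiler use the fast path for the defer.
func readAllHoisted(names []string) error {
        for _, name := range names {
                err := func() error {
                        f, err := os.Open(name)
                        if err != nil {
                                return err
                        }
                        defer f.Close()
                        // ... use f ...
                        return nil
                }()
                if err != nil {
                        return err
                }
        }
        return nil
}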

rasky commented

How’s that defer “in a loop”? It does look like it from a source-code point of view, but in SSA it is in a block which is no more “part of the loop” than the first block after the loop in source-code order. It sounds like the loop detection made by the defer optimization is slightly off.

@rasky the loop detection occurs during escape analysis, not SSA.

rasky commented

Can we close this issue? It doesn't seem there's much left. Defers in a loop are always going to be somewhat slower.

@rhysh Glad to hear the results you have measured! And, as a number of folks have already pointed out, there are further optimizations we can do to tighten up the inline defer code, eliminate some of the checks, etc.

@danscales, thanks a lot for fixing defer slowness!

I think this is done, so I am going to close it. We can always make things faster. If there are specific ideas for making defers faster, let's do them in separate issues.

Thanks to all, especially @danscales.