golang/go

cmd/compile: enable mid-stack inlining

davidlazar opened this issue · 61 comments

CL https://golang.org/cl/37231 mentions this issue.

CL https://golang.org/cl/37233 mentions this issue.

This is awesome.

I probably missed some discussion, but is there a design doc or proposal doc I can look at?

Out of curiosity, is there a plan to emit this information as part of DWARF? It would be a nice feature if debuggers could access the InlTree info (right now they can't print correct backtraces for inlined calls; I confirmed with 781fd39).

There is an outdated proposal doc. I'll update and publish it this week. In the meantime, these slides give an overview of the approach: https://golang.org/s/go19inliningtalk

I haven't looked at the DWARF yet, but the plan is to add inlining info to the DWARF tables before we turn on mid-stack inlining for 1.9.

CL https://golang.org/cl/37854 mentions this issue.

CL https://golang.org/cl/38090 mentions this issue.

rsc commented

It seems clear we're going to do this, assuming the right tuning (not yet done!). The tuning itself doesn't have to go through the proposal process. Accepting proposal.

It seems that func Caller(skip int) in runtime/extern.go also needs to be updated for this change, as it currently calls findfunc(pc), similarly to FuncForPC.

Indeed. I have a CL that updates runtime.Caller but haven't mailed it out yet.
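For background, the inlining-aware way for user code to walk a stack is runtime.Callers plus runtime.CallersFrames, which expands a physical frame into its logical inlined frames, whereas a raw PC lookup (what FuncForPC and the current runtime.Caller path do via findfunc) sees only the physical frame. A minimal sketch:

package main

import (
	"fmt"
	"runtime"
)

func main() {
	pc := make([]uintptr, 16)
	n := runtime.Callers(1, pc) // skip=1 omits the runtime.Callers frame itself
	frames := runtime.CallersFrames(pc[:n])
	for {
		frame, more := frames.Next() // one entry per logical frame, inlined or not
		fmt.Printf("%s\n\t%s:%d\n", frame.Function, frame.File, frame.Line)
		if !more {
			break
		}
	}
}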

CL https://golang.org/cl/40270 mentions this issue.

Is -l=4 going to be the default for Go 1.9?

Not yet; it has high compilation costs, largely because we need to be much pickier about how we read export data.

Is -l=4 going to be the default for Go 1.9?

No.

Maybe in Go 1.10.

Is the recommendation then to use "-l=4" in Go 1.9 for production builds where runtime performance is paramount?

rsc commented

No, -l=4 is explicitly untested and unsupported for production use. If you do that and you get programs that break, you get to keep both pieces.

Change https://golang.org/cl/74110 mentions this issue: cmd/compile: don't export unreachable inline method bodies

I guess we're not going to enable this by default for Go 1.10? @aclements

Hi, is this still planned for Go 1.11? It's now been about 3 months since it was punted to Go 1.11, and about a year since this was published and prototyped. It would be nice to get this in for Go 1.11. At a minimum, it makes reflection faster (all those reflect methods that contain a panic, which currently makes them non-inlineable, can now be inlined), which makes many heavily-used things faster (printing via fmt, json encoding, etc.) and will cause a significant jump in performance for most libraries by eliding the function-call overhead for delegate functions, etc.

I know there was talk of export format changes blocking this. Is that done yet?

Thanks. I am writing this because it's on my list of things I am excited about for Go 1.11, along with faster defer and support for co-operative coroutines (i.e. the scheduler optimizing the case where two goroutines serve as producer and consumer on a chan and can be scheduled "together" instead of each send/receive doing a round-robin over all goroutines). Rust is also getting extremely compelling by September 2018, and it would be nice for performance to be "comparable".

As they say, optimizations drive people to code the right way, as they don't see a loss. Without mid-stack inlining, I have written code where I have "manually inlined" functions to get better performance in my library, and I never use defer in my libraries because of the performance hit. That's the kind of mental overhead that I would like to avoid.
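To make the reflect/panic point concrete: a guard-style helper like the sketch below (illustrative names, not actual reflect code) is currently rejected by the inliner solely because its body contains a call to panic; once calling panic no longer disqualifies a function, the common fast path is inlined at every call site.

// index is a bounds-checked accessor. Today the panic statement alone
// makes it non-inlineable; with panic allowed, the whole body is
// cheap enough to inline.
func index(b []byte, i int) byte {
	if i >= len(b) {
		panic("index out of range")
	}
	return b[i]
}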

@aclements @rsc @davidlazar @cespare @dr2chase

We'd like to get this done for 1.11.

I think the needed export format changes are done. I think Matthew has some more changes lined up to make things better (compile-time faster), but at this point they aren't blockers.

The major TODO at this point is to tune the inlining heuristic. Mid-stack inlining helps runtime speed, but it can make binaries bigger. A lot bigger in some cases; cmd/compile's text segment gets ~100% bigger. I don't think that's launchable as-is, so we need to figure out the right way to tweak the heuristics to preserve as much speed as we can while keeping binary size manageable. Ideas welcome; there's no obvious plan of attack here.

Yes, we'd definitely like to get rid of all the situations where people have had to manually inline things.

mvdan commented

Has any thought been given to enabling a conservative version of mid-stack inlining in 1.11? That is, only doing the extra inlining where it means little or no increase in binary size.

@randall77 would you consider having a conservative version, as @mvdan suggested, but also allowing users to experiment with this via a compiler directive, like //go:inline, which would perform the inlining but only up to a defined maximum complexity?

@mvdan: That's an option. It's not trivial to do, though, as at inlining decision time we don't know what the final binary size difference is going to end up being. We have to more or less guess based on the info we do have (# and kind of AST nodes).
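As an illustration of guessing from AST nodes, here is a hedged sketch of a node-counting cost estimate. It is not the compiler's implementation (the real code lives in cmd/compile and walks the compiler's own IR; all names here are invented), but it shows the shape of the heuristic: count nodes, charge calls extra.

package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// estimateCost walks a function body, counting nodes and charging
// call expressions extra, analogous to an inlineExtraCallCost.
func estimateCost(body *ast.BlockStmt, extraCallCost int) int {
	cost := 0
	ast.Inspect(body, func(n ast.Node) bool {
		if n == nil { // Inspect's post-visit callback
			return true
		}
		if _, ok := n.(*ast.CallExpr); ok {
			cost += extraCallCost
		} else {
			cost++
		}
		return true
	})
	return cost
}

func main() {
	src := `package p
func f(a, b int) int {
	if a > b {
		return g(a)
	}
	return a + b
}`
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "p.go", src, 0)
	if err != nil {
		panic(err)
	}
	for _, d := range file.Decls {
		if fn, ok := d.(*ast.FuncDecl); ok {
			fmt.Println(fn.Name.Name, "cost:", estimateCost(fn.Body, 57))
		}
	}
}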

@dlsniper: I'd like to avoid a //go:inline comment if we can. I don't think it solves the problem well, as the inlining decision should probably depend on characteristics of the call site (e.g. in a loop, constant arguments, etc.), not just the function being called.

mvdan commented

I was thinking conservative in terms of the heuristic. For example, every extra level of inlining could increase the cost of the function by a constant, or by a percentage.

I assume that this will come down to lots of testing and gathering of data, though. I'm not sure how useful it is to throw ideas at this issue before then :)

CAFxX commented

Maybe a silly idea but... What if, at least for this first version, mid-stack inlining was enabled only for functions that are transitively statically reachable from a benchmark function in the same package? It would be nice to extend this to use actual profiling information in the future.

CAFxX commented

@randall77 what if the //go:inline was used to mark the callsite instead?

Transitively reachable from a benchmark function sounds problematic. When doing a non-test build, the compiler never sees _test.go files, which is where all the Benchmark functions tend to be. And having the existence of a Benchmark function affect the performance of the function being benchmarked sounds like a recipe for HeisenBugs. On the plus side, though, it would encourage the writing of Benchmark functions.

The compiler has no support for //go: directives at statement or expression scope, only at global scope. Not that it couldn't be added, but it's significant work.

Is there a way to track the inlined code such that #14840 would then be useful and eliminate more dead code? The inlining process is going to touch the linker anyway; might as well make it useful?

@chewxy The inlining process does not involve the linker. The compiler does ~all the work.

The inlining process will detect and remove dead code. If that ends up removing the last reference to a global, that global will be removed by the linker. But I don't think that will help with #14840, which is about globals with init functions.

CAFxX commented

The major TODO at this point is to tune the inlining heuristic. Mid-stack inlining helps runtime speed, but it can make binaries bigger. A lot bigger in some cases; cmd/compile's text segment gets ~100% bigger. I don't think that's launchable as-is, so we need to figure out the right way to tweak the heuristics to preserve as much speed as we can while keeping binary size manageable. Ideas welcome; there's no obvious plan of attack here.

Silly idea #2: how about brute-forcing this? 💪

  • Grab an intern 🥇
  • Gather a corpus of Go code with (macro?)benchmarks
  • For each benchmark, measure speed (+allocations?) and text size with inlining disabled (baseline)
  • For each benchmark, measure the same as above with "random" inlining decisions in the functions that are transitively called by it; have the compiler log those decisions (repeat this step many times to generate many measures)
  • Run some fancy ML method on the corpus of inlining decisions and benchmark results (relative to the baseline) to identify a set of inlining heuristics that yield good improvements at the expense of a reasonable increase in text size. 〰️👋
  • Profit! Implement the heuristics identified above in the inliner 👌

As a bonus point, the intern gets to write a paper about this. 🤣

@dlsniper: I'd like to avoid a //go:inline comment if we can. I don't think it solves the problem well, as the inlining decision should probably depend on characteristics of the call site (e.g. in a loop, constant arguments, etc.), not just the function being called.

The compiler has no support for //go: directives at statement or expression scope, only at global scope. Not that it couldn't be added, but it's significant work.

@randall77 thank you for replying so quickly on this. I understand that there is a fair amount of work, and at the same time concern about how users will use this functionality. I think that the approach of having this enabled by default but with conservative defaults would be a good start.

However, what I have in mind when suggesting the introduction of a //go:inline that could be added at the call site is that experienced users will understand how to use it and will be able to verify, via benchmarks, which approach works better for them when the compiler defaults are not enough.

From there, feedback could be collected by observing how this is used in the wild, in order to allow further experimentation with / changes to the heuristics / defaults. Much like what @CAFxX suggested, but without dedicating an intern and a lot of hardware to running benchmarks. For example, in all of my use-cases so far, I would gladly trade a few more MB of binary size for better runtime speed. I understand that others may not wish to do the same, which is why I think that satisfying all these requirements would be better left to the users.

One of the other interesting options of having this as a compiler directive is that it allows fine-tuning the standard library code by performing analysis on the existing benchmarks.

  • Do I think it could potentially be abused / misused by users who do not understand what this option does? Yes, I do. But then the burden would be entirely on the users rather than on the Go team to figure out "the best" way to move forward with this.
  • Who are the people that I target with this option? This allows people who understand what they are doing to further fine-tune their code at a level they do not have access to today, which I believe is a good step in the direction of giving some control to the users while providing solid defaults.
  • Do I like the idea of introducing more magic directives to the compiler? I do not, but I also do not see another way to give these hints to the compiler.

Hope this helps. I'll continue to watch this issue and look forward to how this will work out. Thank you.

@dlsniper @randall77 My only concern with enforcing //go:inline is that it only scales for final executables, not for libraries. Imagine I put //go:inline all over my lib, and a user depends on my lib. It wasn't the user's decision - it was the author of the lib forcing his decision on the users.

If we do //go:inline, let it be a hint to the compiler: if this function doesn't make the cut but is within reason, please inline it. E.g. say only functions up to a cost of 10 are inlined, but my function has a cost of 12, meaning it will not be inlined by default. But as the cost is within 30% over the threshold (i.e. cost less than 10 + 30% = 13), and the author says "please inline", it will be inlined; if the cost is more than 30% over, the hint will be disregarded.

This is similar to how the C++ inline keyword works: as a hint.
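A minimal sketch of that rule, with invented names (neither //go:inline nor any of these identifiers exist in the compiler):

// shouldInline is a hypothetical budget check: the hint stretches the
// threshold by 30%, it never forces inlining outright.
func shouldInline(cost, budget int, hinted bool) bool {
	if cost <= budget {
		return true // under budget: inlined with or without the hint
	}
	// hinted and strictly within 30% over budget,
	// e.g. cost 12 against budget 10 (12 < 13)
	return hinted && cost*10 < budget*13
}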

Now, I personally don't want a //go:inline. I prefer that the "general" (conservative) rules/heuristics for inlining are reasonable and fair and known and published. The compiler can still tweak outside of the general/published/conservative rules, but authors will work within those published ones and be happy.

My 2 cents.

mvdan commented

As far as I know, it has been core to Go's design (including its compiler) to have as few knobs and flags as possible. This includes flags like -O4 and compiler directives in the code.

There have been no knobs to control inlining until now; why should enabling mid-stack inlining change that?

The problem with inlining directives is that people are notoriously bad at maintaining them as needs shift and code changes. We've resisted exposing inlining directives even to the runtime (which has several directives not available to user code) because we know they'll get stale and lead to code bloat and worse performance. Instead, we have a test that checks that key functions are being inlined by the compiler's heuristics, and even that list gets out of date quickly.
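For anyone who wants to see what the heuristics decide for their own code, the compiler reports its inlining decisions under -gcflags=-m; the test mentioned above asserts on exactly such decisions for key functions. A tiny sketch (file name hypothetical):

// inlcheck.go
package p

func Add(a, b int) int { return a + b }

Building with go build -gcflags=-m inlcheck.go prints "can inline Add" (with the file/line position) when the function fits the budget.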

Our long-term (albeit vague) plan is to use profile-guided optimization to make inlining decisions, rather than hand-crafted heuristics or developer annotations. It'll take a while to get there, but it fits very nicely with the Go model of doing the right thing automatically.

@dlsniper @randall77 @aclements Given that the code freeze is in a week, will this be making it into Go 1.11 in some form? There seems to have been zero movement here.

There are some clear wins here, even with simple heuristics, e.g. inlining leaf functions that panic, short delegate functions that just call other functions, leaf functions with switch statements, etc. These will make reflect faster, which will impact just about every Go program (using json, fmt, etc.).

Thanks.

We're not sure yet. We'd like to get something in, but the current heuristics are too aggressive. We've seen code size blowups of 100%. We've also seen net slowdowns.

We're thinking of trying to enable this for 1.11, but with a much stricter heuristic. But we don't know what that heuristic might be yet. Unfortunately, this is no one's top priority at the moment.

If you have particular programs that do get significant speedups from mid-stack inlining, please post them. It will help us guide the choice of heuristic.

See also: https://go-review.googlesource.com/c/go/+/109918

Quick summary of where we are:

  • calling panic no longer forbids inlining.
  • -l=4 gets midstack inlining, if you want to play with it. Binaries get bigger, compiles take longer. We're interested in feedback on how this works for people, especially when it doesn't work.
  • the compiler itself is not helped by midstack inlining; a -l=4-built (but not -l=4-compiling) compiler runs slower than normal.
  • some benchmarks speed up nicely

The bigger+slower compiler is worrisome, which is the main reason this is not enabled yet; if this happened to your binary, you'd not be happy. The minimum plan is to understand how to manage inlining so bigger+slower at least doesn't happen to the compiler, and hope that it generalizes. A more ambitious plan is to build some sort of a feedback framework so that it's clear where inlining would actually help, instead of just guessing. Or we could use machine learning....
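For anyone who wants to play with -l=4 as mentioned above, the inlining level is a flag on the gc compiler; a typical invocation (the all= prefix, available since Go 1.10, applies the flag to every package in the build rather than only the named ones):

go build -gcflags='all=-l=4' ./...
go test -gcflags='all=-l=4' -bench=. ./...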

CAFxX commented

Is -l=4 still unsupported for production use? Or is it now supported for production but with potential performance regressions (like, say, -O3)?

I am not sure of the official position, and it's not tested as much as it should be (i.e., I need to see whether we can have a -l=4 test box), but it's supposed to at least execute correctly and we'd like to know when it doesn't, which I think is different from "you own the pieces". Debugging is also not well-tested for -l=4 binaries.

I've been rebenchmarking the compiler to check how inlining changes its performance, and the short answer is that it isn't worse, but it isn't better either; without a noinline annotation on one method it doubles the size of the binary (with the annotation, it's only 50% larger). I don't think we want noinline annotations to become part of common practice for using Go (we use them in tests, where they're very helpful), but on the other hand they can also be a good way of figuring out the sorts of inlining mistakes the compiler needs to avoid in order to turn this on in general.

I ran some of our BLAS benchmarks with no significant effect. Do you know if functions with asm stubs can be inlined (assuming the build tags are such that the asm is not actually used)?

I've been rebenchmarking the compiler to check how inlining changes its performance, and the short answer is that it isn't worse, but it isn't better either; without a noinline annotation on one method it doubles the size of the binary (with the annotation, it's only 50% larger).

Is this a statement for amd64 or for other GOARCHes too? I would expect to see more improvement on ppc64x because of the high cost of loading and storing the arguments and return values.

Ping. Any chance this gets in for Go 1.12?

Someone needs to look into better heuristics, because the current rules tend to bloat the generated binary. That someone is not supposed to be me, though I really want it to happen.

Phew - this feature may never get done ;(

Change https://golang.org/cl/147361 mentions this issue: cmd/compile: encourage inlining of functions with single-call bodies

Do we call this fixed (1.12) or work-in-progress (1.13)?
Either way, we're not done with inlining, but we're also unlikely to do more in 1.12.

I'm happy to punt any future work to 1.13.

@dr2chase @randall77

https://golang.org/cl/147361 sets inlineExtraCallCost = 60.

I want to make an argument for setting inlineExtraCallCost = 56. This is a similarly conservative value (like 60), preserves the original premise of allowing at most 1 call for inlining (two calls at 56 already cost 112, well over the budget of 80), maintains a similar <5% increase in the cmd/compile and cmd/go binaries, and allows slightly more code to be inlined.

I captured most of my arguments in https://golang.org/cl/147361 , but want to capture them here in the issue so they don't get lost.

To illustrate, I will first show the cost increases for cmd/go and cmd/compile for various settings of inlineExtraCallCost. Then I will show some typical sample code that costs just 1-4 more than the budget - switching from 60 to 56 will allow these to be inlined.

Cost increases for cmd/go and cmd/compile for various settings of inlineExtraCallCost

I updated $GOROOT/src/cmd/compile/internal/gc/inl.go to set inlineExtraCallCost to 80, 60, 56, 55, 54, 53, 50, 41, 40, 30, and 1 in turn, then ran make.bash and collected the sizes of $GOROOT/bin/go and $GOROOT/pkg/tool/darwin_amd64/compile. I then checked how each increased against the baseline of 80 (the value as of Go 1.11).

Results:

cc = 60: go: +2.945%, compile: +4.049%
cc = 56: go: +3.243%, compile: +4.992%
cc = 55: go: +3.362%, compile: +5.224%
cc = 54: go: +3.297%, compile: +12.178%
cc = 53: go: +3.354%, compile: +12.213%
cc = 50: go: +3.502%, compile: +12.352%
cc = 41: go: +4.133%, compile: +12.585%
cc = 40: go: +4.167%, compile: +12.621%
cc = 30: go: +4.802%, compile: +15.026%
cc = 1: go: +13.013%, compile: +32.246%

This shows that, up to about cc=55, the size increases are modest (in line with what cc=60 gives); between cc=55 and cc=54, the compile binary jumps from +5.2% to +12.2%.
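A hedged sketch of that measurement loop (assuming a darwin_amd64 tree as the paths above suggest, BSD/macOS sed and stat, and that the constant is spelled inlineExtraCallCost = N in inl.go):

cd $GOROOT/src
for cc in 80 60 56 55 54 53 50 41 40 30 1; do
  # rewrite the constant, rebuild the toolchain, record binary sizes
  sed -i '' -E "s/inlineExtraCallCost = [0-9]+/inlineExtraCallCost = ${cc}/" cmd/compile/internal/gc/inl.go
  ./make.bash >/dev/null
  echo "cc=${cc} go=$(stat -f%z $GOROOT/bin/go) compile=$(stat -f%z $GOROOT/pkg/tool/darwin_amd64/compile)"
done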

Typical sample code which costs just 1-4 more than the budget

The sample code which illustrates my usage is below. My library (github.com/ugorji/go/codec) is an encoder/decoder which can work off a []byte or an io.Reader/Writer.

// +build ignore

// To test this (assuming file is called inlining.go), use:
//
// go build -gcflags -m=2 inlining.go 2>&1 | grep "cannot inline" | grep -v "go:noinline"
// go run inlining.go
//
// Ideally, this is a buffered reader/writer, where you are reading/writing bytes a few at a time.
// If buffer holds 4096, and you read a token at a time (as in a decoder), then you
// may read 4096 times before having to fill again. Each read is just getting an element
// of an array, and incrementing a cursor.
//
// Paying the cost of a method call is too much.
// Yet that cost is paid, for the rare times that a fill() is needed.
//
// Note: inlineExtraCallCost=56 is best compromise, allowing some internal helper calls to be inlined.

package main

import (
	"fmt"
	"io"
)

type Rh struct {
	R
}

type R struct {
	cursor int
	avail  int
	bytes  bool
	buffer []byte
}

func main() {
	var r Rh
	r.buffer = make([]byte, 64)
	for i := range r.buffer {
		r.buffer[i] = 'A'
	}
	fmt.Printf("Rh.readn  5: %s\n", r.readn(5))
	fmt.Printf("Rh.readn 17: %s\n", r.readn(17))
	fmt.Printf("Rh.readn 96: %s\n", r.readn(96))

	fmt.Printf("Rh.writen:  %d\n", r.writen([]byte("hello")))
	fmt.Printf("Rh.writen2: %d\n", r.writen2('h', 'e'))
	fmt.Printf("Rh.writen22: %d\n", r.writen22('h', 'e'))
	fmt.Printf("R.writen2: %d\n", r.R.writen2('h', 'e'))
}

//go:noinline
func (r *R) fill() { // not inlineable
	// in reality, this reaches out to the network to fill the buffer
	r.avail = len(r.buffer)
	r.cursor = 0
}

//go:noinline
func (r *R) doWriten2(b1, b2 byte) {
}

// inlineable method - to see how it affects inlining cost
func (r *R) doReadn(n int) []byte { // inlineable // cost=38
	if r.avail == 0 { // cost=5
		panic(io.EOF) // cost=3
	}
	if n > r.avail { // cost=5
		panic(io.ErrUnexpectedEOF) //cost=3
	}
	r.avail -= n                           // cost=4
	r.cursor += n                          // cost=4
	return r.buffer[r.cursor-n : r.cursor] // slicing cost=9, return = ?
}

func (r *R) readn(n int) []byte { // cost=107
	if n > r.avail { // cost=5
		r.fill() // cost=63
	}
	return r.doReadn(n) // cost=39
}

// simulate accessing methods/fields of struct
func (r *R) writen2(b1, b2 byte) int { // cost=80
	if r.bytes { // cost=2
		r.buffer = append(r.buffer, b1, b2) // cost=8
	} else {
		r.doWriten2(b1, b2) // cost = 65 (call=60 + 2 args + ???)
	}
	return len(r.buffer) // cost=5 (return cost=1, len cost=4)
}

// simulate accessing methods/fields of embedded member
func (r *Rh) writen(b []byte) int { // cost=83
	if r.bytes { // cost=3
		r.buffer = append(r.buffer, b...) // cost=9
	} else {
		r.fill() // cost = 65
	}
	return len(r.buffer) // cost=6 (return cost=1, len cost=5)
}

// simulate accessing methods/fields of struct members
func (r *Rh) writen2(b1, b2 byte) int { // cost=86
	if r.R.bytes { // cost=3
		r.R.buffer = append(r.R.buffer, b1, b2) // cost=10
	} else {
		r.R.doWriten2(b1, b2) // cost = 67 (call=60 + 2 args + ???)
	}
	return len(r.R.buffer) // cost=6 (return cost=1, len cost=5)
}

// simulate accessing methods/fields of struct members
func (r *Rh) writen22(b1, b2 byte) int { // cost=88
	return r.R.writen2(b1, b2)
}

Running

go build -gcflags -m=2 inlining.go 2>&1 | grep "cannot inline" | grep -v "go:noinline"

We get

./inlining.go:76:6: cannot inline (*R).readn: function too complex: cost 107 exceeds budget 80
./inlining.go:94:6: cannot inline (*Rh).writen: function too complex: cost 83 exceeds budget 80
./inlining.go:104:6: cannot inline (*Rh).writen2: function too complex: cost 86 exceeds budget 80
./inlining.go:114:6: cannot inline (*Rh).writen22: function too complex: cost 88 exceeds budget 80
./inlining.go:36:6: cannot inline main: unhandled op RANGE

With inlineExtraCallCost=56, writen will be inlined. This allows us to do something similar in the code, i.e. inline the fast path while ensuring the slow path is not inlined, keeping the cost under 80 so the whole thing is inlined. The append(...) and b[n] accesses then happen without a function call in this fast path.

Currently, in github.com/ugorji/go/codec, in my critical path, I get:

./encode.go:999:6: cannot inline (*encWriterSwitch).writen1: function too complex: cost 81 exceeds budget 80
./encode.go:985:6: cannot inline (*encWriterSwitch).writeb: function too complex: cost 81 exceeds budget 80
./encode.go:1006:6: cannot inline (*encWriterSwitch).writen2: function too complex: cost 84 exceeds budget 80
./encode.go:992:6: cannot inline (*encWriterSwitch).writestr: function too complex: cost 81 exceeds budget 80

This is so, so close, and cc=56 would allow all these functions to be inlined.

It would be nice if we can validate that cc=55 or cc=56 is fair and possibly get it in for Go 1.12.

Thanks much!

I have a non-idiomatic idea, but could we just expose inlineExtraCallCost as a GOINLINEEXPERIMENT environment variable (like the vendoring experiment) and give it to those who need it? 😃

Thanks for doing this, I will run a bunch of benchmarks over the weekend to see how it generalizes.
It would be really interesting to know what happened between 55 and 54.

How do you feel about 57?

I ran my pile of selected benchmarks from GitHub over the weekend; "stuff happens" for a couple of them at 56 but not at 57. There seems to be minor improvement at 57 over 60, though most changes are indistinguishable from noise. Making sense of why things sometimes get notably worse would be interesting.

@ugorji, how much faster does your code run with the lower call cost for inlining?

Summary of binary sizes, compile times, and benchmark runs

Thanks @dr2chase I will run my code tomorrow with 57 and report on my findings.

Also, is it possible to share your summaries outside Google, or at least share them with me directly so I can view them - email is ugorji @ gmail dot com .

Thanks.

@dr2chase

Ran my code with inlineExtraCallCost=80, 60, and 57, captured my benchmark runtimes, and did some analysis:

# running with cc=80, and checking for k=60, k=57
(compared to cc=80)    cc = 60: bytes: -3.107%, io-static-buf: -2.470%, io-dynamic-buf: -1.545%
(compared to cc=80)    cc = 57: bytes: -8.395%, io-static-buf: -7.246%, io-dynamic-buf: -6.676%

# running with cc=60, and checking for k=57
(compared to cc=60)    cc = 57: bytes: -5.457%, io-static-buf: -4.896%, io-dynamic-buf: -5.211%

In plain English: with cc=57, my use-case runs about 8.4% faster compared to cc=80, and about 5.5% faster compared to cc=60, for the common case where folks just want to encode into a []byte (rather than an io.Writer). This is a significant performance improvement in my use-case, and encourages folks not to take the codecgen path (which Kubernetes did previously before they moved to another library, and etcd still does).

Thanks so much for taking the time to investigate.

The simple script I used is below:

declare -a zb zi zf
# runtimes for cc=80, 60 and 57 below
zb[80]=3786935
zi[80]=4333331
zf[80]=4195797
zb[60]=3669257
zi[60]=4226259
zf[60]=4130957
zb[57]=3469019
zi[57]=4019335
zf[57]=3915682

cc=80
for k in 60 57
do
  b=$(bc -l <<< "scale=3;(${zb[${k}]}-${zb[${cc}]})*100/${zb[${cc}]}")
  i=$(bc -l <<< "scale=3;(${zi[${k}]}-${zi[${cc}]})*100/${zi[${cc}]}")
  f=$(bc -l <<< "scale=3;(${zf[${k}]}-${zf[${cc}]})*100/${zf[${cc}]}")
  echo "(compared to cc=${cc})    cc = $k: bytes: ${b}%, io-static-buf: ${i}%, io-dynamic-buf: ${f}%"
done

It is supposed to be possible to share that, but I managed not to.
I'm not sure how I did it in the past.
Here is a PDF:
Fine tuning inline call cost parameter.pdf

Thanks @dr2chase for the extremely detailed analysis you did.

Looking forward to the CL.

@dr2chase @khr

Any chance we get cc=57 in by beta?

Change https://golang.org/cl/151977 mentions this issue: cmd/compile: decrease inlining call cost from 60 to 57

Change https://golang.org/cl/156362 mentions this issue: sync: make Once.Do mid-stack inlineable

I'm going to declare this issue done.
We can always work more on the heuristics, and inlining loops and such, but the basic mechanism is done.

Change https://golang.org/cl/174839 mentions this issue: cmd/compile: remove outdated TODO in inl.go

Change https://golang.org/cl/195818 mentions this issue: runtime: remove unneeded noinline directives