golang/go

cmd/compile: make encoding/binary.LittleEndian.* have negligible inlining cost on amd64

Closed this issue · 13 comments

encoding/binary.LittleEndian.* typically compile down to a single instruction on amd64. Their inlining cost is very high. I suggest we add a hack to the inlining budget to consider them to be an intrinsic, for the purposes of cost. Probably the same for other platforms as well.

cc @mdempsky

mvdan commented

Are there any other such std APIs that we should teach the inliner about? For example, I imagine many math/bits APIs also compile to one or two amd64 instructions.

Unsafe conversions (see #42739).

Related issue (#42788)

Most of the math/bits APIs are semantically inlined, so they aren't really in the same bucket.

"semantically inlined" = the compiler treats these APIs specially. It throws away their Go implementation and uses a compiler-provided one. The good part is that it always optimizes (unless you give -N, I think?). The bad part is that if you copy the implementation into your own function, you don't get the optimizations.

encoding/binary is not semantically inlined. They just reduce to good code using standard optimizations. The good part of that is if you copy an encoding/binary body into your own function, the optimization still happens.

Not particularly relevant to this issue, other than to note that math/bits probably doesn't need to be included here.

encoding/binary is not semantically inlined. They just reduce to good code using standard optimizations.

My understanding is that inlining cost is currently based on the function AST. Would it make sense to compute a “compiled inlining cost” and stash that somewhere (maybe in the export data) for more accurate cost estimates? (Is there already a separate issue for that, beyond #17566?)

I don't think there is a separate issue.
Yes, cost estimates based on AST size certainly suck.
We could record with each function the count or bytes of instructions that it compiles to, which would be better.
Within a package this can get tricky because we inline everything before we compile anything, but cross-package that works.

Would it make sense to compute a “compiled inlining cost” and stash that somewhere (maybe in the export data) for more accurate cost estimates?

The long term aspiration is to do inlining after we're in SSA form and have had a chance to apply some initial optimizations. I think that's a ways off still though.

In the mean time, I think some ad-hoc special cases for notable standard library functions seems fine. Or maybe a std-only //go:inlinecost directive that allows use to override the inlining cost.

We could record with each function the count or bytes of instructions that it compiles to, which would be better.

That's an interesting idea.

We could record with each function the count or bytes of instructions that it compiles to, which would be better.

I’ve experimented with this.

It yields fairly drastically different inlining results—appends are suddenly crazy expensive, as is code that can panic, because the cold code paths count too. It’s probably still better, but I didn’t pursue it because of the inconsistency vs intra-package inlining.

dsnet commented

Whatever happens, I think we should avoid special casing just encoding/binary.LittleEndian.*. There are a number of equivalent functions that avoid the encoding/binary package as a dependency that may want to benefit from this as well (e.g., "compress/flate".load64)

@dsnet I hear you but special casing is easy and can happen ~immediately. Recognizing equivalent forms is definitely not easy in anything resembling the current inliner. These routines are used heavily in high performance code because it is one way to do something like vectorization. I'd rather not make the perfect to the enemy of the good here.

dsnet commented

Random thought: Should little-endian and big-endian functionality be moved to math/bits?

With the acceptance of #395, you could imagine an API that operates on fixed-width arrays rather than pointers:

func LoadUint32LE(v [4]byte) uint32
func StoreUint32LE(v uint32) [4]byte

Change https://golang.org/cl/349931 mentions this issue: cmd/compile: make encoding/binary loads/stores cheaper to inline

Change https://go.dev/cl/431015 mentions this issue: cmd/compile: make encoding/binary appends cheaper to inline