golang/go

cmd/asm: add neon, vector instructions for arm

Opened this issue · 13 comments

by byron.rakitzis:

go1.2

In contrast to the amd64 port, the arm port of the Go assembler does not recognize SIMD
instructions ("V…") or vector registers (D or Q).

It would be useful for us (we are writing custom speedups for a project using Intel SSE,
and would care to do the same for ARM), but it would also be useful for the Go library
itself if the library functions which have SIMD speedups in xxx_amd64.s had analogous
speedups in xxx_arm.s

Thank you,

Byron Rakitzis.

Comment 1:

Byron,
Could you please list the complete set of instructions you need?

Status changed to WaitingForReply.

Comment 2 by byron.rakitzis:

Certainly,
These include:
load, store, move, table lookup, shift, xor:
VLD1
VST1
VMOV
VSHR
VTBL
VEOR
This is highly selective, of course. We also use the D and Q register set
(I hope that is self-evident).
Byron.

Comment 3:

Marking 1.3Maybe. This is a minor, non-breaking change that could open up some
significant performance improvements.

Labels changed: added repo-main, release-go1.3maybe.

Status changed to Accepted.

rsc commented

Comment 4:

Labels changed: added release-go1.4, removed release-go1.3maybe.

rsc commented

Comment 5:

I'd like to see this happen but it's going to have to wait for the next release.

Labels changed: added release-go1.5, removed release-go1.4.

This will not happen for the 1.5 release.

rsc commented

I agree this would be useful, and I apologize that we haven't had a chance to do it yet. Note that if you really need the instructions you can figure out what the encodings are (for example using the GNU assembler) and then use WORD directives to insert them in your assembly. I know that's less than ideal, but it's a workaround.

Right now there's more we'd like to do than we have bandwidth for, so the reality is that this one is unplanned.
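
As a rough illustration of the WORD workaround described above, the following Go sketch computes the 32-bit word for one of the requested instructions (VEOR on Q registers) and prints a line that could be pasted into an _arm.s file. The helper name encodeVEORQ is hypothetical, and the bit layout is transcribed from the A32 Advanced SIMD "VEOR (register)" encoding; treat it as an assumption and compare the result against GNU assembler output before relying on it.

```go
package main

import "fmt"

// encodeVEORQ (hypothetical helper) returns the A32 encoding of
// VEOR Qd, Qn, Qm for Q registers numbered 0 through 15.
// The field positions are an assumption taken from the ARM manual's
// "VEOR (register)" A1 encoding; verify against a real assembler.
func encodeVEORQ(qd, qn, qm uint32) uint32 {
	// Q register k overlays D registers 2k and 2k+1. The encoding stores
	// the 5-bit D register number as one high bit plus a 4-bit field.
	d, n, m := qd*2, qn*2, qm*2

	const base = 0xF3000110 // fixed opcode bits for VEOR (register)
	const q = 1 << 6        // Q=1 selects the 128-bit (quadword) form

	return base | q |
		(d>>4)<<22 | (d&0xF)<<12 | // D bit and Vd field
		(n>>4)<<7 | (n&0xF)<<16 | // N bit and Vn field
		(m>>4)<<5 | (m & 0xF) // M bit and Vm field
}

func main() {
	// Emit a directive in the form the Go ARM assembler accepts for raw words.
	fmt.Printf("WORD $0x%08x // veor q0, q0, q0\n", encodeVEORQ(0, 0, 0))
}
```

Under these assumptions the program prints `WORD $0xf3000150 // veor q0, q0, q0`. Generating the words from a small helper like this keeps the intended mnemonic next to each raw encoding, which makes the resulting assembly easier to review.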

Any idea how much effort is needed to implement support for NEON? The "quick guide to Go's assembler" says that updating Go's assembler is "straightforward", but I'm looking for some more details. Could someone point me to a PR/diff with a similar implementation that has already been done (for example SIMD for Intel)?
Many thanks

Specialized code also has to feature-detect NEON, so a HasNEON flag needs to be added to internal/cpu (and correspondingly to x/sys/cpu). On Linux the flag is hwcap_NEON = 1 << 12.

https://translate.google.com/translate?sl=ja&tl=en&u=https://future-architect.github.io/articles/20201203/

I benchmarked an M1 against a 10th Gen Core i5 and a Ryzen 9 using https://github.com/SimonWaldherr/golang-benchmarks, and got some interesting results.

  • The M1 is much faster than the Core i5 and Ryzen (it generally took 33% to 50% less time to complete).
  • The CRC32, SHA1, and SHA256 tests took much longer on the native M1 than on the other CPUs and than under Rosetta 2 translation.

I think the native M1 Go implementation doesn't use NEON, whereas Rosetta 2 translates SSE instructions into NEON. I read the hash/crc32 code, and only the amd64.s file uses SIMD instructions. So I suspect this issue is important for improving benchmark results on ARM.

Any news here?
It's 2022, and ARM instances are available across the 3 major clouds...

I second your statement. It's 2023, ARM is getting more popular by the day, and ARM servers are now available on all three major cloud service providers. On top of that, Apple Silicon's performance is absolutely phenomenal; Go with NEON support on Apple Silicon would be amazing.

Rust already supports it, and I think it's high time Go supported it too.