c2goasm is a tool that converts assembly generated by a C/C++ compiler into Go assembly. It is meant to be used in combination with asm2plan9s to automatically generate pure Go wrappers for C/C++ code (which may, for instance, take advantage of compiler SIMD intrinsics or template<> code).
Mode of operation:
$ c2goasm -a /path/to/some/great/c-code.s /path/to/now/great/golang-code_amd64.s
You can optionally format the generated code with asmfmt by passing the -f flag.
This project was developed as part of a Go wrapper around Simd. However, it should also work with other projects and libraries. Keep in mind, though, that it is not intended to 'port' a complete C/C++ project in a single action. Conversion is instead done case by case, per function or source file, with accompanying high-level Go code that calls into the assembly code.
$ c2goasm --help
Usage of c2goasm:
-a Immediately invoke asm2plan9s
-c Compact byte codes
-f Format using asmfmt
-s Strip comments
Here is a simple C function doing an AVX2 intrinsics computation:
#include <immintrin.h>

// Computes result = arg1 * arg2 + arg3 on 8 packed floats using a fused multiply-add.
void MultiplyAndAdd(float* arg1, float* arg2, float* arg3, float* result) {
    __m256 vec1 = _mm256_load_ps(arg1);
    __m256 vec2 = _mm256_load_ps(arg2);
    __m256 vec3 = _mm256_load_ps(arg3);
    __m256 res = _mm256_fmadd_ps(vec1, vec2, vec3);
    _mm256_storeu_ps(result, res);
}
Compiling this into assembly gives the following:
__ZN14MultiplyAndAddEPfS1_S1_S1_: ## @_ZN14MultiplyAndAddEPfS1_S1_S1_
## BB#0:
push rbp
mov rbp, rsp
vmovups ymm0, ymmword ptr [rdi]
vmovups ymm1, ymmword ptr [rsi]
vfmadd213ps ymm1, ymm0, ymmword ptr [rdx]
vmovups ymmword ptr [rcx], ymm1
pop rbp
vzeroupper
ret
Running c2goasm will generate the following Go assembly (e.g. saved in MultiplyAndAdd_amd64.s):
//+build !noasm !appengine
// AUTO-GENERATED BY C2GOASM -- DO NOT EDIT
TEXT ·_MultiplyAndAdd(SB), $0-32
MOVQ vec1+0(FP), DI
MOVQ vec2+8(FP), SI
MOVQ vec3+16(FP), DX
MOVQ result+24(FP), CX
LONG $0x0710fcc5 // vmovups ymm0, yword [rdi]
LONG $0x0e10fcc5 // vmovups ymm1, yword [rsi]
LONG $0xa87de2c4; BYTE $0x0a // vfmadd213ps ymm1, ymm0, yword [rdx]
LONG $0x0911fcc5 // vmovups yword [rcx], ymm1
VZEROUPPER
RET
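The LONG and BYTE lines in the listing carry the raw machine code of instructions that are emitted as opcodes, with the original instruction kept as a comment. The sketch below shows how the byte sequences of two instructions from the listing map onto those directives. The function name `encodeDirectives` is hypothetical, and the packing rule (little-endian 4-byte groups first, then WORD and BYTE for the remainder) is an assumption inferred from the listing, not necessarily the exact algorithm asm2plan9s uses.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeDirectives (hypothetical helper) renders machine-code bytes as Go
// assembler data directives. Groups of 4 bytes become LONG, a 2-byte group
// becomes WORD, and a single byte becomes BYTE. Values are read little-endian
// so the assembler emits the bytes in their original order.
func encodeDirectives(code []byte) string {
	out := ""
	for len(code) > 0 {
		if out != "" {
			out += "; "
		}
		switch {
		case len(code) >= 4:
			out += fmt.Sprintf("LONG $0x%08x", binary.LittleEndian.Uint32(code))
			code = code[4:]
		case len(code) >= 2:
			out += fmt.Sprintf("WORD $0x%04x", binary.LittleEndian.Uint16(code))
			code = code[2:]
		default:
			out += fmt.Sprintf("BYTE $0x%02x", code[0])
			code = code[1:]
		}
	}
	return out
}

func main() {
	// vmovups ymm0, yword [rdi]
	fmt.Println(encodeDirectives([]byte{0xc5, 0xfc, 0x10, 0x07}))
	// vfmadd213ps ymm1, ymm0, yword [rdx]
	fmt.Println(encodeDirectives([]byte{0xc4, 0xe2, 0x7d, 0xa8, 0x0a}))
}
```

The two printed lines match the corresponding directives in the generated assembly: `LONG $0x0710fcc5` and `LONG $0xa87de2c4; BYTE $0x0a`.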
The generated assembly needs to be accompanied by the following Go code (in MultiplyAndAdd_amd64.go):
//go:noescape
func _MultiplyAndAdd(vec1, vec2, vec3, result unsafe.Pointer)

func MultiplyAndAdd(someObj Object) {
	_MultiplyAndAdd(someObj.GetVec1(), someObj.GetVec2(), someObj.GetVec3(), someObj.GetResult())
}
As you may have gathered, the amd64.go file needs to be in place for the argument names to be derived (and for go vet to succeed).
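When wiring up a wrapper like this, it can help to have a pure Go reference to compare the assembly output against. The following is a minimal sketch, assuming the routine operates on 8 float32 lanes (one ymm register) and computes arg1*arg2 + arg3 per lane, as the vfmadd213ps instruction in the listing does. The name `multiplyAndAddRef` is hypothetical and not part of c2goasm.

```go
package main

import "fmt"

// multiplyAndAddRef is a hypothetical pure Go reference for the AVX2 routine:
// result[i] = arg1[i]*arg2[i] + arg3[i] for the 8 float32 lanes of one ymm register.
func multiplyAndAddRef(arg1, arg2, arg3, result []float32) {
	for i := 0; i < 8; i++ {
		result[i] = arg1[i]*arg2[i] + arg3[i]
	}
}

func main() {
	a := []float32{1, 2, 3, 4, 5, 6, 7, 8}
	b := []float32{2, 2, 2, 2, 2, 2, 2, 2}
	c := []float32{1, 1, 1, 1, 1, 1, 1, 1}
	out := make([]float32, 8)

	multiplyAndAddRef(a, b, c, out)
	fmt.Println(out) // [3 5 7 9 11 13 15 17]

	// With the generated assembly in place, the equivalent call would pass
	// pointers to the first elements (requires importing "unsafe"):
	//   _MultiplyAndAdd(unsafe.Pointer(&a[0]), unsafe.Pointer(&b[0]),
	//       unsafe.Pointer(&c[0]), unsafe.Pointer(&out[0]))
}
```

A fused multiply-add rounds once rather than twice, so for arbitrary inputs the assembly result may differ from this reference in the last bit. Comparing with a small tolerance is safer than exact equality.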
We have run benchmarks of c2goasm versus cgo for both Go version 1.7.5 and 1.8.1. You can find the c2goasm benchmark test in test/ and the cgo test in cgocmp/ respectively. Here are the results for both versions:
$ benchcmp ../cgocmp/cgo-1.7.5.out c2goasm.out
benchmark old ns/op new ns/op delta
BenchmarkMultiplyAndAdd-12 382 10.9 -97.15%
$ benchcmp ../cgocmp/cgo-1.8.1.out c2goasm.out
benchmark old ns/op new ns/op delta
BenchmarkMultiplyAndAdd-12 236 10.9 -95.38%
As you can see, Go 1.8 made a significant improvement (38.2%) over 1.7.5, but cgo is still about 20x slower than calling directly into assembly code as wrapped by c2goasm.
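The percentages quoted above follow directly from the ns/op columns. As a quick check, this sketch recomputes them, assuming the delta column is the relative change (new - old) / old. The helper name `delta` is hypothetical.

```go
package main

import "fmt"

// delta returns the relative change from oldNs to newNs in percent,
// i.e. the quantity shown in the delta column of the tables above.
func delta(oldNs, newNs float64) float64 {
	return (newNs - oldNs) / oldNs * 100
}

func main() {
	fmt.Printf("cgo 1.7.5 -> c2goasm: %+.2f%%\n", delta(382, 10.9)) // -97.15%
	fmt.Printf("cgo 1.8.1 -> c2goasm: %+.2f%%\n", delta(236, 10.9)) // -95.38%
	fmt.Printf("cgo 1.7.5 -> cgo 1.8.1: %+.2f%%\n", delta(382, 236)) // -38.22%
	fmt.Printf("cgo 1.8.1 / c2goasm: %.1fx\n", 236/10.9)             // 21.7x
}
```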
The basic process is to set up the stack and registers (in the prologue) the way the C code expects them, and, upon exiting the subroutine (in the epilogue), to revert to the Go world and pass a return value back if required. In more detail:
- The assembly subroutine is defined with the proper Go decoration in terms of needed stack space and overall size of arguments plus return value.
- Function arguments are loaded from the Go stack into registers, and, prior to starting the C code, any arguments beyond 6 are stored in C stack space.
- Stack space is reserved and set up for the C code. Depending on the C code, the stack pointer may be aligned on a certain boundary (especially needed for code that takes advantage of SIMD instructions such as AVX etc.).
- A constants table is generated (if needed) and any rip-based references are replaced with proper offsets to where Go will put the table.
- Arguments need (for now) to be 64-bit size, meaning either a value or a pointer (this requirement will be lifted).
- Maximum number of 14 arguments (hard limit -- if you hit this, maybe you should rethink your API anyway...).
- Generally no call statements (thus inline your C code), with a couple of exceptions for functions such as memset and memcpy (see clib_amd64.s).
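The argument handling described above can be sketched as follows. This is an illustration under stated assumptions, not c2goasm's actual implementation: it assumes the System V AMD64 calling convention (first six integer arguments in rdi, rsi, rdx, rcx, r8, r9), 8 bytes per argument, and no return value. The names `layoutArgs` and `argLocation` are hypothetical.

```go
package main

import "fmt"

// Integer argument registers of the System V AMD64 calling convention,
// written in Go assembler naming.
var argRegs = []string{"DI", "SI", "DX", "CX", "R8", "R9"}

const (
	argSize = 8  // every argument is a 64-bit value or pointer
	maxArgs = 14 // hard limit on the number of arguments
)

// argLocation (hypothetical type) records where one argument is read from
// on the Go side and where it ends up for the C code.
type argLocation struct {
	GoOffset int    // offset from the FP pseudo-register
	Target   string // register name, or "stack" for arguments beyond the sixth
}

// layoutArgs returns the location of each of n arguments and the total
// argument size in bytes, as used in the TEXT directive ($frame-argsize).
func layoutArgs(n int) ([]argLocation, int, error) {
	if n < 0 || n > maxArgs {
		return nil, 0, fmt.Errorf("unsupported number of arguments: %d", n)
	}
	locs := make([]argLocation, n)
	for i := range locs {
		target := "stack"
		if i < len(argRegs) {
			target = argRegs[i]
		}
		locs[i] = argLocation{GoOffset: i * argSize, Target: target}
	}
	return locs, n * argSize, nil
}

func main() {
	locs, total, err := layoutArgs(4)
	if err != nil {
		panic(err)
	}
	for i, l := range locs {
		fmt.Printf("MOVQ arg%d+%d(FP), %s\n", i, l.GoOffset, l.Target)
	}
	fmt.Printf("argument size: %d bytes\n", total)
}
```

For four arguments this yields offsets 0, 8, 16 and 24 into DI, SI, DX and CX with a total of 32 bytes, matching the `$0-32` and MOVQ lines of the MultiplyAndAdd example.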
For projects using cmake, for example, here is how to see a list of assembly targets:
$ make help | grep "\.s"
To see the actual command to generate the assembly
$ make -n SimdAvx2BgraToGray.s
For now only the AMD64 architecture is supported. ARM64 should work in a similar fashion, but support is lacking at the moment.
The following compilers have been tested:
- clang (Apple LLVM version) on OSX/darwin
- clang on linux
Compiler flags:
-masm=intel -mno-red-zone -mstackrealign -mllvm -inline-threshold=1000 -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti
Flag | Explanation |
---|---|
-masm=intel | Output Intel syntax for assembly |
-mno-red-zone | Do not write below stack pointer (avoid red zone) |
-mstackrealign | Use explicit stack initialization |
-mllvm -inline-threshold=1000 | Higher limit for inlining heuristic (default=255) |
-fno-asynchronous-unwind-tables | Do not generate unwind tables (for debug purposes) |
-fno-exceptions | Disable exception handling |
-fno-rtti | Disable run-time type information |
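To keep the flag set in one place, for instance when driving the compiler from a Go build script, the command line can be assembled programmatically. The sketch below only builds and prints the argument list. The function name `clangArgs` and the file names are hypothetical, and the leading `-S -O3 -mavx2 -mfma` (emit assembly, optimize, enable AVX2 and FMA) are assumptions for an AVX2 example that are not part of the list above.

```go
package main

import (
	"fmt"
	"strings"
)

// c2goasmFlags are the compiler flags from the table above.
var c2goasmFlags = []string{
	"-masm=intel", "-mno-red-zone", "-mstackrealign",
	"-mllvm", "-inline-threshold=1000",
	"-fno-asynchronous-unwind-tables", "-fno-exceptions", "-fno-rtti",
}

// clangArgs (hypothetical helper) builds the clang argument list for
// emitting assembly from src into out. The first four flags are assumptions
// for an AVX2/FMA target; adjust them to your code.
func clangArgs(src, out string) []string {
	args := []string{"-S", "-O3", "-mavx2", "-mfma"}
	args = append(args, c2goasmFlags...)
	return append(args, src, "-o", out)
}

func main() {
	args := clangArgs("MultiplyAndAdd.cpp", "MultiplyAndAdd.s")
	fmt.Println("clang " + strings.Join(args, " "))
}
```

The resulting slice can be passed to `exec.Command("clang", args...)` from the standard `os/exec` package if the build script should invoke the compiler itself.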
The following flags are only available in clang -cc1 frontend mode (see below):
Flag | Explanation |
---|---|
-fno-jump-tables | Do not use jump tables as may be generated for switch statements |
As per the clang FAQ, clang -cc1 is the frontend, and clang is a (mostly GCC compatible) driver for the frontend. To see all options that the driver passes on to the frontend, use -### like this:
$ clang -### -c hello.c
"/usr/lib/llvm/bin/clang" "-cc1" "-triple" "x86_64-pc-linux-gnu" etc. etc. etc.
To see all command line flags, use either clang --help or clang --help-hidden for the clang driver, or clang -cc1 -help for the frontend.
Using the LLVM optimizer (opt) you can further optimize the code generation. Use opt -help or opt -help-hidden for all available options.
An option can be passed in via clang using the -mllvm <value> option, such as -mllvm -inline-threshold=1000 as discussed above.
LLVM also allows you to tune specific functions via function attributes like define void @f() alwaysinline norecurse { ... }.
For now GCC code will not work out of the box. However there is no reason why GCC should not work fundamentally (PRs are welcome).
c2goasm is released under the Apache License v2.0. You can find the complete text in the file LICENSE.
Contributions are welcome; please send PRs for any enhancements.