c2goasm: C to Go Assembly

Introduction

This is a tool to convert assembly as generated by a C/C++ compiler into Golang assembly. It is meant to be used in combination with asm2plan9s in order to automatically generate pure Go wrappers for C/C++ code (that may for instance take advantage of compiler SIMD intrinsics or template<> code).

Mode of operation:

$ c2goasm -a /path/to/some/great/c-code.s /path/to/now/great/golang-code_amd64.s

You can optionally nicely format the code using asmfmt by passing in an -f flag.

This project has been developed as part of developing a Go wrapper around Simd. However it should also work with other projects and libraries. Keep in mind though that it is not intented to 'port' a complete C/C++ project in a single action but rather do it on a case-by-case basis per function/source file (and create accompanying high level Go code to call into the assembly code).

Command line options

$ c2goasm --help
Usage of c2goasm:
  -a	Immediately invoke asm2plan9s
  -c	Compact byte codes
  -f	Format using asmfmt
  -s	Strip comments

A simple example

Here is a simple C function doing an AVX2 intrinsics computation:

void MultiplyAndAdd(float* arg1, float* arg2, float* arg3, float* result) {
    __m256 vec1 = _mm256_load_ps(arg1);
    __m256 vec2 = _mm256_load_ps(arg2);
    __m256 vec3 = _mm256_load_ps(arg3);
    __m256 res  = _mm256_fmadd_ps(vec1, vec2, vec3);
    _mm256_storeu_ps(result, res);
}

Compiling into assembly gives the following

__ZN14MultiplyAndAddEPfS1_S1_S1_: ## @_ZN14MultiplyAndAddEPfS1_S1_S1_
## BB#0:
        push          rbp
        mov           rbp, rsp
        vmovups       ymm0, ymmword ptr [rdi]
        vmovups       ymm1, ymmword ptr [rsi]
        vfmadd213ps   ymm1, ymm0, ymmword ptr [rdx]
        vmovups       ymmword ptr [rcx], ymm1
        pop           rbp
        vzeroupper
        ret

Running c2goasm will generate the following Go assembly (eg. saved in MultiplyAndAdd_amd64.s)

//+build !noasm !appengine
// AUTO-GENERATED BY C2GOASM -- DO NOT EDIT

TEXT ·_MultiplyAndAdd(SB), $0-32

	MOVQ vec1+0(FP), DI
	MOVQ vec2+8(FP), SI
	MOVQ vec3+16(FP), DX
	MOVQ result+24(FP), CX

	LONG $0x0710fcc5             // vmovups    ymm0, yword [rdi]
	LONG $0x0e10fcc5             // vmovups    ymm1, yword [rsi]
	LONG $0xa87de2c4; BYTE $0x0a // vfmadd213ps    ymm1, ymm0, yword [rdx]
	LONG $0x0911fcc5             // vmovups    yword [rcx], ymm1

	VZEROUPPER
	RET

This needs to be accompanied by the following Go code (in MultiplyAndAdd_amd64.go)

//go:noescape
func _MultiplyAndAdd(vec1, vec2, vec3, result unsafe.Pointer)

func MultiplyAndAdd(someObj Object) {

	_MultiplyAndAdd(someObj.GetVec1(), someObj.GetVec2(), someObj.GetVec3(), someObj.GetResult()))
}

And as you may have gathered the amd64.go file needs to be in place in order for the arguments names to be derived (and allow go vet to succeed).

Benchmark against cgo

We have run benchmarks of c2goasm versus cgo for both Go version 1.7.5 and 1.8.1. You can find the c2goasm benchmark test in test/ and the cgo test in cgocmp/ respectively. Here are the results for both versions:

$ benchcmp ../cgocmp/cgo-1.7.5.out c2goasm.out 
benchmark                      old ns/op     new ns/op     delta
BenchmarkMultiplyAndAdd-12     382           10.9          -97.15%

$ benchcmp ../cgocmp/cgo-1.8.1.out c2goasm.out 
benchmark                      old ns/op     new ns/op     delta
BenchmarkMultiplyAndAdd-12     236           10.9          -95.38%

As you can see Golang 1.8 has made a significant improvement (38.2%) over 1.7.5, but it is still about 20x slower than directly calling into assembly code as wrapped by c2goasm.

Converted projects

go-cv-simd (WIP)

Internals

The basic process is to (in the prologue) setup the stack and registers as how the C code expects this to be the case, and upon exiting the subroutine (in the epilogue) to revert back to the golang world and pass a return value back if required. In more details:

Define assembly subroutine with proper golang decoration in terms of needed stack space and overall size of arguments plus return value.
Function arguments are loaded from the golang stack into registers and prior to starting the C code any arguments beyond 6 are stored in C stack space.
Stack space is reserved and setup for the C code. Depending on the C code, the stack pointer maybe aligned on a certain boundary (especially needed for code that takes advantages of SIMD instructions such as AVX etc.).
A constants table is generated (if needed) and any rip-based references are replaced with proper offsets to where Go will put the table.

Limitations

Arguments need (for now) to be 64-bit size, meaning either a value or a pointer (this requirement will be lifted)
Maximum number of 14 arguments (hard limit -- if you hit this maybe you should rethink your api anyway...)
Generally no call statements (thus inline your C code) with a couple of exceptions for functions such as memset and memcpy (see clib_amd64.s)

Generate assembly from C/C++

For eg. projects using cmake, here is how to see a list of assembly targets

$ make help | grep "\.s"

To see the actual command to generate the assembly

$ make -n SimdAvx2BgraToGray.s

Supported golang architectures

For now just the AMD64 architecture is supported. Also ARM64 should work just fine in a similar fashion but support is lacking at the moment.

Compatible compilers

The following compilers have been tested:

clang (Apple LLVM version) on OSX/darwin
clang on linux

Compiler flags:

-masm=intel -mno-red-zone -mstackrealign -mllvm -inline-threshold=1000 -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti

Flag	Explanation
`-masm=intel`	Output Intel syntax for assembly
`-mno-red-zone`	Do not write below stack pointer (avoid red zone)
`-mstackrealign`	Use explicit stack initialization
`-mllvm -inline-threshold=1000`	Higher limit for inlining heuristic (default=255)
`-fno-asynchronous-unwind-tables`	Do not generate unwind tables (for debug purposes)
`-fno-exceptions`	Disable exception handling
`-fno-rtti`	Disable run-time type information

The following flags are only available in clang -cc1 frontend mode (see below):

Flag	Explanation
`-fno-jump-tables`	Do not use jump tables as may be generated for `select` statements

`clang` vs `clang -cc1`

As per the clang FAQ, clang -cc1 is the frontend, and clang is a (mostly GCC compatible) driver for the frontend. To see all options that the driver passes on to the frontend, use -### like this:

$ clang -### -c hello.c
"/usr/lib/llvm/bin/clang" "-cc1" "-triple" "x86_64-pc-linux-gnu" etc. etc. etc.

Command line flags for clang

To see all command line flags use either clang --help or clang --help-hidden for the clang driver or clang -cc1 -help for the frontend.

Further optimization and fine tuning

Using the LLVM optimizer (opt) you can further optimize the code generation. Use opt -help or opt -help-hidden for all available options.

An option can be passed in via clang using the -mllvm <value> option, such as -mllvm -inline-threshold=1000 as discussed above.

Also LLVM allows you to tune specific functions via function attributes like define void @f() alwaysinline norecurse { ... }.

What about GCC support?

For now GCC code will not work out of the box. However there is no reason why GCC should not work fundamentally (PRs are welcome).

Resources

License

c2goasm is released under the Apache License v2.0. You can find the complete text in the file LICENSE.

Contributing

Contributions are welcome, please send PRs for any enhancements.

zeroshade/c2goasm