golang/go

runtime: c-shared builds fail with musllibc

Opened this issue · 51 comments

Currently, some of the init_array functions provided by a c-shared build expect to be called with (argc, argv, envp) arguments as is done by glibc, but this isn't specified by the ELF gABI for DT_INIT_ARRAY (http://www.sco.com/developers/gabi/latest/ch5.dynamic.html#init_fini), and isn't done with other libc implementations like musllibc.

CC @ianlancetaylor @mwhudson

musl does expose getauxval() for querying the auxiliary vector but does not provide any way to get the main program's argc/argv. The musl maintainers argue that this isn't something a shared library should have access to anyway.

Somewhat relatedly, C99 says [section 5.1.2.2.1]:

The parameters argc and argv and the strings pointed to by the argv array shall be modifiable by the program, and retain their last-stored values between program startup and program termination.

So in the shared library cases where a C main function runs, the Go runtime should conservatively assume the C program may legitimately mutate the argv strings. In particular, it can't use gostringnocopy on them like it currently does in goargs.
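To illustrate the aliasing hazard, here is a minimal pure-Go sketch. It is not the runtime's code: `aliasString` and `copyString` are hypothetical names, a Go byte slice stands in for an argv string owned by the C program, and `unsafe.String` (Go 1.20 or later) stands in for the no-copy conversion that gostringnocopy performs.

```go
package main

import (
	"fmt"
	"unsafe"
)

// aliasString returns a string that shares memory with buf, similar in
// spirit to a no-copy conversion. buf must be non-empty. (Hypothetical helper.)
func aliasString(buf []byte) string {
	return unsafe.String(&buf[0], len(buf))
}

// copyString returns a string backed by its own private copy of buf.
// (Hypothetical helper.)
func copyString(buf []byte) string {
	return string(buf)
}

func main() {
	// Stand-in for an argv string that the C program owns.
	arg := []byte("prog")

	aliased := aliasString(arg)
	copied := copyString(arg)

	// The C side overwrites its argv string, which C99 permits.
	arg[0] = 'X'

	fmt.Println("aliased:", aliased) // follows the mutation
	fmt.Println("copied: ", copied)  // keeps the original bytes
}
```

The aliased string changes underneath Go code that assumes strings are immutable, which is why copying is the conservative choice whenever C code may still write to the buffer.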

So I think in the cases where we need to play nicely with arbitrary C code:

  • syscall's {Clear,Get,Put,Set}env and Environ functions should all treat C as the source of truth for the environment, rather than keeping a local copy of the environment to manipulate and just trying to copy mutations to C. (E.g., even today, if cgo code calls setenv, it won't be visible to os.Getenv; whereas os.Setenv is visible to getenv.)
  • package runtime should use getauxval for accessing the aux value array, to avoid needing to locate the array in memory.
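As a rough illustration of reading aux values without locating the array relative to argv/envp, the sketch below parses /proc/self/auxv on Linux. This is a swapped-in technique, not the getauxval call proposed above (calling getauxval from Go would need cgo), and it assumes a 64-bit little-endian target. `parseAuxv` and `atPagesz` are hypothetical names.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

// atPagesz is the AT_PAGESZ key from <elf.h>.
const atPagesz = 6

// parseAuxv decodes raw auxiliary-vector bytes into a key/value map.
// Assumption: 64-bit little-endian target, so each entry is two uint64 words.
func parseAuxv(raw []byte) map[uint64]uint64 {
	out := make(map[uint64]uint64)
	for len(raw) >= 16 {
		key := binary.LittleEndian.Uint64(raw[0:8])
		val := binary.LittleEndian.Uint64(raw[8:16])
		if key == 0 { // AT_NULL terminates the vector
			break
		}
		out[key] = val
		raw = raw[16:]
	}
	return out
}

func main() {
	raw, err := os.ReadFile("/proc/self/auxv")
	if err != nil {
		fmt.Println("auxv not readable:", err)
		return
	}
	if pagesz, ok := parseAuxv(raw)[atPagesz]; ok {
		fmt.Println("AT_PAGESZ =", pagesz)
	}
}
```

Either approach avoids depending on what the dynamic linker passes to the init functions.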

That would just leave needing to figure out a solution for os.Args. It would kinda suck, but maybe we could just leave it nil in the case where Go isn't acting as the program's entry point?

If we can't get os.Args, then we have to leave it nil. But I think we should only do that for environments where we can't get it. When we can get it, as for glibc, we should.

I just looked at glibc, Bionic, uClibc, musl, and dietlibc, as well as the dynamic linkers for FreeBSD, NetBSD, and OpenBSD. It looks like only glibc passes (argc, argv, envp) to the DSO init functions.

If we want to detect glibc at build time, it seems like we can include <limits.h> and test for defined(__GLIBC__) && !defined(__UCLIBC__). (The User-Agent saga continues as uClibc claims to be glibc.)

To detect at runtime, we could use a weak reference to a glibc-only symbol like gnu_get_libc_version and see if it resolves. I'm worried about making sure we pick a symbol that won't later be implemented by other C libraries though. (E.g., I found a thread where someone suggested adding a gnu_get_libc_version function to musl to make it compatible with Nvidia's binary drivers.)

Any other suggestions/ideas for detecting glibc?

For what it's worth, I tracked down that glibc started passing (argc, argv, envp) to the DSO init functions in 1996: https://sourceware.org/git/?p=glibc.git;a=blobdiff;f=elf/dl-open.c;h=76f6329762308de4ba1620c50ff32d2c02359766;hp=40b52247253cf045498761342afd09ba3c7e1187;hb=dcf0671d905200c449f92ead6cf43c184637a0d5;hpb=4884d0f03c5a3b3d2459655e76fa2d0684d389dc

So at least we don't need to worry about glibc version detection.

I just ran into this issue while trying to help my partner debug a crash when using a golang plugin they'd written for fluentbit (https://github.com/fluent/fluent-bit).

When running on Alpine Linux, which uses musl, I get a segfault inside of runtime.sysargs due to a bad argv pointer. The cause is as mentioned above: when the plugin is dlopened, musl does not pass any arguments to the members of DT_INIT_ARRAY. The platform-specific init function (_rt0_amd64_linux_lib in my case) assumes it's being passed a valid argc / argv, and eventually segfaults because they are not valid.

After finding this issue, it seems that the way forward is to:

  • update each platform-specific init file to contain an empty argument vector, and update _rt0_*_lib_argv to point at this empty vector
  • update each platform-specific init function to only overwrite the default argv pointer with the incoming arguments if glibc is detected

Can anyone comment if testing for glibc would still be desired or not?

The same problem appears when using c-archive build mode with musl.

If there is a reliable way to test for glibc, then I think it would be perfectly reasonable to do that. @mdempsky's comment above suggests a way.

Unassigning because I'm not planning to work on this, but still happy to review if anyone has suggestions.

@mdempsky I started work on the x64 / x86 / arm / arm64 versions the other night, but it was becoming quite time consuming to test. I have docker images running qemu, which are bootstrapped with a cross-compiled toolchain from my host using bootstrap.sh, but then recompile the toolchain locally inside of qemu for CGO support. Is there a better / faster way to test across the various targets?

Also, do you have any recommendations for getting access to the above pre-processor defines from the platform-specific assembly files? Is it possible, or would they need to call into some C function to do the work, e.g.:

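#include <limits.h> /* pulls in the libc feature macros, including __GLIBC__ */

/* Copy the incoming arguments out only when built against glibc. */
void override_args(int argc, char **argv, int *argc_out, char ***argv_out) {
#if defined(__GLIBC__) && !defined(__UCLIBC__)
  *argc_out = argc;
  *argv_out = argv;
#endif
}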

@inolen I think calling out to a C function (or Go function using cgo) in runtime/cgo is probably simplest/cleanest. Even if POSIX guaranteed there was a system header that could be safely #include'd into non-C files, cmd/asm doesn't support the full C preprocessor language.

We could potentially have cmd/dist detect glibc and generate cmd/asm-compatible .h files, but then we need to re-run make.bash depending on target libc, which seems unfortunate.

Lastly, sorry, I don't have any good solution to efficiently testing either. That's a contributing factor to why I haven't gotten around to it yet. :/

CL https://golang.org/cl/37868 mentions this issue.

@mdempsky pushed review to https://go-review.googlesource.com/c/37868/

Initially, I tried the approach I mentioned above, but setting up the default arguments / calling into the cgo function in each platform-specific assembler function became a lot of code. After digging around the runtime code more, I found the islibrary and isarchive bools which let me fix this outside of each platform's library init routines.

I'm not sure if the code allocating the default arguments is correct. I read through https://github.com/golang/go/blob/master/src/runtime/HACKING.md and I think using persistentalloc is sane for this case, but perhaps I need to use sysAlloc and set up the appropriate terminate functions to free it back up.

No new tests were added, as the old testcarchive / testcshared tests failed when using musl. However, I did set up a few scripts which ran the golang library tests for various os / arch combinations through docker / binfmt_misc / qemu here:
https://gist.github.com/inolen/499da4e40a866b3f8fa5be3635d78721

I'm wondering if Go-on-musl has the same problem as Go-on-Bionic (#29674) where it wants to allocate a word of static TLS memory, but if the Go code is packaged into an solib and loaded with dlopen, there's no reliable way to do so.

musl appears to ignore DF_STATIC_TLS and allow TPREL/TPOFF relocations to a TLS symbol in a dlopen'ed solib. AFAICT, the relocations are valid only for new threads -- if any existing thread uses the TLS IE relocation to access a runtime.tlsg TLS symbol, it would access unallocated memory.

Program received signal SIGSEGV, Segmentation fault.
[Switching to LWP 400]
runtime.sysargs (argc=0, argv=0x0) at /usr/local/go/src/runtime/os_linux.go:206

arm-hisiv500-linux-uclibcgnueabi-gcc

Since this topic has been open for a while, I was wondering if there is any news about it, or any suggested workarounds?
I'm trying to use a c-shared go library in a docker container based on Alpine Linux. My application is in Java and uses the lib through jnr-ffi. It works on other distributions, but on Alpine Linux it gives me this error:
java.lang.UnsatisfiedLinkError: Error relocating /usr/local/lib/liblicense2go-client.so: : initial-exec TLS resolves to dynamic definition in /usr/local/lib/liblicense2go-client.so
  at jnr.ffi.provider.jffi.NativeLibrary.loadNativeLibraries(NativeLibrary.java:87) ~[jnr-ffi-2.1.9.jar!/:na]
  at jnr.ffi.provider.jffi.NativeLibrary.getNativeLibraries(NativeLibrary.java:70) ~[jnr-ffi-2.1.9.jar!/:na]
  at jnr.ffi.provider.jffi.NativeLibrary.getSymbolAddress(NativeLibrary.java:49) ~[jnr-ffi-2.1.9.jar!/:na]
  at jnr.ffi.provider.jffi.NativeLibrary.findSymbolAddress(NativeLibrary.java:59) ~[jnr-ffi-2.1.9.jar!/:na]
  at jnr.ffi.provider.jffi.AsmLibraryLoader.generateInterfaceImpl(AsmLibraryLoader.java:158) ~[jnr-ffi-2.1.9.jar!/:na]
  at jnr.ffi.provider.jffi.AsmLibraryLoader.loadLibrary(AsmLibraryLoader.java:89) ~[jnr-ffi-2.1.9.jar!/:na]
  at jnr.ffi.provider.jffi.NativeLibraryLoader.loadLibrary(NativeLibraryLoader.java:44) ~[jnr-ffi-2.1.9.jar!/:na]
  at jnr.ffi.LibraryLoader.load(LibraryLoader.java:325) ~[jnr-ffi-2.1.9.jar!/:na]
  at jnr.ffi.LibraryLoader.load(LibraryLoader.java:304) ~[jnr-ffi-2.1.9.jar!/:na]

I also tried to link the c-shared go library to a c++ binary compiled on Alpine Linux, and it gives me a Segmentation Fault.

If I compile the same go code as an executable it runs nicely in the Alpine docker container.
Thanks

The initial comment on this issue explains the problem. Someone will need to fix it in the Go runtime package. I'm not aware of any workarounds.

Your Java link error seems like a different problem, though.

For info: using dlopen in c++ to open the library dynamically, rather than linking against it, gives me the same error as with the Java JNI:
initial-exec TLS resolves to dynamic definition

duzy commented

I get the same segmentation fault too. It happens as long as the C program links to the Go shared library (c-shared mode).

go version go1.14beta1 linux/amd64 and

go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build862575220=/tmp/go-build -gno-record-gcc-switches"

Easy fix: abort processing args if argv is null.

Real fix: Go libraries should have an empty []Argv and the programmer should be required to pass things into the library if they want these to be known. Libraries aren't main() and Go here is depending on a glibc quirk.

MUSL's position here is a bit dogmatic but probably not technically wrong. Shared libraries depending on access to argc/argv is sort of outside the C spec and gets into undefined territory. The only function that receives those is main(), and if main() does not pass them they aren't "known" elsewhere.

MUSL's position here is a bit dogmatic but probably not technically wrong.

It's not dogmatic, this glibc extension isn't part of the standardized ABI or documented anywhere; glibc docs and gcc docs don't mention its existence. No libc other than glibc implements this.

Easy fix: abort processing args if argv is null.

Probably works in most cases, but breaks if any of the registers have garbage in them when the function is called, which afaik is allowed.
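For concreteness, the "easy fix" debated above could be sketched roughly as follows. This is an illustration only, not the runtime's sysargs/goargs code: `argsFrom` and `cString` are hypothetical helpers, and the argv array in `main` is simulated in Go memory. As the reply points out, a nil check cannot catch a non-nil garbage pointer.

```go
package main

import (
	"fmt"
	"unsafe"
)

// cString copies a NUL-terminated C string into a Go string.
// (Hypothetical helper.)
func cString(p *byte) string {
	if p == nil {
		return ""
	}
	n := 0
	for *(*byte)(unsafe.Add(unsafe.Pointer(p), n)) != 0 {
		n++
	}
	return string(unsafe.Slice(p, n)) // string() copies the bytes
}

// argsFrom converts (argc, argv) to a []string, or returns nil when the
// pair is obviously unusable. It cannot detect a non-nil garbage argv.
// (Hypothetical helper.)
func argsFrom(argc int32, argv **byte) []string {
	if argv == nil || argc <= 0 {
		return nil
	}
	ptrs := unsafe.Slice(argv, int(argc))
	args := make([]string, 0, len(ptrs))
	for _, p := range ptrs {
		args = append(args, cString(p))
	}
	return args
}

func main() {
	// Simulated argv; a real one would come from the C startup code.
	a0 := []byte("prog\x00")
	a1 := []byte("-v\x00")
	argv := []*byte{&a0[0], &a1[0]}

	fmt.Println(argsFrom(2, &argv[0]))
	fmt.Println(argsFrom(0, nil) == nil)
}
```

Because the guard only covers the all-zero case, it is a mitigation rather than a fix for libcs that leave arbitrary values in the argument registers.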

For what it's worth, I believe that FreeBSD does this also nowadays: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=249162.

Change https://golang.org/cl/334991 mentions this issue: runtime: add check before using arguments with -buildmode=c-archive and -buildmode=c-shared on non glibc systems such as musl/uclinux

What has been blocking this for 6 years? Is it that no one has time to work on it, or is there simply no good idea how to fix it, because the glibc dependency is too entangled with Go?

Also, is there any workaround to compile a Go shared object that is loadable with dlopen on MUSL systems? I would even make my own Go runtime branch as a last resort, if in the end I am able to build Alpine containers that are able to load my c-shared Go library as a plugin.

For what it's worth, compiled with -buildmode=c-shared I get:

sc_dlopen failed: Error relocating ./foobar.so: (null): initial-exec TLS resolves to dynamic definition in ./foobar.so

with

$ go version
go version go1.17.5 linux/amd64
$ cat /etc/alpine-release 
3.15.0
$ apk version musl
Installed:                                Available:
musl-1.2.2-r7                           = 1.2.2-r7 

@ansiwen Most of the discussion on this issue is about whether a shared library can get access to argc/argv. As far as I know that is impossible when using MUSL and on many other non-glibc systems. I don't know of any good workaround there, although it would be best if we simply leave os.Args as a nil slice in that case. Perhaps that already happens, I don't know. I see that there is a change that is meant to address that, but it is pending review (https://golang.org/cl/334991).

The TLS problem you mention looks different. I don't know what is going on with that. It most likely requires a fix in the Go linker.

Is it that no-one has time to work on it, or is there simply no good idea how to fix it, because the glibc dependency is too entangled with Go?

I would say it's more that the core Go team would look to people using MUSL to fix a problem like this. Go is an open source project and in general the core Go team puts relatively little time into non-first-class ports (the first class ports are listed at https://golang.org/wiki/PortingPolicy).

Thanks again for your answer @ianlancetaylor, you're remarkably reactive. 👍

The TLS problem you mention looks different. I don't know what is going on with that. It most likely requires a fix in the Go linker.

Ok, I thought these issues were related, since I saw a few mentions of TLS in this discussion. Do you suggest opening a separate issue?

Is it that no-one has time to work on it, or is there simply no good idea how to fix it, because the glibc dependency is too entangled with Go?

I would say it's more that the core Go team would look to people using MUSL to fix a problem like this. Go is an open source project and in general the core Go team puts relatively little time into non-first-class ports (the first class ports are listed at https://golang.org/wiki/PortingPolicy).

The porting policy doesn't mention any specific libc implementations for linux/*, so it's not clear from it that non-glibc Linux systems are non-first-class ports.

I would love to help out, but unfortunately I have no idea what the problem here is. My experience is limited to building Alpine-based containers.

It seems like MUSL used to crash when dlopen was used to load a shared object that contained initial-exec references to dynamic TLS. This got fixed in a way that it now bails out with a proper error: http://git.musl-libc.org/cgit/musl/commit/?id=5c2f46a214fceeee3c3e41700c51415e0a4f1acd

So, as far as I understand, dlopen is only compatible with initial-exec if static TLS is used. Would it be feasible to implement a Go buildmode like "c-plugin" that uses static TLS? As I said, I have barely any knowledge of that topic and feel like a blind person trying to describe colors. 😅

I do think that the TLS problem would be better handled in a separate issue.

The porting policy doesn't mention any specific libc implementations for linux/*, so it's not clear from it that non-glibc Linux systems are non-first-class ports.

Good point, I added a sentence to the page.

It seems like MUSL used to crash when dlopen was used to load a shared object that contained initial-exec references to dynamic TLS.

I don't understand why the shared object contains initial-exec TLS uses. I think that would be the thing to fix. That would benefit all cases.

fenos commented

I also came across the "initial-exec TLS resolves to dynamic definition" error when loading a c-archive library on Alpine.
Is there a way to work around it?

@fenos

I also came across the "initial-exec TLS resolves to dynamic definition" error when loading a c-archive library on Alpine. Is there a way to work around it?

I'm not aware of any, but I am also still interested in it. I reverted to using Debian images instead of Alpine, which is unfortunate.

Sfinx commented

Heh, this 7-year-old Go bug still strikes: a Go plugin for fluent-bit just does not start:

bash-5.1# ./bin/fluent-bit -e plugins/out_grafana_loki.so 
[2022/07/18 21:09:47] [ info] [config] changing coro_stack_size from 3072 to 4096 bytes
[proxy] error opening plugin plugins/out_grafana_loki.so: 'Error relocating plugins/out_grafana_loki.so: (null): initial-exec TLS resolves to dynamic definition in plugins/out_grafana_loki.so'
[2022/07/18 21:09:47] [error] [plugin] error loading proxy plugin: plugins/out_grafana_loki.so

I've created #54805 to track the "initial-exec TLS resolves to dynamic definition" issue.

Hi, I just created a patch for gcc-go that avoids this problem by skipping goargs() and goenvs() when built as c-[shared|archive]. Essentially:

diff --git a/libgo/go/runtime/proc.go b/libgo/go/runtime/proc.go
index 881793b..52534ba 100644
--- a/libgo/go/runtime/proc.go
+++ b/libgo/go/runtime/proc.go
@@ -692,9 +692,11 @@ func schedinit() {
 		throw("sched.timeToRun not aligned to 8 bytes")
 	}
 
-	goargs()
-	goenvs()
-	parsedebugvars()
+	if !isarchive && !islibrary {
+		goargs()
+		goenvs()
+		parsedebugvars()
+	}
 	gcinit()

It works properly with libs that don't rely on or need this data, like https://github.com/hoehermann/purple-gowhatsapp/, but it will probably affect others that use it in their logic.

With the aim of better integrating musl and Go, would you be willing to change this glibc-ism and enforce that, when building as c-archive or c-shared, args and env are not accessible? There are plenty of ways to pass the needed info.

If yes I can submit a PR, currently it's based on gcc-go so probably it needs some changes.

@donob4n That approach means that os.Args won't work in Go code built with -buildmode=c-archive or c-shared. While the current situation is not ideal, that situation is also not ideal. And, as you say, it may break existing working programs. It doesn't seem better overall than the current situation.

While the current situation is not ideal

The current situation is that it crashes dereferencing a pointer that came from interpreting garbage on the stack or in a register as a pointer, so anything that doesn't crash from that seems like an improvement.

If there are libraries written in Go that are trying to interpret the main program's initial arguments, or other random data left there after the main program overwrote that storage, this is surely a bug that needs to be identified and fixed. It's very intentional that musl does not provide these arguments to ctors because (1) it's nonstandard functionality with no means to detect its presence, meaning the only way you can use it is by writing nonportable code that commits UB when the functionality isn't available, and (2) it's functionality whose only purpose is to write library-unsafe library code that peeks (or worse, pokes) at data that doesn't belong to it.

From my perspective, the current situation is that these libraries work as expected when using glibc, which is the majority of Linux systems. I respect that musl has adopted a different approach, and I certainly think we should support that if we can figure out how. But while I don't know of any good answer, I don't think the approach of breaking existing code that works when using glibc is the best one.

If the problem is breaking working code for glibc, we could skip goargs() only when a non-glibc libc (or at least musl) is detected.

This way the affected apps will only fail when someone tries to build them with a non-glibc libc, and hopefully they will find the problem and report it upstream.

Since it needs some hooks in sysargs() and other funcs, it could also write a debug warning like "Trying to read args but built as c-xxxx".

Is there a reasonable way for a Go archive to know whether it is being run on a glibc or a musl system?

Do you mean at runtime? There is a __GLIBC__ macro that can be used when building.

The runtime is not written in C. Building a pure Go program must not require a C compiler.

I don't understand the question. If you're building a Go program which is pure Go interfacing with the underlying target's syscall layer only, with no C libs, and providing its own entry point, then of course you can do whatever you want.

My understanding is that we're talking about build modes where code is being built to be linked with C library code, or as a library that's loadable into C programs. (Here by "C" I'm being a little bit sloppy, but the meaning is programs built on the underlying host C implementation and that need to be compatible with it.)

In that case, the only place the arguments are available is via a local pointer object belonging to main, and only the current environment (via extern char **environ or getenv etc.), not the initial environment with aux vector attached, is available to inspect. So Go code running in a context where it was not in a position to provide main and save the args, and where no contract was made with the calling code to pass along args, simply does not have access to the args. If glibc and some other environments provide a way to get this (but note: they don't actually document what they provide, and since these objects "belong to" main, it's possible that a glibc-linked program will already have clobbered them by the time the go library code tries to inspect them) then I guess you can access that conditional on knowing you're on an implementation that provides that. musl specifically does not do this.

The question is not whether you're running on a musl-based host but whether you're linked to a host libc environment that provides these nonstandard interfaces.

@richfelker You probably know this, but just for clarity for all. We are talking about -buildmode=c-shared. The resulting shared library will be linked with a C program. But the shared library itself may be pure Go code. (Sorry for saying "a pure Go program" above; that was a misleading way to describe the scenario.)

I agree that what really matters is some way for the shared library to determine whether the dynamic linker is passing the argc/argv values to the DT_INIT_ARRAY constructor functions.

It is not my place to suggest musl changes, but given the existing glibc behavior, behavior that some projects other than Go also expect, would it be completely unreasonable for musl to clear the first three arguments passed to DT_INIT_ARRAY functions, so that those functions would at least see zero values rather than garbage?

would it be completely unreasonable for musl to clear the first three arguments passed to DT_INIT_ARRAY functions, so that those functions would at least see zero values rather than garbage?

Yes. Specifically, it would invoke UB. The specification for the ctor functions in the init array is that they take no arguments. If we called them with arguments, that's a call where the signature of the callee mismatches the function type used to call it, and the behavior is undefined. This actually matters if you want to someday support fancy ABIs that can do runtime CFI-like checks. Establishing a norm that we support this usage, even just by passing zeroed args, precludes the possibility to do anything like that.

Given the popularity and benefits of Alpine Linux in containerized applications, the dependency on non-standard / glibc-specific behaviors makes advocating for the use of Golang difficult. If we're to describe these issues in the terms of "first class ports" versus non-first class ports, it might make more sense to rename the ports linux/* to something like linux/glibc/* to indicate that it's only glibc based Linux distros that are first class supported.

The ports page https://go.dev/wiki/PortingPolicy does already say that only glibc ports are considered first class.

I think @jgowdy's subtext was: please reconsider that musl is not first class. It's kind of ironic that the main drivers for container deployments and the popularity of Go (docker, kubernetes, ...) are written in Go, but the majority(?) of their payloads run on a "non-first-class" operating system (Alpine Linux).

Fair enough, sorry for the misunderstanding. This issue is not the place for that discussion, though.

Running into this issue as well when trying to build a Go library that's statically linked. It'd be nice if musl could be considered a first-class platform because Alpine Linux is an extremely common OS for cloud purposes.

Randomly came across this. In 2015 @mdempsky suggested attempting to resolve gnu_get_libc_version (#13492 (comment)), but with the concern that musl might implement this, breaking this detection method. In the intervening 9 years, it appears that musl has not done this (as far as I can tell). Should we just use this method?

Change https://go.dev/cl/610837 mentions this issue: runtime: fix segfault due to missing argv on musl-linux c-archive