runtime: remove assumptions on Android Bionic's TLS layout
rprichard opened this issue · 56 comments
What version of Go are you using (go version
)?
$ go version go version devel +67164c3014 Fri Jan 4 20:25:45 2019 -0800 linux/amd64
Does this issue reproduce with the latest release?
yes
What operating system and processor architecture are you using (go env
)?
Android
go env
Output
$ go env GOARCH="amd64" GOBIN="" GOCACHE="/usr/local/google/home/rprichard/.cache/go-build" GOEXE="" GOFLAGS="" GOHOSTARCH="amd64" GOHOSTOS="linux" GOOS="linux" GOPATH="/usr/local/google/home/rprichard/go" GOPROXY="" GORACE="" GOROOT="/x/golang/go" GOTMPDIR="" GOTOOLDIR="/x/golang/go/pkg/tool/linux_amd64" GCCGO="gccgo" CC="gcc" CXX="g++" CGO_ENABLED="1" GOMOD="" CGO_CFLAGS="-g -O2" CGO_CPPFLAGS="" CGO_CXXFLAGS="-g -O2" CGO_FFLAGS="-g -O2" CGO_LDFLAGS="-g -O2" PKG_CONFIG="pkg-config" GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build009028863=/tmp/go-build -gno-record-gcc-switches"
Details
I posted about this issue earlier on golang-dev:
https://groups.google.com/forum/#!msg/golang-dev/yVrkFnYrYPE/2G3aFzYqBgAJ
I'm working on adding ELF TLS support to Android's libc (Bionic), and Go is making assumptions about Android's TLS memory layout that a future version of Android might break. Specifically:
-
On Android/{arm,arm64}, Go saves and restores its g register to a pthread key, which it accesses directly using the thread pointer. Go allocates a pthread key and assumes that it can find it at a positive offset, within 384 words after the thread pointer. ELF TLS on arm/arm64 allocates the space after the thread pointer to PT_TLS segments instead, which can be arbitrarily large. For maximum efficiency, Bionic might prefer to allocate the pthread keys at a negative offset from the TP, known at compile-time, but then Go can't find them.
-
On Android/{386,amd64}, Go uses %gs:0xf8/%fs:1d0 for its g register and assumes it can allocate the location by repeatedly calling
pthread_key_create
until it finds a key such thatpthread_setspecific
modifies the desired location. This assumption breaks if Bionic has an even number of TLS slots, because then Go's fixed location is allocated to apthread_key_data_t::seq
field rather than apthread_key_data_t::data
field.
I suspect Go is assuming that its pthread-allocated word is initialized to zero, but that's not generally the case when pthread keys are recycled. Bionic's pthread_getspecific
lazily initializes a key to zero. (It'd be great if I could get confirmation on this point.)
Android may be able to keep Go working, but it would be better if Go had a reliable interface for accessing its TLS memory.
There's a simple way to fix arm/arm64 by adding an API to Bionic that allocates a single word of static TLS memory at a fixed offset. For example, I could add something like this to Bionic:
int pthread_alloc_static_tls_word_np(intptr_t* tpoff, void** tlsbase)
(https://android-review.googlesource.com/c/platform/bionic/+/862083)
gcc_android_{arm,arm64}.c
could then look for the new API with dlopen
and fall back to the current allocation behavior if the API is missing.
386/amd64 is harder to fix. Two possibilities:
- It could use the API above, then use something resembling a TLS IE access in each function prologue. (Something like this: https://godbolt.org/z/5Cfw7S.)
- Switch amd64 to an ARM-like design: pick a GPR to hold g, then save/restore the g register to memory allocated with the above API. Drop support for Android/386.
For the ARM design, I'm wondering to what extent Go could use pthread_getspecific
or dynamic TLS (whether __tls_get_addr
or TLSDESC) to load the g register. Possible downsides:
- Might be noticeably slower.
- Neither system is guaranteed to be async-signal safe. Perhaps Bionic could guarantee AS-safety.
- Dynamic TLS could use a lot of stack if it needs to allocate heap memory. Perhaps Go can guarantee a large stack in situations where the thread's TLS var isn't allocated yet.
I could upload a Go patch demonstrating arm/arm64 use of the API I proposed.
Thank you for working on this. I'll page someone more knowledgable on the runtime than myself: @randall77 @ianlancetaylor.
What are the chances that x86 Bionic could implement TLS relocations as they are implemented on x86 GNU/Linux systems? Because then I think everything would just work.
Reserving a register on amd64 would break all existing assembly code, which means breaking a lot of crypto packages. I think that would be difficult. Though the ABI support in 1.12 might provide a stepping stone toward supporting that.
I don't fully understand your TLS IE example for amd64. It seems to assume that the offset from %fs
can be determined at run time, but as far as I know it must be known at compile time. I think that handling a variable offset from %fs
would require a longer instruction sequence.
In glibc, certain offsets from %fs
are reserved for certain uses. See tcbhead_t
in https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/nptl/tls.h;h=e25430a92826855f62617c5b51c65cb8d194f898;hb=HEAD . For example, that is where gccgo stores its split-stack information. Is there any chance that Bionic could give Go a similar pointer-sized word?
What are the chances that x86 Bionic could implement TLS relocations as they are implemented on x86 GNU/Linux systems? Because then I think everything would just work.
Implementing the relocations is straightforward. The difficulty is that Go always uses static TLS ("IE" accesses), even though an Android app's solib is loaded with dlopen. Using IE in an solib doesn't generally work, because libc needs to allocate a thread's static TLS memory before it starts running, and it can't move/extend that memory when dlopen is called later. It can work if the libc reserves enough surplus static TLS memory. glibc reserves space (TLS_STATIC_SURPLUS
and DTV_SURPLUS
), and at least some BSDs also do (RTLD_STATIC_TLS_EXTRA
). The reserved amounts I've seen vary between about 64 and ~1600 bytes.
The surplus amount needs to be enough for all dlopen'ed solibs. Once it's exhausted, dlopen fails.
There are other problems with surplus static TLS involving (a) initialization and (b) the DF_STATIC_TLS
flag. Maybe we'd limit support to zero-initialized TLS segments and solibs marked with DF_1_NODELETE
.
I don't fully understand your TLS IE example for amd64. It seems to assume that the offset from %fs can be determined at run time, but as far as I know it must be known at compile time. I think that handling a variable offset from %fs would require a longer instruction sequence.
On GNU/Linux, Go uses a TLS LE access in an executable, and each access is a single instruction, e.g.:
402e10: 64 48 8b 0c 25 f8 ff mov %fs:0xfffffffffffffff8,%rcx
The offset is known at build-time and encoded into the instruction. Typically the static linker would encode the offset, but I think Go's compiler knows that the runtime.tlsg symbol is at offset -8 and can encode it earlier.
In a GNU/Linux shared object (-buildmode=c-shared
), Go instead uses TLS IE, where the offset is known at load-time, e.g.:
80250: 48 8b 0d 79 9d 39 00 mov 0x399d79(%rip),%rcx # 419fd0 <.got>
80257: 64 48 8b 09 mov %fs:(%rcx),%rcx
...
0000000000419fd0 0000000000000012 R_X86_64_TPOFF64 0
The dynamic loader computes the TPOFF value for the solib's TLS segment and writes it into the GOT.
My proposal is to use something resembling TLS IE, but instead of storing the TPOFF in the GOT at load-time, it's computed at run-time and stored in an ordinary STB_LOCAL symbol.
In glibc, certain offsets from %fs are reserved for certain uses. ... Is there any chance that Bionic could give Go a similar pointer-sized word?
We could allocate a TLS slot that Go could use, but I think we're hesitant to give one to a particular language runtime if we can avoid it.
If we did allocate a slot today, Go would only be able to use it on new versions of Android. That's problematic for Android/x86 because it uses TLS LE in the app solib, e.g.:
9ca20: 64 48 8b 0c 25 d0 01 mov %fs:0x1d0,%rcx
The situation for Android/ARM is better. For ARM, Go is already computing a TP-relative offset at run-time.
My proposal is to use something resembling TLS IE, but instead of storing the TPOFF in the GOT at load-time, it's computed at run-time and stored in an ordinary STB_LOCAL symbol.
I think that I have now paged enough memory to say that this seems reasonable to me. Thanks.
Ok, I think Bionic can add an API like the one I've proposed[1], and then I can upload a patch for Go to use the API on arm/arm64. Fixing x86 is more involved, though -- maybe someone on the Go team could look at that?
I'll work on adding the Bionic API.
[1] https://android-review.googlesource.com/c/platform/bionic/+/862083
Bionic could reserve a TLS slot for use by apps: https://android-review.googlesource.com/c/platform/bionic/+/883872.
What's the status of this? The feature freeze is about a month away and it would be a shame if Android Q didn't work with Go 1.13.
@rprichard I see that your pthread_alloc_static_tls_word_np CL is not merged yet (https://android-review.googlesource.com/c/platform/bionic/+/862083). The TLS_SLOT_APP CL is (https://android-review.googlesource.com/c/platform/bionic/+/883872), but since that slot is now "Available for use by apps" and not just for Go, I'm not sure it's a good idea to depend on that.
I suspect it's too late to get the pthread_alloc_static_tls_word_np
API into Q.
FWIW, so far, I don't know of any other third-party software that's trying to allocate memory at fixed offsets from the thread pointer. Using TLS_SLOT_APP
on Android is not that different from the fixed offsets Go is using for g on Windows and Darwin for x86:
- Darwin: gs+0x18 / gs+0x30: reserved for Go and Wine (#23617)
- Windows: fs+0x14 / gs+0x28: Wikipedia describes this as "Arbitrary data slot" (https://en.wikipedia.org/wiki/Win32_Thread_Information_Block, https://codereview.appspot.com/5519054)
The "underalignment" errors on Q are related to this issue, but I think they can be fixed independently of the "how does Go access g" issue. AFAIK, the errors could be fixed by omitting the runtime.tlsg ELF symbol and the PT_TLS segment on targets that don't use them. AFAIK, those errors only happen when a Go binary is exec'ed (which I thought would be rare?).
Ok, thank you Ryan.
@ianlancetaylor @cherrymui @aclements how do we go about changing 386 and amd64 to not use static offsets on Android? amd64 is using %fs:0x1d0, 386 is using %gs:0xf8, but as far as I understand this issue, we need to use a different offset (0x02?) for Android Q.
The offset is 2 words (0x8 on x86 and 0x10 on x86-64).
Of course, silly me. Does the same apply for arm and arm64?
Yes, TLS_SLOT_APP
is at 0x8 for arm32 and 0x10 for arm64.
So which byte has to be patched to get syncthing saved from dying out on the android q emulator with x86?
Change https://golang.org/cl/169618 mentions this issue: cmd/link/internal/ld: skip TLS section on Android
@Catfriend1 please try CL 169618.
The reserved TLS_SLOT_APP slot will work only for Q, not for P or earlier. The idea suggested by @ianlancetaylor was
change the code sequence to always load the offset from a memory location, as opposed to the constant that we use today. Then on earlier versions of Android we could store the constant in that memory location. That is, always generate the same code at compile time, and check at run time to decide how to handle it.
But I guess it will be a bigger change and no one has time to deal with it.
I agree with @ianlancetaylor and @hyangah that this is probably the way forward.
Who will fill in the constants? When? How do we check which Android version we're running on?
Filling in the constants could be done in runtime/cgo/gcc_android_amd64.c just as it is done today for arm and arm64 in similar files. Detecting Android Q could be done by checking for the android_get_device_api_level API introduced in Q (we don't need to actually invoke android_get_device_api_level to know we're on Q).
I can do the above.
What I'm not so sure about is teaching the compiler and linker to fetch the TLS offset from a variable just as arm and arm64 do. Any pointers for that would be appreciated.
@eliasnaur I've tried checking out go1.12.1 from github and cherry-pick the commit 90a3ce0 . After that, I built go from source using GOROOT_BOOTSTRAP, CGO_ENABLED=0 and src/all.bat under Windows. Got a go.exe in the /bin/ directory and copied that over to the directory where normally the syncthingNative build script looks for a prebuilt-go go.exe. Let syncthing built the libsyncthing.so library for all three flavours (arm32, arm64 and x86), ran the emulator with a freshly generated APK and got the same error as before again:
I don't see the "underaligned" errors anymore. I'm working on additional changes that should help with that crash you see as well.
@rprichard, does the Android Q linker implement TLS the same as Linux? Then perhaps we could use that always, but prefill our magic offsets into the instructions in case a < Q linker will leave them as is.
does the Android Q linker implement TLS the same as Linux?
More or less. Android currently doesn't reserve surplus static TLS memory, though. Without this, an app's TLS segment might be located at a different TP-relative offset for each thread, which would require using dynamic TLS (e.g. __tls_get_addr()
) to find the TLS symbol rather than the TLS IE-like method Go sometimes uses. Static surplus memory seems somewhat common among Linux and BSD linkers, but the amount of reserved memory varies widely. It's not completely ubiquitous -- e.g. I don't think musl reserves any, so I suspect a Golang solib can't be dlopen'ed into a musl-based process.
Android linkers prior to Q ignore TLS symbols and TLS segments, but they appear to abort on the TLS relocations. It should become possible in Q to find a TLS symbol's address using dlsym
, but that's not currently implemented, and the TP-relative offset would still vary per-thread.
What I'm not so sure about is teaching the compiler and linker to fetch the TLS offset from a variable just as arm and arm64 do. Any pointers for that would be appreciated.
Look at the beginning of cmd/internal/obj/x86/obj6.go, function CanUse1InsnTLS
and progedit
. I think now on Android we'd need to use the two-instruction form, i.e. CanUse1InsnTLS
would return false. I guess in progedit
we'd need to rewrite it to something like
MOVQ sym(SB), AX
MOVQ 0(AX)(TLS*1), CX // load g into CX
with a special symbol sym
, of which runtime/cgo/gcc_android_amd64.c will fill in the content.
In cmd/compile/internal/amd64/ssa.go, the LoweredGetG
case would probably need to do the same change.
In the linker, ctxt.Tlsoffset
would need to change. As we use a variable, I guess it would just be 0.
Thank you @cherrymui, I'll get right on it.
Change https://golang.org/cl/169797 mentions this issue: runtime/cgo: remove threadentry functions specialized for android
Change https://golang.org/cl/169817 mentions this issue: cmd/link/ld,cmd/internal/obj,runtime: make the Android TLS offset dynamic
CL 169797 seems to work on amd64, but fails on 386 because the stack overflow check generates an invalid call to __x86.get_pc_thunk.*
for PIC access to the tls offset global. From objdump -d
of runtime·emptyfunc
for android/386:
0009bd40 <runtime.emptyfunc>:
9bd40: ff 15 1a 93 fa ff call *0xfffa931a
9bd46: 8b 89 92 b0 2e 00 mov 0x2eb092(%ecx),%ecx
9bd4c: 65 8b 09 mov %gs:(%ecx),%ecx
9bd4f: 3b 61 08 cmp 0x8(%ecx),%esp
9bd52: 76 01 jbe 9bd55 <runtime.emptyfunc+0x15>
9bd54: c3 ret
9bd55: e8 86 ea ff ff call 9a7e0 <runtime.morestack_noctxt>
9bd5a: eb e4 jmp 9bd40 <runtime.emptyfunc>
9bd5c: cc int3
9bd5d: cc int3
9bd5e: cc int3
9bd5f: cc int3
The linker is able to output the correct code elsewhere. From an objdump -d
of runtime·rt0_go
:
...
9a50e: e8 6d ab fa ff call 45080 <__x86.get_pc_thunk.bx>
9a513: 8b 9b c5 c8 2e 00 mov 0x2ec8c5(%ebx),%ebx
9a519: 65 8b 1b mov %gs:(%ebx),%ebx
...
Any help is appreciated.
I made 386 work with the following change:
$ git diff
diff --git a/src/cmd/internal/obj/pass.go b/src/cmd/internal/obj/pass.go
index 0c401710f6..aaff181418 100644
--- a/src/cmd/internal/obj/pass.go
+++ b/src/cmd/internal/obj/pass.go
@@ -56,7 +56,7 @@ func checkaddr(ctxt *Link, p *Prog, a *Addr) {
return
case TYPE_BRANCH, TYPE_TEXTSIZE:
- if a.Reg != 0 || a.Index != 0 || a.Scale != 0 || a.Name != 0 {
+ if a.Reg != 0 || a.Index != 0 || a.Scale != 0 /*|| a.Name != 0 */ {
break
}
return
diff --git a/src/cmd/internal/obj/x86/obj6.go b/src/cmd/internal/obj/x86/obj6.go
index ef38f6232c..12762c8e04 100644
--- a/src/cmd/internal/obj/x86/obj6.go
+++ b/src/cmd/internal/obj/x86/obj6.go
@@ -548,7 +548,7 @@ func rewriteToPcrel(ctxt *obj.Link, p *obj.Prog, newprog obj.ProgAlloc) {
q.As = obj.ACALL
thunkname := "__x86.get_pc_thunk." + strings.ToLower(rconv(int(dst)))
q.To.Sym = ctxt.LookupInit(thunkname, func(s *obj.LSym) { s.Set(obj.AttrLocal, true) })
- q.To.Type = obj.TYPE_MEM
+ q.To.Type = obj.TYPE_BRANCH
q.To.Name = obj.NAME_EXTERN
r.As = p.As
r.Scond = p.Scond
There is a comment earlier in obj6.go about the TYPE_MEM/TYPE_BRANCH and checkaddr:
// p.To.Type was set to TYPE_BRANCH above, but that makes checkaddr
// in ../pass.go complain, so set it back to TYPE_MEM here, until p2
// itself gets passed to progedit.
I don't know why checkaddr complains about a valid combination.
@eliasnaur thank you for doing this!
I don't know why checkaddr complains about a valid combination.
I think TYPE_BRANCH is for a intra-function branch, targeting a label or a relative PC (like 3(PC)
), instead of a symbol. TYPE_MEM is for a symbol.
Why changing TYPE_MEM to TYPE_BRANCH fixes this? Is there some place that handles TYPE_BRANCH but not TYPE_MEM (like generating a relocation)? I haven't got a chance to look into your previous comments. I'll take a careful look.
Ok, I think I see what happens. Now load_g_cx (called in preprocess, after checkaddr and progedit) generates an instruction (CALL __x86.get_pc_thunk.cx(SB)
) that also needs progedit. Just running progedit on all load_g_cx-generated instructions seems to work.
diff --git a/src/cmd/internal/obj/x86/obj6.go b/src/cmd/internal/obj/x86/obj6.go
index ef38f6232c..eb0e88b494 100644
--- a/src/cmd/internal/obj/x86/obj6.go
+++ b/src/cmd/internal/obj/x86/obj6.go
@@ -1014,6 +1014,7 @@ func load_g_cx(ctxt *obj.Link, p *obj.Prog, newprog obj.ProgAlloc) *obj.Prog {
progedit(ctxt, p, newprog)
for p.Link != next {
p = p.Link
+ progedit(ctxt, p, newprog)
}
if p.From.Index == REG_TLS {
Thanks, that did it!
@eliasnaur Sounds like you got it. :-) thanks for your work on this! Could you please provide a branch with go1.12.1 and the relevant commits for me to check out and test?
Change https://golang.org/cl/170117 mentions this issue: runtime/cgo: use free TLS slot on Android Q
@Catfriend1 you can use go get golang.org/dl/gotip
to build and run Go build from tip when my CLs are submitted.
Any chance to get this into the next 1.12 dot release?
@gopherbot please open a backport for 1.12
Backport issue(s) opened: #31155 (for 1.12).
Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.
@eliasnaur Good work! I can confirm syncthing is running fine on Android Q, x86 emulator now.
@eliasnaur can you create the backport CL per #29674 (comment) ?
Thanks
Are you sure? The linker changes are quite invasive and Android Q is in beta. Every Android version since 4.4 was released in August or later, so I hope that Go 1.13 will be out just in time for Q final.
@eliasnaur thanks for the context. We'll wait until 1.13.
Thanks for fixing this!
I see that Go is using TLS_SLOT_APP
on Android Q, which is detected by the presence of the android_get_device_api_level
symbol -- something like dlsym(dlopen(NULL, RTLD_LAZY), "android_get_device_api_level")
(https://go-review.googlesource.com/c/go/+/170117). I think this will work. Maybe it's better to load "libc.so"
instead of NULL in case something else in the program provides the symbol somehow?
For Android versions prior to Q, the Android NDK provides a static inline android_get_device_api_level
function in android/api-level.h, starting with NDK r20, but the function is static so dlsym() shouldn't ever find it.
Another option is to inline this function into the Go source. That'd be automatic with NDK r20, but with earlier NDKs, it could do something like:
#include <stdlib.h>
#include <sys/system_properties.h>
int is_Q() {
char value[PROP_VALUE_MAX];
if (__system_property_get("ro.build.version.sdk", value) < 1) {
return 0;
}
return atoi(value) >= 29;
}
Unfortunately, it looks like Q still reports SDK 28 and won't be updated for a while, which would make testing the fix difficult/impossible.
Change https://golang.org/cl/170127 mentions this issue: runtime/cgo: look for android_get_device_api_level in libc.so
Thanks, I created CL 170127 to look in libc.so directly. I'll leave the arguably more convenient NDK version to sometime after it returns the correct value.
Yeah, that change looks good to me.
Change https://golang.org/cl/170955 mentions this issue: runtime,runtime/cgo: set up TLS storage for Android Q without cgo
Which release version has this commit already on-board? go1.12.7?
This should be fixed in the upcoming 1.13 release. It is not fixed in any current release, and there are no plans to backport the various changes.
Thanks for the information. Will wait for 1.13 :)
Still have this issue after upgrade to go 1.13, could any one provide some kindly suggestion. Many thanks.
This issue is closed, so please file a new one with the error message you see, the device or emulator version, and Android version (10 final or a beta?).
FWIW, I upgraded my Pixel 1 to Android 10 yesterday and ran
https://play.google.com/apps/testing/im.scatter.app
with no problems.
Change https://golang.org/cl/200337 mentions this issue: cmd/compile: fix spurious R_TLE_LE reloc on android/386