JuliaLang/julia

Repeated calls to LibGit2.fetch segfault on FreeBSD 11.1

ararslan opened this issue · 39 comments

I upgraded my FreeBSD box from 11.0 (which is what the FreeBSD CI workers are running) to 11.1, and now I'm consistently getting segfaults in the Pkg tests:

$ JULIA_CPU_CORES=2 JULIA_TEST_MAXRSS_MB=600 ./julia test/runtests.jl pkg
Test (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB)
INFO: Initializing package repository /tmp/T22S8Ijy/v0.7
INFO: Cloning METADATA from https://github.com/JuliaLang/METADATA.jl
INFO: No packages to install, update or remove
INFO: Cloning cache of Example from notarealprotocol://github.com/JuliaLang/Example.jl.git
INFO: Cloning cache of Example from https://github.com/JuliaLang/Example.jl.git
INFO: Installing Example v0.4.1
INFO: Package database updated
INFO: Checking out Example master...
INFO: Pulling Example latest master...
INFO: No packages to install, update or remove
INFO: Freeing Example
INFO: No packages to install, update or remove
INFO: Checking out Example master...
INFO: Pulling Example latest master...
INFO: No packages to install, update or remove
INFO: Freeing Example
INFO: No packages to install, update or remove
INFO: Removing Example v0.4.1
INFO: Package database updated
INFO: Package Example is not installed
INFO: Cloning Example from https://github.com/JuliaLang/Example.jl.git
INFO: Computing changes...
INFO: No packages to install, update or remove
INFO: Package database updated
INFO: Freeing Example
INFO: No packages to install, update or remove
INFO: Checking out Example master...
INFO: Pulling Example latest master...
INFO: No packages to install, update or remove
INFO: Freeing Example
INFO: No packages to install, update or remove
INFO: Cloning Example2 from /tmp/T22S8Ijy/v0.7/Example
INFO: Computing changes...
INFO: No packages to install, update or remove
INFO: Cloning Example3 from /tmp/T22S8Ijy/v0.7/Example
INFO: Computing changes...
INFO: No packages to install, update or remove
INFO: Checking out Example2 test-branch-1...
INFO: Pulling Example2 latest test-branch-1...
INFO: No packages to install, update or remove
INFO: Checking out Example3 test-branch-1...
INFO: Pulling Example3 latest test-branch-1...
INFO: No packages to install, update or remove
INFO: Checking out Example master...
INFO: Pulling Example latest master...
INFO: No packages to install, update or remove
INFO: Cloning Example4 from /tmp/T22S8Ijy/v0.7/Example
INFO: Computing changes...
INFO: No packages to install, update or remove
INFO: Checking out Example4 test-branch-2...
INFO: Pulling Example4 latest test-branch-2...
INFO: No packages to install, update or remove
[1]    2356 segmentation fault (core dumped)  JULIA_CPU_CORES=2 JULIA_TEST_MAXRSS_MB=600 ./julia test/runtests.jl pkg

Version info:

julia> versioninfo()
Julia Version 0.7.0-DEV.1383
Commit d126c66a9e* (2017-08-18 16:00 UTC)
Platform Info:
  OS: FreeBSD (x86_64-unknown-freebsd11.1)
  CPU: Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, sandybridge)
Environment:

cc @iblis17

Running it in GDB, I'm seeing a lot of dwarf errors regarding "wrong version in compilation unit header," as well as

Program received signal SIGSEGV, Segmentation fault.
0x0000000801571c19 in ?? () /usr/local/lib/gcc5/libgcc_s.so.1

at the end. Full log here: https://gist.github.com/ararslan/a9694acc54da13f633edde6aa7230a59

It also left a core file in the package directory where the tests were running. Examining that with GDB, I get

(gdb) core julia.core
Core was generated by `./julia test/runtests.jl pkg'.
Program was terminated with signal 11, Segmentation fault.
#0  0x0000000803e75907 in ?? ()
(gdb) bt
Cannot access memory at address 0x7fffdfbfdd80

Wow, the system GDB is way out of date, no wonder they're going to remove it. Okay, I tried again with the ports GDB. The output is much more informative, but it looks like it's stopping on something else; the test program doesn't get nearly as far before stopping, and it doesn't appear to be hitting a segfault in GDB. The full log is here: https://gist.github.com/ararslan/4347c22e925e98f5fcacda5b438854a9. Could be I'm misusing GDB somehow, so if anything about the session in the log looks fishy let me know and I can try it another way.

Using LLDB rather than GDB looks like it hits the right thing: https://gist.github.com/ararslan/acc5e587400affd93dff5e41f5dc1cda. (System LLDB, 4.0.0 in FreeBSD 11.1.)

Looks like it's this line that's hitting it: https://github.com/JuliaLang/julia/blob/master/test/pkg.jl#L298

@test_warn "INFO: Package Example: skipping update (pinned)..." Pkg.update()

I haven't been able to minimally reduce the issue yet though.

I found a workaround. Can you confirm?

# sysctl security.bsd.stack_guard_page=0

It appears that this option is enabled in FreeBSD 11.1-Release by default.

Interesting, it does appear that stack_guard_page is 0 in 11.0 and 1 in 11.1, though that isn't mentioned in the 11.1 release notes. Disabling the stack guard page allows the Pkg tests to pass without segfaulting.
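
To compare machines, the setting can be read with a small shell sketch like the one below. It assumes only the sysctl name quoted in this thread and that `sysctl -n` is available. On systems that lack this sysctl it reports `unavailable` rather than failing.

```shell
#!/bin/sh
# Report the stack guard page setting, if this system exposes it.
# Assumption: security.bsd.stack_guard_page is the sysctl named in this
# thread; it is FreeBSD-specific and will be missing elsewhere.
stack_guard_status() {
    if value=$(sysctl -n security.bsd.stack_guard_page 2>/dev/null); then
        echo "stack_guard_page=$value"
    else
        echo "stack_guard_page=unavailable"
    fi
}

stack_guard_status

# To apply the workaround from this thread (as root):
#   sysctl security.bsd.stack_guard_page=0
```

Recording this value next to each test run makes it easier to tell whether a pass or a segfault lines up with the guard page being on or off.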

It's not in the release notes, but it is in the wiki: https://wiki.freebsd.org/WhatsNew/FreeBSD11#Security

The stack protector is now set to strong (r288669)

https://svnweb.freebsd.org/base?view=revision&revision=288669

Edit: seems unrelated

--
I found this commit https://svnweb.freebsd.org/base?view=revision&revision=215307

This commit enables it by default: https://svnweb.freebsd.org/base?view=revision&revision=320317

Okay, even better LLDB backtrace: https://gist.github.com/ararslan/62d5dfb03b529a56dce0ab5d239685d8. In particular, it shows libgit2 calls in thread backtrace all starting on line 112.

Edit: Buuuuuuuuut that call is in the wrong thread. :/ Thread 6 in the above gist hits the SIGSEGV.

Similar issue? https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221127#c12

I'm going to give flang a try

I doubt it's an issue with conflicting libgcc_s between Clang and GCC since we have the build system set up on FreeBSD to link everything to GCC's libgcc_s. That's what I implemented in #21788. It must be an issue with the stack guard setting, since the tests pass with it disabled and segfault with it enabled.

On 11.0-RELEASE-p12, test-pkg passed:

root@:~/julia # sysctl security.bsd.stack_guard_page
security.bsd.stack_guard_page: 1
root@:~/julia # uname -a
FreeBSD  11.0-RELEASE-p12 FreeBSD 11.0-RELEASE-p12 #0: Wed Aug  9 10:03:39 UTC 2017     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64

I think it's related to r320317. It changes something besides enabling the stack guard by default; 11.0-RELEASE doesn't include this patch, so test-pkg passes there even with the stack guard enabled.

Excellent catch!

OK, maybe we should create a minimal example and send it to the FreeBSD-CURRENT mailing list.
The problematic part is libgit2 (perhaps plus unwind), right? I'm going to start by building some simple C code.

I'm not sure the problem is libgit2; in the LLDB backtrace that contains libgit2 calls, the libgit2 calls are in a different thread than the one that hits the SIGSEGV. Do you know how to reproduce this minimally? I'm still trying to figure out what in Julia is actually triggering it.

😖
I managed to remove some test cases in test/pkg.jl, but doing so makes the tests pass with 11.1's stack guard.
The conditions for triggering the SIGSEGV seem quite tricky.

Which tests did you have to remove in test/pkg.jl to get the tests passing? Can you post a diff?

Please check this out: https://gist.github.com/iblis17/e2199735aee9673585da8aa48e5d4984

In the original pkg.jl, the SIGSEGV occurs near line 300, IIRC.
When I comment out some test cases as in my gist, test-pkg passes. 😕

For the record, the segfault also happens on -CURRENT with the stack guard enabled.

└─[iblis@abeing]% uname -a
FreeBSD abeing 12.0-CURRENT FreeBSD 12.0-CURRENT #0 r323335: Sun Sep 17 00:56:35 CST 2017     root@abeing:/usr/obj/usr/src/sys/GENERIC  amd64

Another set of reproduction steps:

  1. git clone https://github.com/JuliaLang/julia.git repo
  2. cd repo && git reset --hard 67731c2c07 (It's just HEAD~500) edit: ignore this
  3. cat test.jl
for i = 1:100
    info(i)
    LibGit2.fetch(LibGit2.GitRepo("./repo"))
end
  4. ./julia test.jl
└─[iblis@abeing]% ./julia test.jl
INFO: 1
INFO: 2
INFO: 3
INFO: 4
INFO: 5
INFO: 6
zsh: segmentation fault (core dumped)  ./julia test.jl
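
When repeating the repro above under different settings, it can help to record how each run ended rather than reading the shell's message by eye. The following is a minimal sketch. The helper name `run_and_classify` is hypothetical, and it relies only on the common shell convention that a process killed by signal N is reported as exit status 128+N (SIGSEGV is signal 11, so 139).

```shell
#!/bin/sh
# Run a command and print how it ended: "ok", "exit N", or "signal N".
# Assumption: the shell reports death-by-signal as status 128+N.
run_and_classify() {
    "$@" >/dev/null 2>&1
    status=$?
    if [ "$status" -eq 0 ]; then
        echo "ok"
    elif [ "$status" -gt 128 ]; then
        echo "signal $((status - 128))"
    else
        echo "exit $status"
    fi
}

# Example use with the repro from this thread:
#   run_and_classify ./julia test.jl
```

A segfaulting run would then be reported as `signal 11`, which is easy to log alongside the stack guard setting for each attempt.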

Good sleuthing! If you write the same thing in C using the functions from libgit2, do you also get a segfault? Also, why is the git reset necessary? Is it just so that HEAD doesn't point to the latest commit or is there something specific about 67731c2?

I checked again; the git reset is unnecessary.
I tried invoking the libgit2 API from C: https://gist.github.com/iblis17/bbc621a78fda6ffcbca077fadba8ecdd#file-git2_fetch-c
but could not get a segfault.

I was able to reduce it to a single ccall:

import Base.LibGit2: GitRepo, GitRemote, RemoteCallbacks, CredentialPayload,
                     StrArrayStruct, FetchOptions, get, credentials_cb

repo = GitRepo("./repo")
rmt = get(GitRemote, repo, "origin")
fo = FetchOptions(callbacks=RemoteCallbacks(credentials_cb(), CredentialPayload()))

for i = 1:100
    info(i)
    ccall((:git_remote_fetch, :libgit2), Cint,
          (Ptr{Void}, Ptr{StrArrayStruct}, Ptr{FetchOptions}, Cstring),
          rmt.ptr, C_NULL, Ref(fo), "hi")
end

close(rmt)
close(repo)

For me it consistently faults on the sixth call, as in your example output above.

Is it possible that one of the structs has the wrong specification?

I compared our implementations in base/libgit2/types.jl to the documentation of the corresponding types in libgit2 and they seem to match as far as I can tell, but there may be some subtle difference that I'm missing.

I'm going to call this a Julia bug since I haven't been able to reproduce it with C or Rust's libgit2 bindings.

@omus has noted that this seems only to occur when using multiple cores. That is, setting JULIA_CPU_CORES=1 avoids segfaulting.

Hm, setting JULIA_CPU_CORES=1 does not avoid segfaulting for me. I guess it's only when the actual VM is set to only use one core.
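
Since the environment variable and the number of cores the machine actually has are easy to conflate, a sketch like the following prints both side by side. It assumes `getconf _NPROCESSORS_ONLN` or `sysctl -n hw.ncpu` is available and falls back to `unknown` otherwise. The helper name `report_cores` is hypothetical.

```shell
#!/bin/sh
# Print the core count the OS reports next to JULIA_CPU_CORES.
# Assumption: getconf _NPROCESSORS_ONLN or sysctl hw.ncpu exists on this
# system; if neither does, the OS count is reported as "unknown".
report_cores() {
    os_cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null) ||
        os_cores=$(sysctl -n hw.ncpu 2>/dev/null) ||
        os_cores=unknown
    echo "os_cores=$os_cores julia_cpu_cores=${JULIA_CPU_CORES:-unset}"
}

report_cores
```

Including this line in a report makes it clear whether a run used a single-core machine or only a single-core setting.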

My guess is that something happens during the credentials_cb call. It would be hard to track (it is a Julia-C-Julia call), and it would segfault without any relevant information.

Why would something inside credentials_cb, which is outside of the loop, cause the repeated ccall to git_remote_fetch to segfault? You mean like the FetchOptions ends up being constructed incorrectly or something, and messes things up when being passed back and forth between Julia and C?

Okay, updated LLDB output on latest master with the above git_remote_fetch in a loop: https://gist.github.com/ararslan/3eb7df6f83d21242d5e6d53719ff2efc

LLDB backtrace on FreeBSD 12.0-CURRENT with security.bsd.stack_guard_page=1, built with DISABLE_LIBUNWIND=1: https://gist.github.com/iblis17/b9eb213150b1da48a46c460bd310187b

Plot twist: libgit2 may be a red herring. Setting USE_SYSTEM_CURL=1 allows things to work fine without segfaulting. I tried applying all of the curl patches in the Ports tree to our curl but that didn't do it. A difference between the system curl and ours is that the system's is built with OpenSSL while ours is built with mbedTLS, so I'm inclined to think that could be related.

I tried copying my /usr/local/lib/libcurl.so to julia/usr/lib/, but I still got a segfault. :/

Using the system curl with USE_SYSTEM_CURL=1 no longer fixes this for me.

Never mind, it does—I just had to rebuild libgit2 and mbedTLS.
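
For anyone retrying this, the rebuild might look roughly like the sketch below. It is not a verified recipe: it assumes the Julia source tree's `deps` Makefile provides `distclean-libgit2` and `distclean-mbedtls` targets, and that GNU make is installed as `gmake`, as is usual on FreeBSD. The helper name and the `DRY_RUN` switch are hypothetical. With `DRY_RUN=1` the commands are printed instead of run.

```shell
#!/bin/sh
# Rebuild libgit2 and mbedTLS, then build against the system curl.
# Assumptions: run from the top of a Julia source checkout; the deps
# Makefile has distclean-libgit2 and distclean-mbedtls targets; GNU make
# is `gmake` unless MAKE says otherwise.
rebuild_with_system_curl() {
    make_cmd=${MAKE:-gmake}
    run=${DRY_RUN:+echo}   # "echo" when DRY_RUN is set, empty otherwise
    $run "$make_cmd" -C deps distclean-libgit2 distclean-mbedtls &&
        $run "$make_cmd" USE_SYSTEM_CURL=1
}

# Preview the commands without running them:
#   DRY_RUN=1 rebuild_with_system_curl
```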

I'm not getting a segfault with a stock build of Julia on FreeBSD 12.0-CURRENT, built from source at r326614 with the GENERIC-NODEBUG kernel and MALLOC_PRODUCTION. The stack guard page is enabled.

I just tried current master on FreeBSD 11.1 and it didn't segfault. I'm doing a git clean -fdx and rebuilding from scratch to make sure it's reproducible.

Neither Iblis nor I can reproduce this on 11.1 with current master, so I'm going to close this and call it resolved. I have no idea what changed in Base that made it work, though I might bisect it out of curiosity.
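
If someone does bisect for the change that made the segfault go away, note that `git bisect run` treats exit status 0 as the "old" state, 1 to 127 (except 125) as the "new" state, 125 as "skip", and aborts on anything above 127. A segfaulting repro exits with 139, so its status has to be remapped. The helper below is a hypothetical sketch of that mapping for a hunt where "old" means "still segfaults" and "new" means "fixed".

```shell
#!/bin/sh
# Map a repro's exit status onto what `git bisect run` expects when
# looking for the commit that FIXED the crash:
#   repro killed by a signal -> 0   (old behaviour: still broken)
#   repro succeeds           -> 1   (new behaviour: fixed)
#   any other failure        -> 125 (cannot tell; skip this commit)
bisect_verdict() {
    "$@" >/dev/null 2>&1
    status=$?
    if [ "$status" -eq 0 ]; then
        return 1
    elif [ "$status" -gt 128 ]; then
        return 0
    else
        return 125
    fi
}

# Hypothetical session (commit names are placeholders, and each step
# needs a rebuild of Julia before the repro is run):
#   git bisect start --term-old=broken --term-new=fixed
#   git bisect broken <commit-that-segfaults>
#   git bisect fixed <commit-that-passes>
#   git bisect run ./bisect_step.sh   # a script that rebuilds, then
#                                     # calls: bisect_verdict ./julia test.jl
```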