rake-compiler/rake-compiler-dock

[Ruby 3.2] having runtime issues on darwin

flavorjones opened this issue · 40 comments

@larskanis I'm hoping you can help me out here, I'm not sure what's going wrong in the toolchain.

I've built a version of Nokogiri that precompiles Ruby 3.2 shared objects using the docker images published here. (The pipeline to build and test Nokogiri is here if you're curious, and the PR is here).

For both x86_64-darwin and arm64-darwin (the latter I run manually on my dev machine), I see this runtime error:

/Users/runner/hostedtoolcache/Ruby/3.2.0-rc1/x64/lib/ruby/gems/3.2.0+3/gems/nokogiri-1.15.test.2022.1219.1745-x86_64-darwin/lib/nokogiri/extension.rb:7:in `require_relative': dlopen(/Users/runner/hostedtoolcache/Ruby/3.2.0-rc1/x64/lib/ruby/gems/3.2.0+3/gems/nokogiri-1.15.test.2022.1219.1745-x86_64-darwin/lib/nokogiri/3.2/nokogiri.bundle, 0x0009): Symbol not found: (_rb_cObject) (LoadError)
  Referenced from: '/Users/runner/hostedtoolcache/Ruby/3.2.0-rc1/x64/lib/ruby/gems/3.2.0+3/gems/nokogiri-1.15.test.2022.1219.1745-x86_64-darwin/lib/nokogiri/3.2/nokogiri.bundle'
  Expected in: '/Users/runner/hostedtoolcache/Ruby/3.2.0-rc1/x64/bin/ruby' - /Users/runner/hostedtoolcache/Ruby/3.2.0-rc1/x64/lib/ruby/gems/3.2.0+3/gems/nokogiri-1.15.test.2022.1219.1745-x86_64-darwin/lib/nokogiri/3.2/nokogiri.bundle

You can see this in the CI logs here. You can download this gem yourself if you like from the "Artifacts" section of this pipeline run:

Comparing the 3.2 dylib and the 3.1 dylib (which still works), both have _rb_cObject listed as Undefined, so I'm not actually sure the problem is in how the shared object is built. Do you have any ideas?

Interesting: when I install 3.2.0-rc1 using ruby-build then that version of Ruby is able to run this gem fine. So maybe this is a problem with how ruby/setup-ruby is building 3.2.0-rc1?

The _rb_cObject should be in libruby, but libruby isn't mentioned in the above error output. I guess it has something to do with this commit ruby/ruby@50d81bf . On Macos we use the same trick like on Linux, that unresolved symbols are automatically resolved to already loaded libraries. Therefore the nokogiri.bundle doesn't link to libruby explicit, but is still resolved at runtime.

Looks like if I build ruby with --enable-shared then this happens at runtime. When I build without that flag, ruby loads the bundle fine.

Here was a similar issue with nokogiri on Macos: sparklemotion/nokogiri#2063 (comment)

Wow, thanks for refreshing my memory, I have very little recollection of doing that work. 🤷

So at this point, I think the 3.2 bundle is fine ... do you agree?

I want to understand better what's changed about the Ruby build system, but as long as --disable-shared (which is implicit by default) is set, users should be fine. Do you agree?

Tagging @kateinoigakukun in case he spots something that's wrong with either the rake-compiler-dock build process or Ruby's makefiles.

@flavorjones Unfortunately, an extension library compiled with a --enable-shared'ed ruby is not compatible with a --disable-shared'ed ruby, and vice versa.

I think the compatibility has not been guaranteed but it just worked so far. The bundle_loader change changed the situation, so we can't assume that fat gems are portable among rubies configured with different options.

@kateinoigakukun Thanks for confirming.

@larskanis Do you know why the 3.2.0-rc1 ruby binaries installed by ruby/setup-ruby aren't compatible? I looked in the repo but nothing jumped out at me.

@flavorjones Which ruby binaries aren't compatible with what? Do you have a CI link?

@larskanis Darwin 3.2.0-rc1 binaries aren't compatible with the extensions being built by rake-compiler-dock.

CI link: https://github.com/sparklemotion/nokogiri/actions/runs/3735058984/jobs/6350173407

The CI config is:

  cruby-x86_64-darwin-install:
    needs: ["cruby-package"]
    strategy:
      fail-fast: false
      matrix:
        ruby: ["2.7", "3.0", "3.1", "3.2.0-rc1"]
    runs-on: macos-latest
    steps:
      - uses: actions/checkout@v3
        with:
          submodules: true
      - uses: ruby/setup-ruby@v1
        with:
          ruby-version: "${{matrix.ruby}}"
      - uses: actions/download-artifact@v3
        with:
          name: cruby-x86_64-darwin-gem
          path: gems
      - run: ./scripts/test-gem-install gems

and it's passing for the other Ruby versions.

#90 reproduces this error in rake-compiler-dock's CI pipeline

@larskanis I confirmed that the rubies built by ruby-builder (and downloadable from here) are built with --enable-shared.

Not sure what you think we should do here. I just pushed #91 to see what will break if we set --enable-shared on darwin.

Unfortunately, an extension library compiled with a --enable-shared'ed ruby is not compatible with a --disable-shared'ed ruby, and vice versa

@kateinoigakukun This change may have unintended consequences. Over the past two years, nokogiri precompiled for darwin has been downloaded 12,062,659 times, so I'm very worried that if we can no longer ship precompiled binaries for darwin that this is going to be a burden on both users and gem maintainers.

Is the recommended workaround to ship two bundles for darwin? One for rubies configured with --enable-shared and one for rubies configured with --disable-shared?

@flavorjones I've committed a patch to restore the compatibility nokogiri has been expecting so far: "An extension built with --disable-shared Ruby is loadable from --enable-shared Ruby".
ruby/ruby@c5eefb7

I think it would repair your build and you don't need to distribute two dylibs for darwin now :)

@kateinoigakukun Thank you! I will run that change through the rake-compiler-dock test suite!

Here's how I tested: I'm still using 3.2.0-rc1 as the cross-compiling Ruby, but added append_ldflags("-Wl,-flat_namespace") to Nokogiri's extconf.rb.

The good news is that a gem precompiled with that flag does run on darwin both with a "enable-shared ruby" and a "disable-shared ruby".

However, the bad news is that the libxml2 and libxslt symbols are no longer being resolved from within the bundle -- they are resolving against system libraries! I see these warnings:

WARNING: Nokogiri was built against libxml version 2.10.3, but has dynamically loaded 2.9.13
WARNING: Nokogiri was built against libxslt version 1.1.37, but has dynamically loaded 1.1.35
Here's the complete output from `nokogiri -v`
$ nokogiri -v
WARNING: Nokogiri was built against libxml version 2.10.3, but has dynamically loaded 2.9.13
WARNING: Nokogiri was built against libxslt version 1.1.37, but has dynamically loaded 1.1.35
# Nokogiri (1.14.0.dev)
    ---
    warnings:
    - Nokogiri was built against libxml version 2.10.3, but has dynamically loaded 2.9.13
    - Nokogiri was built against libxslt version 1.1.37, but has dynamically loaded 1.1.35
    nokogiri:
      version: 1.14.0.dev
      cppflags:
      - "-I/Users/flavorjones/.gem/ruby/3.2.0/gems/nokogiri-1.14.0.dev-arm64-darwin/ext/nokogiri"
      - "-I/Users/flavorjones/.gem/ruby/3.2.0/gems/nokogiri-1.14.0.dev-arm64-darwin/ext/nokogiri/include"
      - "-I/Users/flavorjones/.gem/ruby/3.2.0/gems/nokogiri-1.14.0.dev-arm64-darwin/ext/nokogiri/include/libxml2"
      ldflags: []
    ruby:
      version: 3.2.0
      platform: arm64-darwin22
      gem_platform: arm64-darwin-22
      description: ruby 3.2.0dev (2022-12-06T05:55:05Z v3_2_0_rc1 81e274c990) [arm64-darwin22]
      engine: ruby
    libxml:
      source: packaged
      precompiled: true
      patches:
      - 0001-Remove-script-macro-support.patch
      - 0002-Update-entities-to-remove-handling-of-ssi.patch
      - 0003-libxml2.la-is-in-top_builddir.patch
      - '0009-allow-wildcard-namespaces.patch'
      libxml2_path: "/Users/flavorjones/.gem/ruby/3.2.0/gems/nokogiri-1.14.0.dev-arm64-darwin/ext/nokogiri"
      memory_management: ruby
      iconv_enabled: true
      compiled: 2.10.3
      loaded: 2.9.13
    libxslt:
      source: packaged
      precompiled: true
      patches:
      - 0001-update-automake-files-for-arm64.patch
      datetime_enabled: true
      compiled: 1.1.37
      loaded: 1.1.35
    other_libraries:
      zlib: 1.2.13
      libiconv: '1.17'
      libgumbo: 1.0.0-nokogiri

These warnings are generated by comparing the version of libxml2 present in LIBXML_DOTTED_VERSION at compile time against the value present in the global variable xmlParserVersion at runtime.

They should always match, but in gems built this way, xmlParserVersion is being resolved from system libraries.

Looking at the symbols present in the bundle, the libxml2 symbols are properly in the text (code) section, denoted with a "T":

/opt/osxcross/target/bin/arm64-apple-darwin-nm nokogiri.bundle | fgrep xmlParserVersion
00000000000c21d0 T ___xmlParserVersion
00000000002aaa10 D _xmlParserVersion

@kateinoigakukun Do you understand why the bundle isn't being used as the primary location for resolving symbols? Should I play with -fvisibility=hidden and __attribute__((visibility("default")))?

I've pushed the workaround mentioned in my previous comment here, in case anyone needs to reproduce:

sparklemotion/nokogiri#2732

I'll link to the specific warnings when CI gets there.

Hmm, very strange, it works fine in the actions runner: https://github.com/sparklemotion/nokogiri/actions/runs/3767005588/jobs/6404353192#step:5:36

Maybe this is close enough for me to cut a release candidate of Nokogiri and get feedback from people.

OK, I'm not sure why I can't reproduce this in github runners, but I've reproduced now the symbol-loading error on three different macs.

When I run vmmap on a ruby process, I can see a /usr/lib/libxml2.2.dylib entry which I believe is mapped to /Library/Developer/CommandLineTools/SDKs/MacOSX12.3.sdk/usr/lib/ (the xcode CLT libraries). (This is also true for 3.1 rubies, so nothing new there.)

But I believe what is new is that now function calls from the nokogiri.bundle are no longer resolved to symbols that exist in the bundle, but to symbols in the ruby process. This makes sense given the explanation of the -flat_namespace change.

I'm not sure how to get unblocked from here.

OK, I made some progress with trying -load_hidden when I add the static libraries for libxml2 and libxslt to the bundle. I might be unblocked.

See sparklemotion/nokogiri@1714e77 for what seems to work.

Well, -load_hidden doesn't completely work, either -- it means all symbols are hidden and that breaks the ABI of the extensions which some other gems (like nokogiri-xmlsec) depend on (nokogiri-xmlsec needs to use the same libxml2 library that nokogiri uses).

@kateinoigakukun This all seems really bad. See some conversation with another nokogiri maintainer at sparklemotion/nokogiri#2732 where we're proposing to try to undo these changes when we build the extension.

@kateinoigakukun Please see https://github.com/stevecheckoway/bundle_test for a working example of why -flat_namespace is not a sufficient solution.

nurse commented

I reverted katei's fix with -flat_namespace from Ruby 3.2.0. I continue to discuss how to fix dynamic_lookup problem toward Ruby 3.2.1.

Please see https://github.com/stevecheckoway/bundle_test for a working example of why -flat_namespace is not a sufficient solution.

Thank you for clarifying the problem. What I am wondering now is that is the problem is really the problem?

I fixed the files as below and its result is as the same as flat_namespace. Therefore if a gem cause a problem with flat_namespace, it doesn't work on Linux.

foo_version(): 1
init()
function_in_main()
foo_version(): 1
diff --git a/Makefile b/Makefile
index 9f6f601..8df8164 100644
--- a/Makefile
+++ b/Makefile
@@ -10,29 +10,29 @@ endif

 .PHONY: all clean

-all: main main_shared extension.bundle
+all: main main_shared extension.so

-main_shared: main.c libmain.dylib libfoo1.dylib
+main_shared: main.c libmain.so libfoo1.so
        $(CC) -o $@ -DENABLE_SHARED=1 main.c -L. -lmain -lfoo1

-main: main.c libfoo1.dylib
-       $(CC) -o $@ main.c -L. -lfoo1
+main: main.c libfoo1.so
+       $(CC) -o $@ main.c -L. -rdynamic -ldl -lfoo1

-libmain.dylib: main.c libfoo1.dylib
-       $(CC) -dynamiclib -DLIBMAIN=1 -o $@ main.c -L. -lfoo1
+libmain.so: main.c libfoo1.so
+       $(CC) -shared -fPIC -DLIBMAIN=1 -o $@ main.c -L. -ldl -lfoo1

-libfoo1.dylib: foo.c
-       $(CC) -dynamiclib -DFOO_VERSION=1 -o $@ $<
+libfoo1.so: foo.c
+       $(CC) -shared -fPIC -DFOO_VERSION=1 -o $@ $<

-extension.bundle: extension.c libfoo2.a
-       $(CC) -o $@ -bundle $(NAMESPACE_OPTS) extension.c libfoo2.a
+extension.so: extension.c libfoo2.a
+       $(CC) -o $@ -shared -fPIC extension.c libfoo2.a

 foo2.o: foo.c
-       $(CC) -c -o $@ -DFOO_VERSION=2 $^
+       $(CC) -c -fPIC -o $@ -DFOO_VERSION=2 $^

 libfoo2.a: foo2.o
        $(RM) $@
        ar -cr $@ $^

 clean:
-       $(RM) main main_shared *.bundle *.dylib *.o *.a
+       $(RM) main main_shared *.bundle *.so *.o *.a
diff --git a/main.c b/main.c
index 30ba3fe..946fe9c 100644
--- a/main.c
+++ b/main.c
@@ -14,7 +14,7 @@ static
 #endif
 int main_implementation() {
     // Load the library.
-    void *handle = dlopen("extension.bundle", RTLD_LAZY | RTLD_GLOBAL);
+    void *handle = dlopen("extension.so", RTLD_LAZY | RTLD_GLOBAL);
     if (!handle) {
         fprintf(stderr, "dlopen(): %s", dlerror());
         return 1;

Thank you all 🙏 Will have a closer look next month.

@nurse Thanks for asking this question. Although I don't have time today to show that this works as expected on Linux, Nokogiri depends on this behavior working (calls from within the extension should resolve to symbols in static libraries linked) and I should be able to demonstrate that in the next day or two.

Ah - it just occurred to me that this may not be a problem on Linux unless libxml2 has been loaded.

The problem on Darwin is that XCode CLT appears to always load libxml2 into Ruby processes (which you can see if you run vmmap on any running Ruby process), at least for Ruby binaries built on the machine (via ruby-build or similar).

The answer may be that we hide those symbols, which seems to work the way we need, but might break gems that try to link against Nokogiri. I'm ok with breaking that compatibility.

@nurse You're right, of course: I had to work around symbol collision a few years back in sparklemotion/nokogiri#2106

I think the big problem here now is that, on Darwin, the Ruby process seems very likely to have loaded the system libxml2 (if Ruby was built with XCode CLT), and so suddenly all of libxml2 and libxslt are colliding with already-loaded symbols.

The two options I know of to fix symbol resolution on Darwin are described here: #92 (comment)

I've written up a brief summary of how Nokogiri is working around this at nokogiri/2022-12-darwin-symbol-resolution.md at main · sparklemotion/nokogiri · GitHub

OK, after shipping an RC for nokogiri native gems, and working on sqlite3-ruby native gems, it looks like everybody who precompiles for Darwin and Ruby 3.2 will need to add something like this to their extconf:

          if cross_build? &&
             darwin? &&
             RbConfig::CONFIG["ruby_version"] >= "3.2"
            append_ldflags("-Wl,-flat_namespace")
          end

@kateinoigakukun Do you have any ideas on how to make this situation better, potentially in a future Ruby 3.2.1 release?

@larskanis Do you think we should try to add this flag into the link line for users from within rake-compiler-dock? Possibly by modifying the DLDFLAGS in the rake-compiler rbconfig.rb files?

@larskanis Do you think we should try to add this flag into the link line for users from within rake-compiler-dock? Possibly by modifying the DLDFLAGS in the rake-compiler rbconfig.rb files?

Yes, I think so. We shouldn't demand these lines.

@larskanis Do you think we should try to add this flag into the link line for users from within rake-compiler-dock? Possibly by modifying the DLDFLAGS in the rake-compiler rbconfig.rb files?

Yes, I think so. We shouldn't demand these lines.

OK, will put together a PR

If we merge #94, then I think we should cut a release soon.

Then I can confidently document how to deal with global symbol resolution in https://github.com/flavorjones/ruby-c-extensions-explained. (I think we need to start recommending that everyone build their extension (and all libraries that are statically linked into it) with -fvisibility=hidden and then only export the Init_* function, like I did in sparklemotion/sqlite3-ruby#371.)

I've merged #94 and I think we can cut a release. I'm building another rc for sqlite3-ruby based on images generated from ce619f2 and they look good.

@larskanis I think you're still the only owner of the dockerhub account, but if there is anything I can do to help with a release, please let me know?

@flavorjones As far as I can see, there is no way to add other contributors to the docker hub account. The only way is to convert "larskanis" to an organization, which doesn't feel like a good solution. Instead I think it makes sense to create a new organization and publish the images there? Maybe the name "rake-compiler" like on Github? Then we can change the name in the gem as well.

Nevertheless I can do the release and publish the images on the "larskanis" account.

I did some tests on my side (with fxruby, libusb, ffi and ruby-pg) and the latest images worked well. I found only one issue with RbConfig::CONFIG['host'], which returns x64-w64-mingw32 on ruby-3.2 on x64-mingw-ucrt instead of the x86_64-w64-mingw32 of ruby-3.1 and so it differs from the compiler name prefix. But I found that the host variable is already different on the linux images, so that I changed the code in fxruby to work with the latest images.

@larskanis I'd like to again propose that we use ghcr.io instead of dockerhub as the future home of these images. I started this work to demonstrate it could work with the weekly snapshots. We could similarly automate creation of the versioned images (and we can cache layers to ensure that image diffs stay small between patch releases).

@larskanis Would it be helpful if I did the work to cut over to ghcr.io to be the primary registry for RCD images so that I'm able to cut a release? You could then sync those images to dockerhub for convenience for folks. I should have some time this week to do that if you're not able to cut a release.

I didn't work with ghcr.io so far, so I'd be happy if you could prepare it. I don't care too much about the platform we use to distribute our images. So if ghcr.io fulfills our needs, I think we can switch the gem to it with no need to mirror the images on docker hub.

See #95! ❤️

Closing this, I think we have a path forward for Ruby 3.2.