crystal-lang/crystal

Illegal instruction when using libssl

j-m-harris opened this issue · 7 comments

Bug Report

We've been experiencing a problem with amqproxy which was originally submitted as cloudamqp/amqproxy#174

However I suspect the problem could lie in crystal itself.

Upon negotiating the connection to the upstream server, amqproxy stops with illegal instruction.

  • We only see this on some machines, AMD EPYC with avx512 capability. Perhaps a red herring.
  • We only see this when using SSL.
  • We see this with crystal 1.13.3, 1.13.2 and 1.12.2
  • We're using alpine based containers, 84codes/crystal:1.13.2-alpine

The backtrace is:

#0  0x00ffffffffffe001 in ?? ()
#1  0x0000800000000000 in ?? ()
#2  0x00007ffff4d26990 in ?? ()
#3  0x00007ffff7866c60 in ?? ()
#4  0x00005555556ed1f2 in new (__arg0=0x7f0000008000 <error: Cannot access memory at address 0x7f0000008000>, __arg1=32767) at /usr/share/crystal/src/slice.cr:70
#5  0x00005555557e9939 in fill_buffer (self=0x7ffff7866c60) at /usr/share/crystal/src/io/buffered.cr:272
#6  0x00005555557e96a3 in read (self=0x7ffff7866c60, slice=...) at /usr/share/crystal/src/io/buffered.cr:93
#7  0x00005555557e6230 in read_fully? (self=0x7ffff7866c60, slice=...) at /usr/share/crystal/src/io.cr:542
#8  0x00005555557e60fb in read_fully (self=0x7ffff7866c60, slice=...) at /usr/share/crystal/src/io.cr:525
#9  0x0000555555878674 in from_io (io=0x7ffff7866c60) at /tmp/lib/amq-protocol/src/amq/protocol/frames.cr:65
#10 0x000055555583b600 in read_loop (self=0x7ffff7864780, socket=0x7ffff7866c60) at /tmp/src/amqproxy/upstream.cr:67
#11 0x000055555583b4f4 in read_loop (self=0x7ffff7864780) at /tmp/src/amqproxy/upstream.cr:63
#12 0x00005555556a90b9 in -> () at /tmp/src/amqproxy/channel_pool.cr:44
#13 0x00005555557295a5 in run (self=0x7ffff785bb40) at /usr/share/crystal/src/fiber.cr:143
#14 0x00005555556a282d in -> (f=0x7ffff785bb40) at /usr/share/crystal/src/fiber.cr:95
#15 0x0000000000000000 in ?? ()

Afaik crystal uses LLVM, whose trap intrinsic will generate a ud2 instruction. That could mean that SIGILL is a red herring for an invalid memory access. Or am I barking up the wrong tree completely?

The default approach is to use libssl as a shared object, but interestingly we have been unable to reproduce if shards build --static is used instead.

If linking a different build of libssl doesn't reproduce the issue, that indicates a problem with the library. For the Crystal bindings it's irrelevant whether libssl is linked statically or dynamically. Codegen is identical in both cases.
To further investigate this hypothesis, I'd suggest to try different builds of OpenSSL. Maybe build it locally to ensure compatibility with the host CPU.

@straight-shoota thanks for the response.

I agree the libs have the potential to be different when static linking is used:

  • static linking would use the libssl from alpine:latest from when the 84codes/crystal image was built.
  • dynamic linking would use libssl from alpine:latest from when amqproxy was built.

I'll investigate precisely which versions are being used in each scenario.

If you build and run in the 84codes/crystal image, both static and dynamic versions of the library are the ones available in that image. But static and dynamic versions have some differences and these could be relevant.

Pretty sure they aren't the same, as the dockerfile has multiple sources: https://github.com/cloudamqp/amqproxy/blob/main/Dockerfile#L8

We don't need to discuss how amqproxy structures their build in this project though, out of scope.

This actually provides a quite relevant bit of information.

The app is built on 84codes/crystal:1.13.2-alpine linking against the libssl.so.3 included in that image.
But then the binary is copied to a different image based on alpine:latest. So when you execute it, it uses libssl.so.3 from that image. And that could be a different library version than the one from the build image.
Apart from binary compatibility issues, if the libssl version at load time is different from compile time, the Crystal bindings' expectations about library features is incorrect.
Generally, the loader should error if it detects that the runtime library is incompatible with the linked one. But this might not be 100% accurate. It's generally best to use the same library version in the builder and the runner.
Static linking trivially ensures this, of course.

Agreed. I'll see whether using dynamic linking with libraries copied from the builder image is also stable at runtime.

Using dynamic linking with the libraries from the builder image does appear to be stable, so we'll raise an issue with amqproxy for their build approach.

if the libssl version at load time is different from compile time, the Crystal bindings' expectations about library features is incorrect.

@straight-shoota Thanks for confirming this, and taking a look.