awslabs/aws-crt-nodejs

Segmentation Fault immediately on require inside Worker threads on Linux

mikkopiu opened this issue · 6 comments

Describe the bug

When using Node.js Worker threads, a Segmentation fault (core dumped)/SIGSEGV is triggered as soon as aws-crt is imported/loaded, or more specifically, when its native binary is loaded.

For me, this first appeared after upgrading a project to AWS SDK for JavaScript v3: a test case run via ava (which uses Worker threads) started segfaulting immediately when it imported a module that invoked new FirehoseClient({}) (which, in turn, imports/uses aws-crt).
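
For illustration, a minimal sketch of what the failing setup looked like (the file name and exact wiring are placeholders, not the actual project code):

    // firehose.js (placeholder module name for the code under test):
    // merely importing this from an ava Worker thread was enough to segfault,
    // since constructing the SDK v3 client ends up loading aws-crt in this setup.
    import { FirehoseClient } from "@aws-sdk/client-firehose";

    export const client = new FirehoseClient({});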

Expected Behavior

I expected aws-crt to either throw the exception implemented in #290 or simply work when used from Worker threads (based on #451, though I might be misunderstanding it).

Ideally, I'd be able to run tests using aws-crt concurrently with ava (using Worker threads).
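
For context, one possible stopgap (just a sketch, assuming the worker code doesn't itself need aws-crt) would be to gate the load on worker_threads.isMainThread so that Worker threads never touch the native binary at all:

    // Sketch of a possible stopgap, not a fix: only load aws-crt on the main thread.
    import { isMainThread } from "node:worker_threads";

    if (isMainThread) {
      // Dynamic import so that Worker threads never load the native aws-crt binary.
      await import("aws-crt");
    }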

Current Behavior

Immediate Segmentation fault (core dumped) upon require('aws-crt') (or equivalent).

As I'm not too familiar with debugging C/C++, my attempts probably contain a lot of red herrings, but here are some of my findings so far:

  1. Using llnode (an lldb plugin) on the core dump from the minimal repro, the backtrace at least looks odd:

    $ llnode /usr/bin/node -c /tmp/core.123
    (llnode) v8 bt
    * thread #1: tid = 487, 0x00007f4b814c1450, name = 'node', stop reason = signal SIGSEGV
    * frame #0: 0x00007f4b814c1450
    frame #1: 0x00007f4b8d256df0 libc.so.6`__restore_rt
    frame #2: 0x00007f4b814c1450
    frame #3: 0x00007f4b8d256df0 libc.so.6`__restore_rt
    ... Repeated >5600 times
    frame #5691: 0x00007f4b8d256df0 libc.so.6`__restore_rt
    frame #5692: 0x00007f4b815b7510
    frame #5693: 0x00007f4b8d29e931 libc.so.6`__GI___nptl_deallocate_tsd + 161
    frame #5694: 0x00007f4b8d2a16d6 libc.so.6`start_thread + 422
    frame #5695: 0x00007f4b8d241450 libc.so.6`__clone3 + 48
  2. Trying to run the binary directly under lldb crashes with SIGSEGV: address access protected:

    $ chmod +x dist/bin/linux-x64/aws-crt-nodejs.node
    $ lldb dist/bin/linux-x64/aws-crt-nodejs.node
    (lldb) run
    Process 4291 launched: '/aws-crt-nodejs/dist/bin/linux-x64/aws-crt-nodejs.node' (x86_64)
    Process 4291 stopped
    * thread #1, name = 'aws-crt-nodejs.', stop reason = signal SIGSEGV: address access protected (fault address: 0x7ffff7a8a000)
        frame #0: 0x00007ffff7a8a000 aws-crt-nodejs.node
    ->  0x7ffff7a8a000: jg     0x7ffff7a8a047
    (lldb) memory read 0x7ffff7a8a000
    0x7ffff7a8a000: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00  .ELF............
    0x7ffff7a8a010: 03 00 3e 00 01 00 00 00 00 00 00 00 00 00 00 00  ..>.............

Reproduction Steps

I've been trying to identify the meaningful variables, but the most reliably reproducible example so far is the following (based on #286 (comment)):

  1. Start an EC2 instance with AMI al2023-ami-2023.0.20230503.0-kernel-6.1-x86_64 (latest Amazon Linux 2023 HVM at the time of writing)

    • or an equivalent Linux host; the exact flavour and kernel version don't seem to matter much (or I might just be really unlucky)
  2. On the host, install Node.js: yum install nodejs (from the built-in repos, it's 18.12.1 at the time of writing)

  3. Enable core dumps: ulimit -c unlimited

  4. Create repro files and run:

    cd $(mktemp -d)
    echo '{"name": "repro","type": "module","dependencies": {"aws-crt": "1.15.16"}}' > package.json
    npm install
    echo 'import { Worker } from "worker_threads"; const worker = new Worker("./reproWorker.js");' > index.js
    echo 'import "aws-crt";' > reproWorker.js
    node index.js
    # -> Segmentation fault (core dumped)
    • In my attempts, this also reproduces with all the versions listed below, as well as when I built aws-crt from source and required aws-crt-nodejs/dist/index.js (or the linux-x64 binary directly in CommonJS; see the sketch after this list)
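
The CommonJS variant mentioned above is roughly the following sketch, spawned the same way from index.js; the relative path to the local source build is illustrative:

    // reproWorker.cjs (sketch, spawned via new Worker("./reproWorker.cjs") from index.js)
    // Requiring a local source build of aws-crt-nodejs also segfaults:
    require("../aws-crt-nodejs/dist/index.js");
    // ...as does loading the native addon directly:
    // require("../aws-crt-nodejs/dist/bin/linux-x64/aws-crt-nodejs.node");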

Possible Solution

No response

Additional Information/Context

If I'm right that aws-crt is supposed to work under Worker threads, this is probably actually an upstream Node.js issue, but as mentioned, I'm not familiar enough with C/C++ or Worker thread internals to confirm.

Here are all the setups I've been able to reproduce this with:

Versions of aws-crt:

  • 1.15.9
  • 1.15.16
  • Local version built from source at commit aafdfee

Node.js:

  • 16.19.1
  • 18.16.0
  • 18.12.1

Operating systems:

  • First saw this in a Docker container based on amazonlinux:2, running on an Ubuntu-based host
    • Linux hostname 5.15.0-1033-aws #37~20.04.1-Ubuntu SMP Fri Mar 17 11:39:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Reproduced in a Debian Bullseye container on an Alpine Linux-based host
    • Linux hostname 5.15.82-0-virt #1-Alpine SMP Mon, 12 Dec 2022 09:15:17 +0000 x86_64 x86_64 x86_64 GNU/Linux
  • Reproduced in an Amazon Linux 2023 container on a Fedora-based host
    • Linux hostname 6.2.13-300.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Apr 27 01:33:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • and a matrix of all of the above (images/hosts/kernels)
  • Reproduced in an Amazon Linux 2023 VM, to rule out the effects of Docker
    • Linux hostname 6.1.25-37.47.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Apr 24 23:20:16 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Does NOT reproduce in macOS 13.3.1 (Ventura); Intel and M1 machines
    • and weirdly, the original ava setup works with Worker threads enabled if I just use the darwin-x64 binary on Linux (cp -a node_modules/aws-crt/dist/bin/darwin-x64/aws-crt-nodejs.node node_modules/aws-crt/dist/bin/linux-x64/aws-crt-nodejs.node)

Memory:

  • Tested on Docker containers with 4 & 8 GB memory limits
  • Tested on VMs with 16 and 32 GB of RAM

Other:

  • I'm not sure of the glibc etc. versions in every case (especially as I'm unfamiliar with C/C++ tooling and what exactly would be relevant), but at least for the minimal repro case above, glibc is 2.34 (from the Amazon Linux 2023 repos)

aws-crt-nodejs version used

1.15.16

nodejs version used

18.12.1

Operating System and version

Amazon Linux 2023, AMI: al2023-ami-2023.0.20230503.0-kernel-6.1-x86_64, uname -a: Linux hostname 6.1.25-37.47.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Apr 24 23:20:16 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

This is the thread local storage crash issue mentioned near the bottom of this: aws/aws-iot-device-sdk-js-v2#360

The current plan is to push through the linked s2n patches and switch from aws-lc to OpenSSL's libcrypto, which doesn't have the thread-local storage destruction problem. I don't have an ETA at the moment.

We currently have exactly the same problem, which is blocking us from upgrading from aws-sdk v2 to v3. Hopefully we get a fix soon 🙏

Same issue reproduced when running Node 17.7 on ARM64 with the @aws-sdk/client-cognito-identity-provider package, which indeed pulls in aws-crt and causes a SIGSEGV.
(I specifically run this in a Docker Alpine container, where it exits with error EXITED(139).)

xer0x commented

+1 🙏 This has been very vexing for our team! Thank you for investigating! This has broken our AWS CDK build process.

https://github.com/awslabs/aws-crt-nodejs/releases/tag/v1.15.19 should fix this crash.

We will update the v2 IoT SDK for JavaScript shortly. For other dependency updates, please contact the maintainers of those packages directly.