chakra-core/ChakraCore

Incompatibility in handling of SIGSEGV between ChakraCore and CoreCLR

Closed this issue · 21 comments

I discovered an incompatibility in handling of the segmentation fault signal between the ChakraCore engine and CoreCLR. The issue IMO impacts all recent ChakraCore versions. It is triggered by a segmentation fault in user code (e.g. any NullReferenceException) in a CoreCLR-driven process and manifests itself as a stack overflow / stack smashing that crashes the process.

The following occurs after the ChakraCore engine is loaded into a CoreCLR process when a segmentation fault is triggered:

Process is terminating due to StackOverflowException.
Aborted (core dumped)

The core dump contains the following stack:

frame #0: 0x00007ff9a96d7fff libc.so.6`gsignal + 207
frame #1: 0xfffffffe7fffffff
frame #2: 0x00007ff9a979e1f7 libc.so.6`__fortify_fail + 55
frame #3: 0x00007ff9a979e1c0 libc.so.6`__stack_chk_fail + 16
frame #4: 0x00007ff44c0e94a2 libChakraCore.so`sigsegv_handler(int, siginfo_t*, void*) + 171

When debugging the crash in lldb we can see a more informative stack:

* thread #1, name = 'dotnet', stop reason = signal SIGABRT
  * frame #0: libc.so.6`__GI_raise(sig=2) at raise.c:51
    frame #1: libc.so.6`__GI_abort at abort.c:79
    frame #2: 0x00007f7616872e73 libcoreclr.so`PROCAbort + 19
    frame #3: 0x00007f761683c262 libcoreclr.so`sigsegv_handler(int, siginfo_t*, void*) + 338
    frame #4: 0x00007f760df08187 libclrjit.so`sigsegv_handler(int, siginfo_t*, void*) + 247
    frame #5: 0x00007f757571146b libChakraCore.so`sigsegv_handler(int, siginfo_t*, void*) + 171

According to my investigation, the stack overflow is only virtual. CoreCLR registers its SIGSEGV handler with the the SA_ONSTACK flag that instruct the kernel to execute the handler on a separate stack. On the other hand, ChakraCore registers its handler without the flag. The discrepancy is IMO the reason why the CoreCLR SO detection malfunctions and aborts the process. It seems that the SO detector cannot deal with the fact that the SIGSEGV handler is not executing on the separate stack CoreCLR prepared during PAL initialization.

A small repro project can be downloaded here. Crashes consistently on Ubuntu 18.04.

Once I patched the registration of the SIGSEGV handler in ChakraCore (to align it with the CoreCLR version) the crash was gone:

73c73
< static void handle_signal(int signal_id, SIGFUNC sigfunc, struct sigaction *previousAction);
---
> static void handle_signal(int signal_id, SIGFUNC sigfunc, struct sigaction *previousAction, int additionalFlags = 0);
120c120
<     handle_signal(SIGSEGV, sigsegv_handler, &g_previous_sigsegv);
---
>     handle_signal(SIGSEGV, sigsegv_handler, &g_previous_sigsegv, SA_ONSTACK);
579c579
< void handle_signal(int signal_id, SIGFUNC sigfunc, struct sigaction *previousAction)
---
> void handle_signal(int signal_id, SIGFUNC sigfunc, struct sigaction *previousAction, int additionalFlags)
583c583
<     newAction.sa_flags = SA_RESTART;
---
>     newAction.sa_flags = SA_RESTART | additionalFlags;
591a592,598
>
> #ifdef INJECT_ACTIVATION_SIGNAL
>     if ((additionalFlags & SA_ONSTACK) != 0)
>     {
>         sigaddset(&newAction.sa_mask, INJECT_ACTIVATION_SIGNAL);
>     }
> #endif

I don't know whether the issue lies within the CoreCLR SO detector or ChakraCore. Either the handler registration has to be reconciled or the SO detector has to account for the fact that an incompatible SIGSEGV handler may be registered later in the process.

The issue has been previously reported as #4893 and #5781. There may be also another incompatibility between ChakraCore and CoreCLR that prevents analysis of managed core dumps (maybe the ChakraCore PAL is an incompatible copy-paste from CoreCLR?).

@tomasdeml Try to upgrade the JavaScriptEngineSwitcher.ChakraCore.Native.linux-x64 package to version 3.0.6 in your repro project.

I checked my recommendation myself. Unfortunately, this problem is relevant for the ChakraCore version 1.11.6.

Yes, the issue plagues all versions of ChakraCore.

@tomasdeml This isn't an area I'm very familiar with, but it looks like you figured out a fix? If you make a PR to fix the issue I'm happy to merge it.

@MikeHolman I don't think my workaround constitutes a proper fix. The SA_ONSTACK flag usually requires the application to preallocate and install an alternate signal stack for the handler to execute on. When such stack is not provided, the handler will execute on the default stack. ChakraCore does not allocate an alternate signal stack today but CoreCLR does and that is why the workaround "just works".

A similar issue has been raised and solved in coreclr, see dotnet/coreclr#16276. The question is why the fix does not apply in this case...

any news on this?

any news on this ? Since it is severity 1 and actually it is, what can we do? We have the same issue in prod environment and had to change the OS ! back to windows again. We need help please asap if possible.

We were using JavaScriptEngine Switcher and switched implantation from ChakraCore to Jint, which works on DotNetCore on Linux without issue. https://github.com/Taritsyn/JavaScriptEngineSwitcher

Thank you in advance. We will try it asap.

Last update: JInt format was not suitable with our js's. We used Jurassic and it seems ok. Tomorow we will be doing a high scale test. Thank you again.

I am experiencing the exact same problem described here. It is safe to assume this won't be addressed any time in the near future?

I'll take a look into this next week; thanks for reporting this, and thanks for your patience!

Any news on this?

@divmain @rhuanjl I know a lot of people who, due to this error, cannot use the ChakraCore library in .NET Core applications on Linux. Yesterday, Nikita Tsukanov (@kekekeks) suggested that I build a native assemblies with the Tomáš Deml's patch applied, source code of which is contained in this discussion, and test it. For testing, I used my test suite, which I ran on 64-bit versions of Windows 10, Ubuntu 16.04.1 and macOS 10.12. All tests worked correctly, and most importantly, periodically occurring errors on Ubuntu completely disappeared. Therefore, I included these assemblies in the JavaScript Engine Switcher version 3.4.5. I suggest adding the Tomáš Deml's patch along with security patches to the next release of ChakraCore v1.11.

@tomasdeml, I think it makes sense to submit your patch as PR.

@Taritsyn We can include this fix in planned future community releases (ETA ~June for the first such)

In terms of getting it applied to 1.11 by microsoft that is a question for @divmain

For private usage of chakracore in the meantime an individual can of course apply this fix themselves and build their own chakracore, not ideal I know - but at least there is a fix. If we can create a minimal repro to test it we could add the fix to the master branch in the next couple of weeks.

Sure, I can create a PR with the patch. However, I am not sure whether it won't break other ChakraCore hosts, e.g. C/C++ applications embedding the engine:

I don't think my workaround constitutes a proper fix. The SA_ONSTACK flag usually requires the application to preallocate and install an alternate signal stack for the handler to execute on. When such stack is not provided, the handler will execute on the default stack. ChakraCore does not allocate an alternate signal stack today but CoreCLR does and that is why the workaround "just works".

Any ideas?

@tomasdeml
There could be a special API that would enable .NET Core compatible behavior

This sounds like it could do with more investigation by someone a bit more familiar with PAL and CLR. The PAL within ChakraCore is a very old copy&paste from CoreCLR that has then been occasionally edited by the ChakraCore team.

I'm not really familiar with Core CLR - we could certainly add an API to a future version of ChakraCore if that's what's needed BUT I'd say there's 0 chance of getting that into a 1.11 patch - would there alternatively be a way that ChakraCore could detect if it's being used with .NET Core and react accordingly?

We can include this fix in planned future community releases (ETA ~June for the first such)

It would be nice.

For private usage of chakracore in the meantime an individual can of course apply this fix themselves and build their own chakracore, not ideal I know - but at least there is a fix.

For now, I will continue to release new versions of the JavaScript Engine Switcher with this patch, because only .NET developers use it.

For another thought - in the master branch of CC we could try updating signal.cpp to the latest version from the CoreCLR and seeing if that solves this issue, our current version (from the CORECLR of about 4 years ago is vastly different to the latest) - in general it may be worth updating a lot of the PAL files to newer versions.

EDIT: in the alternative/or additionally we could explore deleting/removing a lot of this code; I'm unsure if ChakraCore internally needs this signal handling, not sure if someone else could weigh in on that? For reference:

  1. a lot of it was removed some time ago with a comment left to consider further clean up later:
    #2086
  2. much of it was added apparently to enable building of GCStress - a testing tool built alongside CC rather than CC itself - it's unclear why code added for that is included in CC itself (EDIT2: looks like it is used more heavily than the PR comments indicate on linux, wholesale removal is unlikely to be possible):
    #408