Napi::Value is not properly initialized the first time a program is run
martenrichter opened this issue · 22 comments
I am maintaining a node.js addon that heavily switches between C++ and node.js code.
Today, I found an issue that happens only once at every program execution.
I have the following code:
auto val = Napi::Value::From(alarmjs_->Env(), timedelay);
if (val.ToNumber().DoubleValue() != timedelay) {
printf("val %lg %lg\n", val.ToNumber().DoubleValue(), timedelay);
val = Napi::Value::From(alarmjs_->Env(), timedelay);
printf("val2 %lg %lg\n", val.ToNumber().DoubleValue(), timedelay);
}
The if clause checks if the correct number is assigned to the Napi::Value.
The first time this code is called, the assignment is not correct.
Here are some examples:
Server process
val 9.61433e-312 200000
val2 200000 200000
Client process
val 1.3559e-311 3999.79
val2 3999.79 3999.79
Server process second run
val 1.21757e-311 200000
val2 200000 200000
Client process second run
val 6.54648e-312 3999.79
val2 3999.79 3999.79
The value changes, so it smells like uninitialized memory.
So far, I have tested it on a Windows build with node version v18.15.0 built by clang version 14.0.4 and "node-addon-api": "^7.0.0".
The JS part calls C++, and the JS part then calls C++, which might again call JS, etc. It is a very involved network add-on.
The problem may not be in node-addon-api but in node.js itself or v8, or I may be doing something stupid.
It does not occur on Linux with node version "v18.18.0." Also, it failed in a GitHub action on Windows, which uses node version v18.18.0.
Hi @martenrichter ,
You mentioned a GitHub Actions run. Do you have a repository with this code we can look at?
Sure:
https://github.com/fails-components/webtransport
the code in question is here:
https://github.com/fails-components/webtransport/blob/9e2309936d7f4273f1c903873c30789305037572/transports/http3-quiche/src/napialarmfactory.h#L137
You can run the tests in the root directory with:
npm install
npm dobuild
npm run test:node
make sure to uncomment the diagnosis code above.
It is in the package
https://github.com/fails-components/webtransport/tree/master/transports/http3-quiche
which is called by the main package
https://github.com/fails-components/webtransport/tree/master/main
It may be that it only occurs with the windows-2022 image and not the windows-2019 image, as I had to change it.
You find the cmake.txt here:
https://github.com/fails-components/webtransport/blob/master/transports/http3-quiche/CMakeLists.txt
So, it has a different clang version.
But it is a very complex project. I am sorry that I could not distill it to a small sample, but yesterday, at the end of the day, I was very happy to locate the source of the problem.
Hi @martenrichter ,
I broke your issue down into a simple Node API method:
Napi::Value Method(const Napi::CallbackInfo &info) {
Napi::Env env = info.Env();
double timedelay = 200000;
auto val = Napi::Value::From(env, timedelay);
if (val.ToNumber().DoubleValue() != timedelay) {
Napi::Error::New(env, "Val does not match! " +
std::to_string(val.ToNumber().DoubleValue()))
.ThrowAsJavaScriptException();
return Napi::Value();
}
return Napi::Boolean::New(env, true);
}
I cannot get this to throw the error on both Node 18.15.0 and 18.18.0 with GitHub Actions using windows-2019, windows-2022, and ubuntu-20.04.
Is your issue also present on other node versions? Specifically the latest 18.20.1 and 20.12.1?
I have tested it to 18.20.1 on windows, and it fails as with the earlier versions. (For linux it was never an issue).
I have also copied the code directly to the library initialization method, and it fails also there immediately.
So, an influence of the calling sequence can be ruled out.
I would expect that it depends on the compiler. I use cmake-js with -t', 'ClangCL'
(in version 14.0.4). The choice is caused by some libraries that do not build with the standard MSVC compiler.
Okay, I have tried to make a simple example, but it did not fail (but I think node-addon-api and/or cmake-js triggered a reinstall of build tools). My main project continued to fail in the initial Javascript call. After a complete rebuild, it does not fail anymore. So, I believe there must have been a version update in the background that fixed it. I will now try to see if it is also fixed in GitHub actions.
Here is the pull request, with removed workaround:
fails-components/webtransport#287
let's see if it fails.
Ok, it fails inside the github environment.
It worked because I accidentally switched back to the MS Visual Studio 2019 generator, but the problem only occurred for MS Visual Studio 2022.
Anyway, here is a repo with a minimal example, including a GitHub action, that fails reproducible:
https://github.com/martenrichter/bugnodeaddonapi
and of course, it is unclear if it is a bug in node-addon-api, node, v8, or the compiler.
Here:
Line 900 in 2f490a3
The value is not updated, and the code never updates the result; I have placed some printfs around.
I have tested it against a freshly build node (git master branch from the weekend), and it also occurs.
My v8 knowledge is limited but maybe I can add some debugging code into the calls to see what is happening.
Hi @martenrichter ,
Thanks for the repro code! Using windows-2022 + ClangCL definitely is the trigger for the error. I was able to install the necessary reproduction deps on my Windows dev machine here and can trigger the error as well.
This failure can be seen with the very simple Node-API C code:
#include <assert.h>
#include <node_api.h>
constexpr double val = 200000;
static napi_value Init(napi_env env, napi_value exports) {
napi_status status;
napi_value result;
status = napi_create_double(env, val, &result);
assert(status == napi_ok);
return result;
}
NAPI_MODULE(test, Init)
This module exports a double, and we can see that running it returns garbage:
PS > node -pe "require('./build/Release/test.node')"
7.87488889988e-312
PS > node -pe "require('./build/Release/test.node')"
8.548248570993e-312
PS > node -pe "require('./build/Release/test.node')"
8.59998422933e-312
PS > node -pe "require('./build/Release/test.node')"
1.208837098321e-311
As you pointed out, duplicating the napi_create_double
call results in the expected 200000
returned.
Here's a table with some condensed findings:
Visual Studio | Clang | Success? | Notes |
---|---|---|---|
Visual Studio 16 2019 | Clang 12.0.0 | ✅ | GitHub Action |
Visual Studio 17 2022 | Clang 16.0.5 | ❌ | Personal dev machine |
Visual Studio 17 2022 | Clang 17.0.3 | ❌ | GitHub Action |
I'll bring this up in the next Node API weekly meeting. In the meantime, I'll try to run the test on different Clang versions and see at what point it breaks.
Hi @KevinEady,
I am not sure if the clang version is the cause or if it just uncovers the problem.
I would argue that all the new compiler does is pass the arguments to the node part.
I wonder if printing the value inside napi_create_double
in a custom node build may help determine whether the compiler is the cause.
If the value is correct, and it also fails, it may just surface because the compiler arranges data differently in memory, which is accessed by a wrong pointer or something.
I have modifed node.js with:
napi_status NAPI_CDECL napi_create_double(napi_env env,
double value,
napi_value* result) {
printf("napi_create_double %lg\n", value);
CHECK_ENV_NOT_IN_GC(env);
CHECK_ARG(env, result);
*result =
v8impl::JsValueFromV8LocalValue(v8::Number::New(env->isolate, value));
return napi_clear_last_error(env);
}
And it prints out the same garbage, which is later inside the result.
So it seems like the compiler is not passing value properly. What I do not understand is that it works in a second call.
So I did a little more digging... Adding another piece to the puzzle:
It doesn't seem like it's the compiler version, but maybe how it's being invoked within cmake-js or the underlying MS ClangCL toolset?
When building the native module with cmake-js and ClangCL, we see the error... but if we switch to a "manual" compilation with clang++
, it seems to work fine. I was unable to get clang++ nor clangcl to work with Ninja, so opted for a simple powershell script.
We see the proper value returned for the clang++
module (GitHub Action) but not for the clangcl
module (GitHub Action).
FWIW, cmake and/or cmake-js uses a bunch of flags with ClangCL that i'm not sure of. When I build the generated project manually with msbuild
, we see:
ClCompile:
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\Llvm\x64\bin\clang-cl.exe /c /I"C:\Users\pc\.cmake-js\node-x64\v20.8.0\include\node" /I"C:\Users\pc\Documents\Projects\vcpkg\installed\x64-windows\include" /nologo
/W1 /WX- /diagnostics:column /O2 /Ob2 /D _WINDLL /D _MBCS /D WIN32 /D _WINDOWS /D NDEBUG /D "CMAKE_INTDIR=\"Release\"" /D test_clangcl_EXPORTS /EHsc /MT /GS /fp:precise /GR /Fo"test-clangcl.dir\Release\\" /Gd /TP -m64 D:\tmp\bugnode
addonapi\main.cc "D:\tmp\bugnodeaddonapi\node_modules\cmake-js\lib\cpp\win_delay_load_hook.cc"
Link:
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\Llvm\x64\bin\lld-link.exe /OUT:"D:\tmp\bugnodeaddonapi\build\Release\test-clangcl.node" /INCREMENTAL:NO /LIBPATH:"C:\Users\pc\Documents\Projects\vcpkg\installed\x6
4-windows\lib" /LIBPATH:"C:\Users\pc\Documents\Projects\vcpkg\installed\x64-windows\lib\manual-link" "C:\Users\pc\.cmake-js\node-x64\v20.8.0\win-x64\node.lib" kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib ole32.lib oleaut
32.lib uuid.lib comdlg32.lib advapi32.lib /MANIFEST /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /manifest:embed /PDB:"D:/tmp/bugnodeaddonapi/build/Release/test-clangcl.pdb" /SUBSYSTEM:CONSOLE /DYNAMICBASE /NXCOMPAT /IMPLIB:"D:/t
mp/bugnodeaddonapi/build/Release/test-clangcl.lib" /DLL "test-clangcl.dir\Release\main.obj"
"test-clangcl.dir\Release\win_delay_load_hook.obj"
test-clangcl.vcxproj -> D:\tmp\bugnodeaddonapi\build\Release\test-clangcl.node
Curious, very curious... 🤔
May be a stupid question, but does the clang++ build also use the msvc stdlib?
I have asked chatgpt to explain the compile options (so handle the output with care, often it is wrong...):
/MT: This specifies the use of the multithreaded, static version of the run-time library.
/GS: This option enables buffer security checks (stack buffer overrun detection).
/fp:precise: This specifies precise floating-point semantics.
/GR: This enables RTTI (Run-Time Type Information) support.
/Fo"test-clangcl.dir\Release\\": This specifies the name and location of the object file to be produced.
/Gd: This specifies the calling convention. /Gd specifies cdecl calling convention
I would test the options "/MT" using static multthreading runtime library, the /GS option may be also something, maybe /fp:precise:
or /Gd the calling convention.
I have turned on the debugger and disassembly. I did not come far, since I have to stop now, but I have seen, that before the napi call ""D:\tmp\bugnodeaddonapi\node_modules\cmake-js\lib\cpp\win_delay_load_hook.cc" is invoked. May be this is bad guy, since it is missing in the other compile.
Hi @martenrichter,
I have been able to compile the native addon with command-line using cl
, clang-cl
, and clang++
.
So you are on the right path: using the delay load import feature is causing the undefined behavior.
The linker links with this DELAYIMP.LIB
file in order to use the /DELAYLOAD:NODE.EXE
linker argument. Removing the /DELAYLOAD:NODE.EXE
argument fixes the undefined behavior.
It makes sense that we wouldn't see it with clang++
since there is no similar concept of this.
However, I can't figure out why we see the problem with clang-cl
and not cl
. Passing the exact same arguments to cl
is successful, but to clang-cl
is a failure.
I have found the following bug report:
llvm/llvm-project#51941
It seems to match,, but it is already present in the 2019 release. Anyway, it's a good start for diagnosis.
Nice find @martenrichter ! I've written a bit on that issue. We'll see if it'll gain any traction.
So, I guess you have a few options:
- Not use the ClangCL toolset (which i suppose is not an option, since you mentioned "some libraries that do not build with the standard MSVC compiler")
- Use an in-code workaround (eg. like you did, but maybe place the
napi_create_double()
inside your module registration function? so it's "fixed" globally for your addon?) - Remove the delayload from cmake-js. It looks like cmake-js added the delayload to be compatible with Electron 4.0 on Apr 2019: PR. This might be an option if you do not need to support Electron. You can either fork the module, or see if the maintainer would add (or accept a PR that adds) an option to bypass using delayload.
Thanks for all your help troubleshooting this week! Since this is not an issue with node-addon-api / Node-API itself, I believe we can close this issue.
@KevinEady
Thanks for your help!
Yes, I also think we can close the issue, and I hope it gets some attention, as this is unexpected behavior.
I will recheck 1, but probably one of Google libs, such as quiche, boringssl, or abseil, does not work.
So I will go 2 and see if I can do it more elegantly (I checked the library registration option earlier in the morning). 3 does not seem to be nice. I do not need it, but since I provide a new browser network protocol, there may be others who do (electron is not the only one; in principle, all alternatives to node.js also may need it).
And yes we can close the issue, node-addon-api/ Node-API is innocent, but it is really annoying that clang breaks the stable API.