nodejs/node

[Regression in 14.1.0 - Windows] `stdout` is sometimes empty

nicolo-ribaudo opened this issue · 148 comments

  • Version: 14.1.0
  • Platform: Windows 64-bit on VirtualBox and on Travis CI
  • Subsystem: N/A

Bug description

After the Node.js 14.1.0 release, Babel's CI tests on Windows started failing. The failures are all related to @babel/cli and @babel/node: we run those CLI programs, capture their output (from stdout) and compare it with the expected output.

Sometimes, the generated stdout is empty: you can see an example here or here. In this CI log you can also see some error messages coming from V8 internals in @babel/node tests, but I don't know if it's the same problem.

How often does it reproduce? Is there a required condition?

My guess is that every test has about a 1% chance of failing. However, we have ~100 tests for those 2 packages, so at least one of them fails more often than not (1 − 0.99^100 ≈ 63% per run).

What steps will reproduce the bug?

I couldn't find a small and isolated reproduction example. I'll keep trying to create one, but here is what I have for now.
Also, building Babel on Windows is painful and I couldn't run the full test suite. However, I managed to reproduce the bug.

I'm running these commands using PowerShell, Node.js 14.1.0 and Yarn 1.22:

# Clone the babel repository
git clone https://github.com/babel/babel.git .

# Install deps
yarn
yarn lerna bootstrap # This takes a while

# Build Babel (this won't build everything, but enough packages to show the bug)
yarn gulp build-no-bundle

# Run the @babel/cli and @babel/node tests
yarn jest --runInBand babel-node babel-cli

If you don't see the last command failing, try running it two or three times. The --runInBand option isn't necessary to reproduce the problem, but it disables Jest's workers, so it removes one possible cause.

What is the expected behavior?

Tests should pass

What do you see instead?

Here is an example of a failing run locally:

PS C:\Users\Nicolò\Downloads\babel-master\babel-master> yarn jest --runInBand babel-node babel-cli
yarn run v1.22.4
$ C:\Users\Nicolò\Downloads\babel-master\babel-master\node_modules\.bin\jest --runInBand babel-node babel-cli
 FAIL  packages/babel-cli/test/index.js (33.099s)
   bin/babel  --only glob

    "C:\Users\Nicolò\Downloads\babel-master\babel-master\packages\babel-cli\lib\babel" "--presets" "C:\Users\Nicolò\Downloads\babel-master\babel-master\packages\babel-preset-react" "--plugins" "C:\Users\Nicolò\Downloads\babel-master\babel-master\packages\babel-plugin-transform-arrow-functions,C:\Users\Nicolò\Downloads\babel-master\babel-master\packages\babel-plugin-transform-strict-mode,C:\Users\Nicolò\Downloads\babel-master\babel-master\packages\babel-plugin-transform-modules-commonjs" "src" "--out-dir" "lib" "--only" "**/*.foo.js" "--verbose": expect(received).toBe(expected) // Object.is equality

    Expected: "src/a.foo.js -> lib/a.foo.js
    src/baz/b.foo.js -> lib/baz/b.foo.js
    Successfully compiled 2 files with Babel."
    Received: ""

      85 |       expect(stdout).toContain(expectStdout);
      86 |     } else {
    > 87 |       expect(stdout).toBe(expectStdout);
         |                      ^
      88 |     }
      89 |   } else if (stdout) {
      90 |     throw new Error("stdout:\n" + stdout);

      at assertTest (packages/babel-cli/test/index.js:87:22)
      at ChildProcess.<anonymous> (packages/babel-cli/test/index.js:152:9)

 PASS  packages/babel-node/test/index.js (22.924s)

Test Suites: 1 failed, 1 passed, 2 total
Tests:       1 failed, 89 passed, 90 total
Snapshots:   0 total
Time:        56.088s, estimated 62s
Ran all test suites matching /babel-node|babel-cli/i.
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.

Additional information

  • You can see here how we capture stdout for @babel/cli tests (a rough sketch of the approach follows this list). For @babel/node it's the same.

  • @JLHwung suggested that this might be related to #33058 or #33076

  • The bug is not present when using Node.js 14.0
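
For reference, here is a minimal sketch of the capture approach (this is not the actual Babel test code; the binary path and arguments are just an example):

const { spawn } = require("child_process");

// Spawn the CLI under test and buffer everything it writes to stdout.
const child = spawn(process.execPath, ["packages/babel-cli/lib/babel", "src", "--out-dir", "lib"]);

let stdout = "";
child.stdout.on("data", chunk => { stdout += chunk; });

child.on("close", () => {
  // In the failing runs, `stdout` is unexpectedly "" at this point.
  console.log(JSON.stringify(stdout));
});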

cc: @nodejs/streams @ronag

Can you confirm this is happening only on Windows? This is wildly unexpected.

ronag commented

I can't reproduce on OSX. @nicolo-ribaudo any chance you could bisect? I don't have a Windows machine at the moment.

ronag commented

I'm unsure where to go with the V8 errors. @addaleax maybe?

Yes, I can only reproduce this on Windows, and only the empty stdout (not the V8 errors). I'll try to bisect.

@ronag I’m not sure… Stack traces on Windows are basically garbage unless you take some extra steps (that I don’t know how to take), and given that the error message isn’t particularly helpful here either – there are over 2000 instances of UNREACHABLE(); in V8 – I’m not quite sure where to start here.

/cc @nodejs/v8 @nodejs/platform-windows

Update: this might not be a Node.js bug.

Today I made two local clones of the Babel repository: one ~6 hours ago, where I can consistently reproduce the errors, and one ~2 hours ago, where I cannot reproduce them. This makes me think that maybe it's a bug in a dependency that didn't support Node.js 14.1.0 on Windows, and it was recently updated to fix the bug.

Even if there was a bug in a dependency I don't think that it should trigger the V8 internal error, but I cannot reproduce that one anyway 🤷

I'll try to re-run the build on Travis CI a bunch of times, and see if I can reproduce it there.

ronag commented

This makes me think that maybe it's a bug in a dependency that didn't support Node.js 14.1.0 on Windows, and it was recently updated to fix the bug.

Can you diff the dependencies?

Yeah I'm trying. I'm on a Windows VM and I haven't used Windows for a while, so it might take some time 😛

Also, another thing I noticed: in the folder where I can (consistently!) reproduce the problem, I can only reproduce it when using Node.js 14.1.0. If I use a build from master (4b2d958) it doesn't fail.

ronag commented

If I use a build from master (4b2d958) it doesn't fail.

Yea, a bisect would be very useful.

I gave up for today, I will continue tomorrow.
Also, I noticed that Jest's own tests have the same problem: https://github.com/facebook/jest/pull/9934/checks?check_run_id=633574558 (cc @SimenB)

dr-js commented

Not sure if it's related, but I got win10 + nodejs@14 + babel on CI with a flaky ELIFECYCLE/3221225477 exit code in one of my repos: nodejs/help#2660

Update:

I've reduced the failing CI test case to some of my source files plus babel and 2 babel plugins.
The code is at: https://github.com/dr-js/dr-dev/tree/8e111ec5194adc3e159db2d3bc3594f8587dc230
The latest CI failure: https://github.com/dr-js/dr-dev/actions/runs/92784081

Update 2:

With more debug logging, I found that the exit happens before the @babel/env plugin gets loaded, and after minify-replace and module-resolver have loaded.
Related CI run: https://github.com/dr-js/dr-dev/actions/runs/92845322

Yeah, Jest is failing often on Node 14.1.0 on Windows with empty stdout in one test or the other almost every test run. The V8 thing is actually in the latest build on master: https://github.com/facebook/jest/runs/634830924

Ok, have no idea where to start bisecting. I can still consistently reproduce the failures with the Node.js 14.1.0 installed with the installer, but not if I build 9ffd8e7.

Ok, have no idea where to start bisecting. I can still consistently reproduce the failures with the Node.js 14.1.0 installed with the installer, but not if I build 9ffd8e7.

@nodejs/releasers can you check if there could be any differences between what is available via the installer and what was tagged?

I think there's no way to be sure it's the promoted one, but I can confirm that the latest release build was done with commit 9ffd8e7
Refs: https://ci-release.nodejs.org/job/iojs+release/5995/
The times of the file on the server and in the build logs also match.

@jasnell I know you have a beefy windows machine, could you check this out and run a bisect?

Yeah, Jest is failing often on Node 14.1.0 on Windows with empty stdout in one test or the other almost every test run. The V8 thing is actually in the latest build on master: https://github.com/facebook/jest/runs/634830924

@SimenB This is worrisome. Are you using any wasm inside Jest?

I'm building the release commit on my machine (Win10, VS2019)

@mcollina nope, the only experimental stuff we ship is the VM ESM APIs, but that code is guarded behind a check that vm.SyntheticModule is defined, which is flagged.

EDIT: That said, some dependency might? Seems unlikely, though - node 14.0 behaved fine, and both mac and linux on 14.1 behaves fine
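
Roughly the kind of guard described above (a sketch, not Jest's actual code):

const vm = require("vm");

// vm.SyntheticModule is only exposed when Node is started with
// --experimental-vm-modules, so this branch is normally never taken.
if (typeof vm.SyntheticModule === "function") {
  // ESM-in-vm code path
}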

Well, we'd need a simpler and more reliable way to reproduce this.
I cloned Jest and tried to run e2e\__tests__\toThrowErrorMatchingInlineSnapshot.test.ts multiple times with my local build and the release but it never errors.

That said, some dependency might? Seems unlikely, though - node 14.0 behaved fine, and both mac and linux on 14.1 behaves fine

The V8 stacktrace you posted shows a failure within some code in the wasm namespace in V8.

Hopefully @targos can shed some light on this.

Commits between 14.0.0 and 14.1.0: v14.0.0...v14.1.0
3 of them to deps/v8:

I'm now trying to remove as much code as possible from my local failing clone of Babel to reduce the number of possible error causes.

This seems to be not babel specific: https://github.com/pinojs/pino/runs/642073186. Again, this is consistently flaky :(.

@targos did you run the tests in Linux WSL or using the normal bin? I have a hunch that GitHub CI is using WSL.

I used the normal bin in PowerShell

Would you mind trying out whether it works in WSL?

@mcollina what exactly do you want me to try?
BTW I cloned pino, npm i, npm t with Node.js 14.1.0 on Windows and all tests pass 😕

Can you do the same inside WSL?

No problem inside WSL either.

I think this may only be reproducible when stdout is not a terminal. Is there a simple way to test this on Windows?
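
One quick way to check the TTY vs. pipe difference (my sketch; assuming redirection is enough to make stdout a pipe, and the file name tty-check.js is arbitrary):

// tty-check.js
console.log("stdout is a TTY:", Boolean(process.stdout.isTTY));
console.log("stdout stream type:", process.stdout.constructor.name);

// In PowerShell:
//   node tty-check.js                        -> stdout attached to the terminal
//   node tty-check.js > out.txt; cat out.txt -> stdout redirected (not a TTY)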

Note that which test fails on Jest's CI varies, and not all test runs even have a failure. So I think this might be super hard to nail down a consistent reproduction for 😟

@nicolo-ribaudo given that you are the only one with a repro on your development machine, could you try to isolate which commit is causing the problem (e.g. bisect)?

I can try again, but if I build the release commit with .\vcbuild I can't reproduce the issue.

I can try again, but if I build the release commit with .\vcbuild I can't reproduce the issue.

Oh sorry, then don't bother :(. So this only happens with the binary we have built.

Maybe in 14.1.1, with a new binary, it will magically fix itself 🤷‍♂️

@nicolo-ribaudo are you running the tests in WSL or cmd or powershell? Neither @targos nor myself can repro them.

I'm using powershell.
I don't know if it's possible/legal, but I can try to share my virtualbox image with you.

I ran a test rebuild of the v14.1.0 release commit 9ffd8e7 on the Node.js infrastructure - this could rule out whether anything obscure went wrong with the published release build.

https://nodejs.org/download/test/v14.1.0-test202005049ffd8e7a74/

@nicolo-ribaudo do you know how I can run jest with a custom Node runtime on Windows?

I used path/to/node.exe ./node_modules/jest/bin/jest babel-node babel-cli.

@BethGriggs I'll check it later, thanks!

I don't know if it's relevant to this specific issue, but I did notice that between 13.11 and 13.12, when any stream piping to process.stdout or process.stderr closes, process.stdout/process.stderr end up closing, allowing no further output. This behavior is carried over into 14.x and is present on all platforms...

To give it a quick test, nvm use 13.11 and try...

$ node
Welcome to Node.js v13.11.0
Type ".help" for more information.
> const w = new stream.PassThrough()
undefined
> stream.pipeline(w, process.stdout, () => {})
<truncated>
> w.end('hello')
<truncated>

Then nvm use 13.12 and do the same thing.

You'll see that on 13.11, process.stdout is still usable, while on 13.12 and later, it is not.

/cc @ronag
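
For reference, a script version of the REPL steps above (my sketch based on that description; the 13.11 vs. 13.12+ difference is as described in the comment, not re-verified here):

const stream = require("stream");

// Guard against an uncaught 'error' event in case stdout has already been ended.
process.stdout.on("error", () => {});

const w = new stream.PassThrough();
stream.pipeline(w, process.stdout, () => {
  // Per the comment above: on 13.11 this write still shows up, while on
  // 13.12+ process.stdout has been ended by the pipeline and it does not.
  process.stdout.write("still writable after the pipeline finished\n", err => {
    if (err) console.error("stdout write failed:", err.code);
  });
});
w.end("hello\n");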

@jasnell I don't think that it is related, because the problem reported in this issue started appearing in 14.1.0

I tested the test release and I could still reproduce the bug.
@BethGriggs is there any flag I should use to produce a release-like build locally?


Also, I tried to think about what my env has in common with CI servers, to understand why I'm the only one that can reproduce the bug.
The only thing I noticed is that in this VM I have very little available disk space (1.5GB), and it might be similar on CI.

I can replicate the error by running babel tests with the shipped release. I cannot replicate it by manually building the release commit.

I can confirm there is something odd happening here.

For Windows builds our Release CI calls vcbuild.bat build-release x64 upload. This is building on Windows 2012 R2 (Rackspace) with Visual Studio 2019.

11:46:03 c:\ws>vcbuild.bat build-release x64 upload 
11:46:03 Looking for Python
11:46:03 Python found in C:\Python27\\python.exe
11:46:03 Looking for NASM
11:46:03 Looking for Visual Studio 2019
11:46:03 Looking for WiX installation for Visual Studio 2019...
11:46:03 calling: "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\\Auxiliary\Build\vcvarsall.bat" amd64
11:46:04 **********************************************************************
11:46:04 ** Visual Studio 2019 Developer Command Prompt v16.5.4
11:46:04 ** Copyright (c) 2019 Microsoft Corporation
11:46:04 **********************************************************************
11:46:04 [vcvarsall.bat] Environment initialized for: 'x64'
11:46:04 Found MSVS version 16.0
11:46:04 configure  --with-ltcg --release-urlbase=https://nodejs.org/download/test/ "--download=all" --with-intl=full-icu --dest-cpu=x64 --tag=test202005049ffd8e7a74
11:46:11 Node.js configure: Found Python 2.7.18...
11:46:11 Warning: Missing input files:
11:46:11 tools\msvs\genfiles\node_etw_provider.rc
11:46:11 tools\msvs\genfiles\node_etw_provider.h
11:46:11 INFO: configure completed successfully

@jasnell FYI the babel test error is not reproducible on either Node.js 13.12 or 13.13: https://travis-ci.com/github/JLHwung/babel/builds/163828049

Unfortunately that's what I suspected :-(

@mcollina can you please try moving the binary you built from the commit to a different folder and running the tests again?

@mcollina can you please try moving the binary you built from the commit to a different folder and running the tests again?

Same


@BethGriggs I am using Visual Studio 2017. I'm trying to run vcbuild.bat build-release x64 but it cannot find Wix. I do not know how to install that :/, there are no guides linked in BUILDING.md.

I don't think build-release creates a different node.exe but I might be wrong. Wix is needed to create the .msi installer but that's separate from creating node.exe

bzoz commented

build-release enables LTCG

I can confirm that the releases of 14.2.0 do not solve this issue.

@bzoz can you point me to a doc to install Wix so that vcbuild can find it?

@mcollina try .\vcbuild.bat release (no need for Wix). That should enable LTCG

try .\vcbuild.bat release (no need for Wix). That should enable LTCG

This build is not causing any errors either.

@mcollina could you give me the steps you followed to reproduce the bug on your machine with the release build? I have the same version of Visual Studio as CI on my machine, so we could at least eliminate that difference if I'm able to reproduce.

# Clone the babel repository
git clone https://github.com/babel/babel.git .

# Install deps
yarn
yarn lerna bootstrap # This takes a while

# Build Babel (this won't build everything, but enough packages to show the bug)
yarn gulp build-no-bundle

# Run tests with custom node
path\to\node.exe node_modules\jest\bin\jest babel-node babel-cli

Using the release node.exe fails, while using the one I built does not.

(I download the zip file with the binary)

One thing that might be worth testing is running with --runInBand. Without this flag Jest will spawn processes to parallelize test runs. The flag disables this behavior and runs everything in the main process. Might eliminate one source of the error? (If you only have a single core the tests will already run in band, so this might not be applicable to VMs and such, depending on how accurate os.cpus() is in those cases.)

Process/worker abstraction, for reference: https://github.com/facebook/jest/tree/master/packages/jest-worker

I still can't reproduce using the official release (with or without --runInBand) :(

PS D:\Git\forks\babel> node .\node_modules\jest\bin\jest.js --runInBand babel-node babel-cli
 PASS  packages/babel-cli/test/index.js (49.684s)
 PASS  packages/babel-node/test/index.js (33.254s)

Test Suites: 2 passed, 2 total
Tests:       90 passed, 90 total
Snapshots:   0 total
Time:        83.059s, estimated 84s
Ran all test suites matching /babel-node|babel-cli/i.

The fact that we cannot do a build that causes a failure on a machine worries me a lot.

Do you think it would be possible to do a manual git bisect through the build server? I can download the builds and run them on my machine to see which one causes the failure.

@mcollina sure, let's do that. So 14.0.0 works?

I do not have permission to access that Jenkins instance.

14.0.0 works?

Yes.

I'll post links to the downloads when they are ready.

@mcollina I did an rc.0 build for v14.1.0 on f878917 (that was prior to pulling in additional requested V8 and streams PR's IIRC). It might be worth trying - https://nodejs.org/download/rc/v14.1.0-rc.0/

I'm also thinking: could this in some way be a VS 2019 vs VS 2017 issue? Just based on the release machine and @targos using VS 2019 and not being able to reproduce the issue.

Also ping @nodejs/build in case something has changed with our Windows release infrastructure recently

@mcollina Yarn might not work with those binaries if they have a prerelease version. If it doesn't, you need to run the first commands with 14.1.0 or 14.0.0, and then only jest with the test build.

I'll also test those binaries so that we don't risk a false negative, since the failures are flaky.

Yarn might not work with those binaries if they have a prerelease version. If it doesn't, you need to run the first commands with 14.1.0 or 14.0.0, and then only jest with the test build.

That's what I have been doing. I got a nice crack on this.

@BethGriggs the test build is finished. Do I need to do something to publish it to the website?

Edit: trying again with a datestring: https://ci-release.nodejs.org/job/iojs+release/6013/
It says that it's only needed for nightly but I see you provided it yesterday too.

I'm also thinking: could this in some way be a VS 2019 vs VS 2017 issue?

@BethGriggs that's my working theory, however it seems a bit of a long shot. Have we made any changes to the Windows machines between 14.0.0 and 14.1.0?

I did an rc.0 build for v14.1.0 on f878917 (that was prior to pulling in additional requested V8 and streams PR's IIRC). It might be worth trying - https://nodejs.org/download/rc/v14.1.0-rc.0/

This still fails.

Does this fail with any of the nightlies (if it does then bisecting might be quicker if we can narrow down a passing/failing set of builds)? https://nodejs.org/download/nightly/

@targos recent changes to our caching means it can take up to ~30minutes for a build to be available on our download server, so you may hit that.

Does this fail with any of the nightlies (if it does then bisecting might be quicker if we can narrow down a passing/failing set of builds)? https://nodejs.org/download/nightly/

This is not possible because of some other semver-major changes in master that are actually breaking babel. It's emitting some deprecation warnings that make the babel tests fail, so it's very hard to get a yes/no answer.

I'm testing the nightly versions:

"works" = "passed 4 attempts"

@mcollina I disabled the stderr assertion, so that the warning doesn't break the tests. If you want to try it, you can comment out the if starting with if (opts.stderr) { in packages/babel-cli/test/index.js and packages/babel-node/test/index.js

@BethGriggs It's been almost an hour and the build is still in staging. Did your test from yesterday get moved automatically?

I'm testing the nightly versions:

  • 24-Apr-2020 (https://nodejs.org/download/nightly/v15.0.0-nightly202004240619b600b2/) - works

  • 26-Apr-2020 (https://nodejs.org/download/nightly/v15.0.0-nightly20200426f8d5474839/) - broken

"works" = "passed 4 attempts"

These are the commits between those two nightlies:
0619b60...f8d5474

I'm downloading 25-Apr-2020 to reduce that range.

EDIT: 25-Apr-2020 works.

My guesstimates are ab7d9db or ab7d9db. They were the likely culprits because they touched relevant parts of stdout. @ronag could you check as well?

However, the biggest question remains: why can't I reproduce this by building locally?

@BethGriggs It's been almost an hour and the build is still in staging. Did your test from yesterday get moved automatically?

@targos I think your first build might not be promoted due to the promotion script expecting a specific format - nodejs/build#2309 (comment). My test build uploaded itself after around 30 minutes.

Got it!

@mcollina The first build of this interactive bisect session is here: https://direct.nodejs.org/download/test/v14.0.1-test20200505f92b398c76/
Tell me the result if you want to continue :)

ronag commented

My guesstimates are ab7d9db or ab7d9db

Those two are the same, I think you meant two different ones?

003fb53 and ab7d9db

Next build is on commit 23eda41: direct.nodejs.org/download/test/v14.0.1-test2020050523eda417b6

Fail

Next build is on commit 1af08e1: https://direct.nodejs.org/download/test/v14.0.1-test202005051af08e1c5e/

Still building, but I'm going to bed. Expect the link to work in ~45 minutes.
Let's continue tomorrow!

Hey, I'm ready to build more as needed :)

Next build is on commit 4143c74: https://direct.nodejs.org/download/test/v14.0.1-test202005064143c747fc/

(only one revision left to test after that)

Next build is on commit 4143c74: direct.nodejs.org/download/test/v14.0.1-test202005064143c747fc

Fail

I'm very confused (if I haven't missed any pass/fail, the bug is in the second or third commit of 737bd62...4143c74).

4143c74 is the first bad commit
commit 4143c74
Author: Gus Caplan
Date: Tue Apr 21 16:43:58 2020 -0500

vm: add importModuleDynamically option to compileFunction

Fixes: #31860

PR-URL: #32985

I guess all of us who tried to guess the bad commit are bad guessers 😛

@mcollina Could you confirm that #33166 (comment) passes and #33166 (comment) fails?

That one's backported to v13, fwiw, so seems unlikely 😕 That said, Jest makes heavy use of compileFunction (although not that new option yet) and vm in general
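
For context on the API that commit touches: vm.compileFunction compiles source text into a callable function (Jest makes heavy use of it, per the comment above), and the suspect commit adds an importModuleDynamically option to it. A small usage sketch, unrelated to the bug itself:

const vm = require("vm");

// Compile a function body with named parameters and call it.
const add = vm.compileFunction("return a + b;", ["a", "b"], {
  filename: "example.js",
});

console.log(add(1, 2)); // 3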