crutchcorn/cli-testing-library

fireEvent.sigkill fails on some CI systems

crutchcorn opened this issue · 23 comments

While this code runs on all the same versions of Node on Windows within CI, it seems to error out on platforms like Windows.

Log:

https://github.com/crutchcorn/cli-testing-library/runs/4339953929?check_suite_focus=true

It appears to be a permissions issue of being unable to taskkill (or similar) a PID on Windows machines on GH Actions CI. This makes sense to me, as it may introduce vulnerabilities.

As a temporary workaround, I will be removing that test from CI and simply doing tests manually on Windows myself. Please let us know if there's a better way you know of

OK I'm stumped. I've tried for hours and hours now with no avail. It seems to work fine in CircleCI, but not GitHub actions.

I'm going to mark this as "help needed" and hope the ecosystem has some kind of solution

Note: This library works fine for Windows itself. I will be manually testing everything on my machine to make sure it works and continue to write Windows-specific tests

OK it's not just sigkill and it's not just Windows

https://github.com/crutchcorn/cli-testing-library/actions/runs/1558975085

There is some timing issue somewhere, and I don't think it's our code, but I could be wrong.

I fixed the timing issue (Node 14 seems to have changed something with spawn behavior that Node 12 doesn't have)

Sure enough, both Windows and Ubuntu tests pass now

https://github.com/crutchcorn/cli-testing-library/runs/4522732155?check_suite_focus=true

https://github.com/crutchcorn/cli-testing-library/runs/4522795843?check_suite_focus=true

We're running into this issue again in Plop as Windows devices are failing again.

I'm willing to pay a bounty of $50 if anyone is able to solve these issues well enough that Windows CI systems do not fail on GitHub actions on the Plop monorepo 5x in a row.

/bounty $50

💎 $50 bounty created by crutchcorn
🙋 If you start working on this, comment /attempt #3 to notify everyone
👉 To claim this bounty, submit a pull request that includes the text /claim #3 somewhere in its body
📝 Before proceeding, please make sure you can receive payouts in your country
💵 Payment arrives in your account 2-5 days after the bounty is rewarded
💯 You keep 100% of the bounty award
🙏 Thank you for contributing to crutchcorn/cli-testing-library!

Attempt Started (GMT+0) Solution
🟢 @smartbizlord Dec 31, 2023, 5:42:05 AM WIP
🟢 @Waldeedle Jan 1, 2024, 8:13:43 PM #26
Drex72 commented

OK I'm stumped. I've tried for hours and hours now with no avail. It seems to work fine in CircleCI, but not GitHub actions.

I'm going to mark this as "help needed" and hope the ecosystem has some kind of solution

Note: This library works fine for Windows itself. I will be manually testing everything on my machine to make sure it works and continue to write Windows-specific tests

Hey @crutchcorn Is it possible for me to get the config that you used for circle ci?

@Drex72 it might be in the commit history. Otherwise I wouldn't know where it is :( Sorry I'm not more help

Drex72 commented

@Drex72 it might be in the commit history. Otherwise I wouldn't know where it is :( Sorry I'm not more help

Alright, no problem
that's cool
i'd check for it
i cant see the logs because it says it doesnt exist so i want to create a repro :)

Failing job in Plop
https://github.com/plopjs/plop/actions/runs/7304409077/job/19906537229 windows-latest, Node.js 20.x

plop:test
  cache miss, executing b00a87c8e47ca0af
  
  > plop@4.0.1 test:instrument
  > nyc instrument ./bin ./instrumented/bin && nyc instrument ./src ./instrumented/src && cp package.json ./instrumented
  
  Browserslist: caniuse-lite is outdated. Please run:
    npx browserslist@latest --update-db
    Why you should do it regularly: https://github.com/browserslist/browserslist#browsers-data-updating
  Browserslist: caniuse-lite is outdated. Please run:
    npx browserslist@latest --update-db
    Why you should do it regularly: https://github.com/browserslist/browserslist#browsers-data-updating
  
  > plop@4.0.1 vitest
  > vitest run
  
  
   RUN  v1.1.0 D:/a/plop/plop/packages/plop
  
   ✓ tests/actions.spec.js  (6 tests | 4 skipped) 6435ms
   ✓ tests/wrapper.spec.js  (5 tests) 15245ms
   ✓ tests/esm.spec.js  (4 tests) 11460ms
   ✓ tests/action-failure.spec.js  (1 test) 3675ms
   ✓ tests/typescript.spec.js  (1 test) 2179ms
   ✓ tests/input-processing.spec.js  (16 tests | 1 skipped) 35128ms
  
  ⎯⎯⎯⎯⎯⎯ Unhandled Errors ⎯⎯⎯⎯⎯⎯
  
  Vitest caught 1 unhandled error during the test run.
  This might cause false positive tests. Resolve unhandled errors to make sure your tests are not affected.
  
  ⎯⎯⎯⎯ Unhandled Rejection ⎯⎯⎯⎯⎯
  Error: Command failed: taskkill /pid 1544 /T /F
  ERROR: The process with PID 5052 (child process of PID 1544) could not be terminated.
  Reason: The operation attempted is not supported.
  
  ERROR: The process with PID 1544 (child process of PID 5[72](https://github.com/plopjs/plop/actions/runs/7304409077/job/19906537229#step:5:74)8) could not be terminated.
  Reason: The operation attempted is not supported.
  
  
   ❯ ChildProcess.exithandler node:child_process:422:12
   ❯ ChildProcess.emit node:events:517:28
   ❯ maybeClose node:internal/child_process:1098:16
   ❯ Process.ChildProcess._handle.onexit node:internal/child_process:303:5
  
  ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯
  Serialized Error: { code: 128, killed: false, signal: null, cmd: 'taskkill /pid 1544 /T /F' }
  This error originated in "tests/input-processing.spec.js" test file. It doesn't mean the error was thrown inside the file itself, but while it was running.
  The latest test that might've caused the error is "Should bypass input prompt with name". It might mean one of the following:
  - The error was thrown, while Vitest was running this test.
  - This was the last recorded test before the error was thrown, if error originated after test finished its execution.
  ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯
  
   Test Files  6 passed (6)
        Tests  28 passed | 5 todo (33)
       Errors  1 error
     Start at  22:39:53
     Duration  36.38s (transform 179ms, setup 1.64s, collect 657ms, tests [74](https://github.com/plopjs/plop/actions/runs/7304409077/job/19906537229#step:5:76).12s, environment 2ms, prepare 1.[78](https://github.com/plopjs/plop/actions/runs/7304409077/job/19906537229#step:5:80)s)
  
  npm ERR! Lifecycle script `vitest` failed with error: 
  npm ERR! Error: command failed 
  npm ERR!   in workspace: plop@4.0.1 
  npm ERR!   at location: D:\a\plop\plop\packages\plop 
  Error:  command finished with error: command (D:\a\plop\plop\packages\plop) C:\Users\RUNNER~1\AppData\Local\Temp\xfs-42d[81](https://github.com/plopjs/plop/actions/runs/7304409077/job/19906537229#step:5:83)1ce\yarn.cmd run test exited (1)
Error: plop#test: command (D:\a\plop\plop\packages\plop) C:\Users\RUNNER~1\AppData\Local\Temp\xfs-42d811ce\yarn.cmd run test exited (1)

 Tasks:    1 successful, 2 total
Cached:    0 cached, 2 total
  Time:    54.193s 
Failed:    plop#test

 ERROR  run failed: command  exited (1)
Error: Process completed with exit code 1.

@crutchcorn just want to be clear, you are not seeing this issue for yourself locally? I was able to reproduce it on my local windows 11 desktop

@Waldeedle INTERESTING. I, in fact, was not able to reproduce it locally even recently on the Plop repo.

@crutchcorn try cloning the repo to a new folder and just run the exact commands from your test workflow:
I think your hunch is right, I wonder if turbo is not handling child processes correctly because I can only reproduce it on the initial clone of the repo, any subsequent run just works.
EDIT: A process that could not be killed from the first run continues to run in the background throughout these subsequent runs:
image

Here's some copy-pasta for your convenience

git clone git@github.com:plopjs/plop.git
cd plop/
yarn install --frozen-lockfile
yarn test

@Waldeedle it's interesting it mentions Microsoft AV. What if you turn it off (temporarily) for running? [Only if you trust it, feel free to read the source]

I have mine disabled in favor of a different antivirus and maybe that could be part of the root.

@crutchcorn assuming you meant from my screenshot? Sorry should've clarified that this was from my task manager, its going to show all processes, I will switch over to process explorer so I can see the related processes from the test

/attempt #3
I see where it is erroring out, and believe I have a potential solution.
Before I create a PR, could I get context around, this particular if statement, why are you attempting to kill a process that has already exited/doesn't exist?

if (err.message.includes('could not be terminated') && err.message.includes('There is no running instance of the task.'))
Options

Note

The user @smartbizlord is already attempting to complete issue #3 and claim the bounty. We recommend checking in on @smartbizlord's progress, and potentially collaborating, before starting a new solution.

I see where it is erroring out, and believe I have a potential solution.
Before I create a PR, could I get context around, this particular if statement, why are you attempting to kill a process that has already exited/doesn't exist?

if (err.message.includes('could not be terminated') && err.message.includes('There is no running instance of the task.'))

Truthfully I don't remember. I think this was an attempt to fix the problem at hand or work around windows specifically but idk

💡 @Waldeedle submitted a pull request that claims the bounty. You can visit your bounty board to reward.

@Waldeedle: Your claim has been rewarded! 👉 Complete your Algora onboarding to collect the bounty.

@Waldeedle thanks so much for contributing to this project and helping us out! Great work!

@crutchcorn anytime, thanks for the brain teaser! 👏