[node-core-library] LockFile.acquire fails irregularly when multiple processes try to acquire lock
andrew--r opened this issue · 3 comments
Summary
Hi! A little context: we have a Rush-managed monorepo with multiple projects. For most of the projects, the build step involves calling a certain external API. Multiple parallel calls to this API lead to API rate limit errors, so we wrap those calls with LockFile from @rushstack/node-core-library to prevent making multiple calls in parallel.
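For context on the pattern: the idea is to serialize work across processes via an on-disk lock. Our real code uses LockFile.acquire, but a minimal stdlib-only stand-in looks roughly like this (function names here are made up for illustration):

```typescript
import * as fs from 'fs';

// A minimal stand-in for the lockfile pattern, using only Node's fs:
// the 'wx' flag makes openSync fail with EEXIST if the file already exists,
// so only one process at a time can hold the lock.
function tryAcquire(lockPath: string): number | undefined {
  try {
    return fs.openSync(lockPath, 'wx'); // exclusive create
  } catch (error) {
    if ((error as NodeJS.ErrnoException).code === 'EEXIST') {
      return undefined; // another process holds the lock
    }
    throw error;
  }
}

function release(fd: number, lockPath: string): void {
  fs.closeSync(fd);
  fs.unlinkSync(lockPath); // releasing deletes the lockfile from disk
}
```

Note that releasing deletes the lockfile, which matters for the race described below.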
The problem is that sometimes our CI fails with the following error:
Stack trace
[2024-01-24T11:09:59.217Z] node:internal/fs/sync:58
[2024-01-24T11:09:59.217Z] const stats = binding.statSync(
[2024-01-24T11:09:59.217Z] ^
[2024-01-24T11:09:59.217Z]
[2024-01-24T11:09:59.217Z] Error: File does not exist: /home/ubuntu/workspace/web-monorepo/packages/joom-crowdin/src/lib/lock/joom-crowdin#2650.lock
[2024-01-24T11:09:59.217Z] ENOENT: no such file or directory, stat '/home/ubuntu/workspace/web-monorepo/packages/joom-crowdin/src/lib/lock/joom-crowdin#2650.lock'
[2024-01-24T11:09:59.217Z] at Object.stat (node:internal/fs/sync:58:25)
[2024-01-24T11:09:59.217Z] at Object.statSync (node:fs:1669:17)
[2024-01-24T11:09:59.217Z] at Object.<anonymous> (/home/ubuntu/workspace/web-monorepo/common/temp/node_modules/.pnpm/graceful-fs@4.2.10/node_modules/graceful-fs/polyfills.js:319:16)
[2024-01-24T11:09:59.217Z] at /home/ubuntu/workspace/web-monorepo/common/temp/node_modules/.pnpm/@rushstack+node-core-library@3.61.0_@types+node@20.9.0/node_modules/@rushstack/node-core-library/lib/FileSystem.js:152:24
[2024-01-24T11:09:59.217Z] at FileSystem._wrapException (/home/ubuntu/workspace/web-monorepo/common/temp/node_modules/.pnpm/@rushstack+node-core-library@3.61.0_@types+node@20.9.0/node_modules/@rushstack/node-core-library/lib/FileSystem.js:1050:20)
[2024-01-24T11:09:59.217Z] at FileSystem.getStatistics (/home/ubuntu/workspace/web-monorepo/common/temp/node_modules/.pnpm/@rushstack+node-core-library@3.61.0_@types+node@20.9.0/node_modules/@rushstack/node-core-library/lib/FileSystem.js:151:27)
[2024-01-24T11:09:59.217Z] at LockFile._tryAcquireMacOrLinux (/home/ubuntu/workspace/web-monorepo/common/temp/node_modules/.pnpm/@rushstack+node-core-library@3.61.0_@types+node@20.9.0/node_modules/@rushstack/node-core-library/lib/LockFile.js:256:64)
[2024-01-24T11:09:59.217Z] at LockFile.tryAcquire (/home/ubuntu/workspace/web-monorepo/common/temp/node_modules/.pnpm/@rushstack+node-core-library@3.61.0_@types+node@20.9.0/node_modules/@rushstack/node-core-library/lib/LockFile.js:202:29)
[2024-01-24T11:09:59.217Z] at retryLoop (/home/ubuntu/workspace/web-monorepo/common/temp/node_modules/.pnpm/@rushstack+node-core-library@3.61.0_@types+node@20.9.0/node_modules/@rushstack/node-core-library/lib/LockFile.js:224:35)
[2024-01-24T11:09:59.217Z] at LockFile.acquire (/home/ubuntu/workspace/web-monorepo/common/temp/node_modules/.pnpm/@rushstack+node-core-library@3.61.0_@types+node@20.9.0/node_modules/@rushstack/node-core-library/lib/LockFile.js:234:16) {
[2024-01-24T11:09:59.217Z] errno: -2,
[2024-01-24T11:09:59.217Z] code: 'ENOENT',
[2024-01-24T11:09:59.217Z] syscall: 'stat',
[2024-01-24T11:09:59.217Z] path: '/home/ubuntu/workspace/web-monorepo/packages/joom-crowdin/src/lib/lock/joom-crowdin#2650.lock'
[2024-01-24T11:09:59.217Z] }
[2024-01-24T11:09:59.217Z]
[2024-01-24T11:09:59.217Z] Node.js v20.9.0
[2024-01-24T11:09:59.217Z] Returned error code: 1
Repro steps
I was not able to reproduce this error reliably; it happens randomly. Besides the failures in CI (which runs on Linux), I was able to catch it a few times on macOS.
Here is the code that I used to reproduce the error: https://gist.github.com/andrew--r/6e45f2b1661cabec9ea7d0d0bb972e50
Update: see reproduction steps
Details
The stack trace points into LockFile._tryAcquireMacOrLinux, at the FileSystem.getStatistics call.
As I understand it, there is some logic that deletes other processes' lockfiles under certain conditions:
rushstack/libraries/node-core-library/src/LockFile.ts
Lines 329 to 334 in a1c8333
My hypothesis is that the lockfile is deleted by another process between the calls to lockFileHandle.write(startTime) and FileSystem.getStatistics(pidLockFilePath), which leads to the error, but I'm not entirely sure about it. Either way, I think wrapping these lines in a try/catch (like here) and returning from the catch should solve the problem.
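To illustrate the proposed fix pattern with a self-contained stdlib sketch (statIfExists is a hypothetical helper name; the actual change would go inside LockFile._tryAcquireMacOrLinux):

```typescript
import * as fs from 'fs';

// Hypothetical helper illustrating the proposed fix: if the stat fails with
// ENOENT, the competing lockfile was deleted concurrently, so report "no such
// lock" instead of crashing; rethrow anything else.
function statIfExists(filePath: string): fs.Stats | undefined {
  try {
    return fs.statSync(filePath);
  } catch (error) {
    if ((error as NodeJS.ErrnoException).code === 'ENOENT') {
      return undefined; // lockfile disappeared between the scan and the stat
    }
    throw error;
  }
}
```

The caller would treat undefined as "the other process released its lock" and continue the acquisition loop.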
Update: see detailed description of the race condition causing error
Standard questions
Please answer these questions to help us investigate your issue more quickly:
Question | Answer
---|---
Package name: | @rushstack/node-core-library
Package version? | 3.61.0
Operating system? | Mac / Linux
Would you consider contributing a PR? | Yes
Node.js version (`node -v`)? | 20.9.0
You could also consider wrapping it in Async.runWithRetriesAsync.
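I haven't confirmed the exact signature of Async.runWithRetriesAsync here, but the retry pattern it provides is roughly the following (stdlib-only sketch with hypothetical parameter names):

```typescript
// A generic retry loop in the spirit of Async.runWithRetriesAsync: re-run the
// operation after transient failures (such as the ENOENT above), giving up and
// rethrowing once maxRetries extra attempts are exhausted.
async function runWithRetries<TResult>(
  operation: () => Promise<TResult>,
  maxRetries: number,
  retryDelayMs: number = 0
): Promise<TResult> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      if (attempt >= maxRetries) {
        throw error; // retries exhausted; surface the last error
      }
      if (retryDelayMs > 0) {
        await new Promise((resolve) => setTimeout(resolve, retryDelayMs));
      }
    }
  }
}
```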
Would you be interested in putting together a PR to fix this?
Sure, I’ll make a PR this week!
I’ve dug a bit deeper into the problem, and I think I’ve managed to find the root cause.
The problem occurs when at least two processes repeatedly try to acquire the same lock, e.g. when each process is working through a task queue.
Reproduction steps:
1. The first process acquires the lock and starts executing something (e.g. the current task).
2. The second process simultaneously tries to acquire the same lock: it writes its lockfile, …, and scans the directory in search of other lockfiles.
3. Immediately afterwards, the first process's lock is released, which deletes the first lockfile.
4. The second process examines each file found in step 2. The first lockfile no longer exists, so it fails to retrieve info about the first process's lock, and the error is silently swallowed.
5. The first process takes the next task from the queue and starts acquiring a new lock, writing a new lockfile.
6. The second process acts on its obsolete data about the first process's lockfile and deletes it, assuming the first process is no longer running but its lockfile still exists for some reason.
7. The first process tries to retrieve stats for its new lockfile, which the second process deleted in the previous step, which finally leads to the ENOENT error.
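The first half of this interleaving (through the swallowed ENOENT in step 4) can be replayed deterministically in a single process by performing both processes' filesystem steps by hand; the fatal error in step 7 has the same shape. Stdlib-only simulation, with a made-up lockfile name:

```typescript
import * as fs from 'fs';
import * as os from 'os';
import * as path from 'path';

// Deterministic single-process replay of the race: "process 1" owns
// proc-1.lock, "process 2" takes a directory snapshot, and the deletion
// lands between the snapshot and the stat call.
const lockDir = fs.mkdtempSync(path.join(os.tmpdir(), 'lockfile-race-'));
const firstLock = path.join(lockDir, 'proc-1.lock');

// Steps 1-2: process 1 holds its lock; process 2 scans the directory.
fs.writeFileSync(firstLock, String(Date.now()));
const staleListing = fs.readdirSync(lockDir); // process 2's snapshot

// Step 3: process 1 releases its lock, deleting the lockfile.
fs.unlinkSync(firstLock);

// Step 4: process 2 stats each file from its now-stale snapshot.
let caught: NodeJS.ErrnoException | undefined;
try {
  fs.statSync(path.join(lockDir, staleListing[0]));
} catch (error) {
  caught = error as NodeJS.ErrnoException;
}
console.log(caught?.code); // prints "ENOENT"
```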