microsoft/rushstack

[node-core-library] LockFile.acquire fails irregularly when multiple processes try to acquire lock

andrew--r opened this issue · 3 comments

Summary

Hi! A little context: we have a Rush-managed monorepo with multiple projects. For most of the projects, the build step involves calling a certain external API. Multiple parallel calls to this API hit the API's rate limit, so we are wrapping those calls with LockFile from @rushstack/node-core-library to prevent making multiple calls in parallel.
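To make the setup concrete, here is a minimal sketch of how we wrap the call (the lock folder and `callExternalApi` are illustrative placeholders, not our exact code):

```ts
import { LockFile } from '@rushstack/node-core-library';

// Illustrative lock folder; in our repo it lives next to the build script.
const lockFolder: string = `${__dirname}/lock`;

async function callExternalApiSerialized(): Promise<void> {
  // LockFile.acquire() resolves once this process holds the named lock.
  const lock: LockFile = await LockFile.acquire(lockFolder, 'joom-crowdin');
  try {
    await callExternalApi(); // hypothetical: the rate-limited API call
  } finally {
    // Releasing the lock deletes the pidfile so other processes can proceed.
    lock.release();
  }
}
```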

The problem is that sometimes our CI fails with the following error:

Stack trace
[2024-01-24T11:09:59.217Z] node:internal/fs/sync:58
[2024-01-24T11:09:59.217Z]   const stats = binding.statSync(
[2024-01-24T11:09:59.217Z]                         ^
[2024-01-24T11:09:59.217Z] 
[2024-01-24T11:09:59.217Z] Error: File does not exist: /home/ubuntu/workspace/web-monorepo/packages/joom-crowdin/src/lib/lock/joom-crowdin#2650.lock
[2024-01-24T11:09:59.217Z] ENOENT: no such file or directory, stat '/home/ubuntu/workspace/web-monorepo/packages/joom-crowdin/src/lib/lock/joom-crowdin#2650.lock'
[2024-01-24T11:09:59.217Z]     at Object.stat (node:internal/fs/sync:58:25)
[2024-01-24T11:09:59.217Z]     at Object.statSync (node:fs:1669:17)
[2024-01-24T11:09:59.217Z]     at Object.<anonymous> (/home/ubuntu/workspace/web-monorepo/common/temp/node_modules/.pnpm/graceful-fs@4.2.10/node_modules/graceful-fs/polyfills.js:319:16)
[2024-01-24T11:09:59.217Z]     at /home/ubuntu/workspace/web-monorepo/common/temp/node_modules/.pnpm/@rushstack+node-core-library@3.61.0_@types+node@20.9.0/node_modules/@rushstack/node-core-library/lib/FileSystem.js:152:24
[2024-01-24T11:09:59.217Z]     at FileSystem._wrapException (/home/ubuntu/workspace/web-monorepo/common/temp/node_modules/.pnpm/@rushstack+node-core-library@3.61.0_@types+node@20.9.0/node_modules/@rushstack/node-core-library/lib/FileSystem.js:1050:20)
[2024-01-24T11:09:59.217Z]     at FileSystem.getStatistics (/home/ubuntu/workspace/web-monorepo/common/temp/node_modules/.pnpm/@rushstack+node-core-library@3.61.0_@types+node@20.9.0/node_modules/@rushstack/node-core-library/lib/FileSystem.js:151:27)
[2024-01-24T11:09:59.217Z]     at LockFile._tryAcquireMacOrLinux (/home/ubuntu/workspace/web-monorepo/common/temp/node_modules/.pnpm/@rushstack+node-core-library@3.61.0_@types+node@20.9.0/node_modules/@rushstack/node-core-library/lib/LockFile.js:256:64)
[2024-01-24T11:09:59.217Z]     at LockFile.tryAcquire (/home/ubuntu/workspace/web-monorepo/common/temp/node_modules/.pnpm/@rushstack+node-core-library@3.61.0_@types+node@20.9.0/node_modules/@rushstack/node-core-library/lib/LockFile.js:202:29)
[2024-01-24T11:09:59.217Z]     at retryLoop (/home/ubuntu/workspace/web-monorepo/common/temp/node_modules/.pnpm/@rushstack+node-core-library@3.61.0_@types+node@20.9.0/node_modules/@rushstack/node-core-library/lib/LockFile.js:224:35)
[2024-01-24T11:09:59.217Z]     at LockFile.acquire (/home/ubuntu/workspace/web-monorepo/common/temp/node_modules/.pnpm/@rushstack+node-core-library@3.61.0_@types+node@20.9.0/node_modules/@rushstack/node-core-library/lib/LockFile.js:234:16) {
[2024-01-24T11:09:59.217Z]   errno: -2,
[2024-01-24T11:09:59.217Z]   code: 'ENOENT',
[2024-01-24T11:09:59.217Z]   syscall: 'stat',
[2024-01-24T11:09:59.217Z]   path: '/home/ubuntu/workspace/web-monorepo/packages/joom-crowdin/src/lib/lock/joom-crowdin#2650.lock'
[2024-01-24T11:09:59.217Z] }
[2024-01-24T11:09:59.217Z] 
[2024-01-24T11:09:59.217Z] Node.js v20.9.0
[2024-01-24T11:09:59.217Z] Returned error code: 1

Repro steps

I was not able to reproduce this error reliably; it happens randomly. Besides the failures in CI (which runs on Linux), I was able to catch it a few times on macOS.

Here is the code that I used to reproduce the error: https://gist.github.com/andrew--r/6e45f2b1661cabec9ea7d0d0bb972e50

Update: see the reproduction steps below.

Details

The stack trace leads to this line of code:

const currentBirthTimeMs: number = FileSystem.getStatistics(pidLockFilePath).birthtime.getTime();

As I understand it, there is some logic that deletes other processes' lockfiles under certain conditions:

// this means the process is no longer executing, delete the file
if (!otherPidCurrentStartTime || otherPidOldStartTime !== otherPidCurrentStartTime) {
// console.log(`Other pid ${otherPid} is no longer executing!`);
FileSystem.deleteFile(fileInFolderPath);
continue;
}

My hypothesis is that the lockfile is deleted by another process between the calls to lockFileHandle.write(startTime) and FileSystem.getStatistics(pidLockFilePath), which leads to the error, but I’m not entirely sure. Either way, I think wrapping these lines in a try/catch (like here) and returning from the catch block should solve the problem.
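A rough sketch of what that try/catch could look like inside `_tryAcquireMacOrLinux()` (this is not the actual library code, and the ENOENT-detection helper is my assumption about the FileSystem API):

```ts
// Sketch only: if our own pidfile was deleted by a competing process between
// lockFileHandle.write(startTime) and the stat() call, treat this attempt as
// failed so that the retry loop in LockFile.acquire() simply tries again.
let currentBirthTimeMs: number;
try {
  currentBirthTimeMs = FileSystem.getStatistics(pidLockFilePath).birthtime.getTime();
} catch (error) {
  if (FileSystem.isNotExistError(error as Error)) {
    return undefined; // lock not acquired on this attempt
  }
  throw error;
}
```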

Update: see the detailed description of the race condition below.

Standard questions

Please answer these questions to help us investigate your issue more quickly:

| Question | Answer |
| --- | --- |
| Package name: | @rushstack/node-core-library |
| Package version? | 3.61.0 |
| Operating system? | Mac / Linux |
| Would you consider contributing a PR? | Yes |
| Node.js version (node -v)? | 20.9.0 |

You could also consider wrapping it in Async.runWithRetriesAsync.
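Something roughly like this on the caller's side (a sketch only: the option names are from the Async API as I recall it, and `lockFolder` / the resource name are placeholders):

```ts
import { Async, LockFile } from '@rushstack/node-core-library';

// Caller-side workaround: retry the whole acquire if it throws transiently.
const lock: LockFile = await Async.runWithRetriesAsync({
  action: () => LockFile.acquire(lockFolder, 'joom-crowdin'),
  maxRetries: 5,
  retryDelayMs: 100
});
```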

Would you be interested in putting together a PR to fix this?

Sure, I’ll make a PR this week!

I’ve dug a bit deeper into the problem, and I think I’ve managed to find the root cause.

The problem occurs when at least two processes repeatedly try to acquire the same lock, e.g. when each process is working through a task queue.

Reproduction steps:

  1. The first process acquires the lock and starts executing something (e.g. the current task).
  2. The second process simultaneously tries to acquire the same lock: it writes its lockfile, …, and scans the directory for other lockfiles.
  3. Immediately after that, the first process’s lock is released, which deletes the first lockfile.
  4. The second process handles each file found in step 2. The first lockfile no longer exists, so it fails to retrieve info about the first process’s lock, and the error is silently swallowed.
  5. The first process takes the next task from the queue and starts acquiring a new lock, writing a new lockfile.
  6. The second process uses stale data about the first process’s lockfile and deletes it, assuming that the first process is no longer executing but its lockfile still exists for some reason.
  7. The first process tries to retrieve stats for its new lockfile, which was deleted in the previous step by the second process, and this finally leads to the ENOENT error (see the sketch below).
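Here is a minimal sketch of the kind of script that triggers this interleaving when two copies are run in parallel (similar in spirit to the gist linked above, but not an exact copy; the path, resource name, and timings are arbitrary):

```ts
import { FileSystem, LockFile } from '@rushstack/node-core-library';

const lockFolder: string = `${__dirname}/lock`; // illustrative path

async function main(): Promise<void> {
  FileSystem.ensureFolder(lockFolder);

  // Simulate a task queue: acquire and release the same lock in a tight loop.
  // Running two instances of this script concurrently eventually crashes one
  // of them with ENOENT inside LockFile.acquire().
  for (let i = 0; i < 10_000; i++) {
    const lock: LockFile = await LockFile.acquire(lockFolder, 'repro');
    try {
      // Pretend to do a unit of work while holding the lock.
      await new Promise((resolve) => setTimeout(resolve, 5));
    } finally {
      lock.release();
    }
  }
}

main().catch((error) => {
  console.error(error);
  process.exit(1);
});
```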