apify/crawlee

Node crash on Crawlee running fs.stat on a request_queue lock file

Clearmist opened this issue · 4 comments

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/core

Issue description

The crawler, while running, will randomly crash Node. I tried the experimental option for disabling locking, but it still happens. I doubt this is a permission issue, because my user has write permission to this entire directory structure and I've also tried running as administrator.

I'm okay if the root of this issue doesn't get fixed. At the very least I'd like to know where I can put a try/catch so this error doesn't crash Node and the crawler can continue.

Node is evidently trying to get file information from a lock file and dying.

node:internal/process/promises:289
            triggerUncaughtException(err, true /* fromPromise */);
            ^

[Error: EPERM: operation not permitted, stat 'C:\Users\{username}\Repositories\crawler-app\storage\request_queues\2fdd8a2d-a180-48a1-9f36-28d5a2793b36\y0jxi0Gs1ISlI1y.json.lock'] {
  errno: -4048,
  code: 'EPERM',
  syscall: 'stat',
  path: 'C:\\Users\\{username}\\Repositories\\crawler-app\\storage\\request_queues\\2fdd8a2d-a180-48a1-9f36-28d5a2793b36\\y0jxi0Gs1ISlI1y.json.lock'
}
Steps to reproduce:

  1. Start a Cheerio crawler instance with a custom request queue name on a Windows machine.

Code sample

import { randomUUID } from 'node:crypto';
import path from 'node:path';
import { app } from 'electron';
import { CheerioCrawler, Configuration, RequestQueue } from 'crawlee';

const alias = randomUUID();

const address = 'https://{testing-address}';

const config = new Configuration({
  storageClientOptions: {
    localDataDirectory: path.join(app.getPath('userData'), 'crawlerStorage'),
  },
});

const requestQueue = await RequestQueue.open(alias);

await requestQueue.addRequest({ url: address });

const options = {
  experiments: {
    // Request locking is enabled by default since 3.10.0.
    // I've tried setting it to false and it still locks request json files.
    requestLocking: false,
  },
  requestQueue,
  ...
};

const crawler = new CheerioCrawler(options, config);

await crawler.run();

Package version

3.11.1

Node.js version

20.10.0

Operating system

Windows 10

Apify platform

  • Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

3.11.2-beta.17

Other context

No response

3.11.2-beta.17

Have you really seen that on the latest beta?

// Request locking is enabled by default since 3.10.0.
// I've tried setting it to false and it still locks request json files.
requestLocking: false,

That feature is about something else. What you see are the local file locks in the memory storage; they're an implementation detail of working with the file system.

cc @vladfrangu, not sure if #2603 was supposed to help with this one too. Also, doesn't this mean the lock is not acquirable and we are missing a retry somewhere?

That PR wasn't supposed to help with that; they're unrelated things.

I don't think I've ever seen a stat error like that before. Also, it looks like the path the user provided isn't being used for the storages, if the stack trace is anything to go by, which hints at the wrong variable being passed somewhere?

Also, I'm fairly certain we try to lock 3 times before giving up. @Clearmist, can you get us a full stack trace please? 🙏

@B4nan, yes I am getting this issue on the @next branch. I was using latest, but moved to @next after seeing the field on the bug report form.

@vladfrangu Good catch about the path being different from what I set using the localDataDirectory option! I hadn't noticed that. I'll try running with the default value of that option. Update: it still failed with the same error.

I have the request_queues directory open and see this. Are the json.lock files supposed to be seen as directories by the host OS? Maybe Node running fs.stat on a directory is the reason for the crash.

[Screenshot: the .json.lock entries inside the request_queues directory are listed as folders in Windows Explorer.]
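
For context, here is a minimal sketch of the kind of locking that produces .lock entries like these, assuming a proper-lockfile-style lock (the .lock suffix suggests it, but whether Crawlee's memory storage uses that exact library is an assumption on my part). Such locks are acquired by atomically creating a <file>.lock directory with mkdir, so seeing them as directories in Explorer would be expected, and the acquire is typically retried a few times before giving up.

// Sketch only: proper-lockfile-style locking creates a `<file>.lock`
// DIRECTORY (mkdir is an atomic operation), not a regular file.
// Whether @crawlee/memory-storage uses this exact library is an assumption.
import lockfile from 'proper-lockfile';
import { writeFile } from 'node:fs/promises';

async function lockedWrite(filePath, data) {
  // Retry the acquire a few times before giving up, similar to the
  // "try to lock 3 times" behaviour mentioned above.
  const release = await lockfile.lock(filePath, { retries: 3 });
  try {
    await writeFile(filePath, data);
  } finally {
    // Removes the `<filePath>.lock` directory again.
    await release();
  }
}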

I'd love to get a stack trace, but I tried these three handlers and none of their callbacks were called.

process.on('uncaughtException', (err) => {...
process.on('unhandledRejection', (reason, p) => {...
process.on('SIGINT', () => {...

I even tried wrapping crawler.run() in try/catch.

try {
  await crawler.run();
} catch (error) {
  ...
}

Do you know of other ways I can generate a full stack trace when the Node process crashes? Maybe there's somewhere in Crawlee where I could regularly print a stack trace (if that would help).
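
A few general Node.js diagnostics that might surface more detail (these are standard Node APIs and flags, not anything Crawlee-specific, and whether they catch this particular crash inside an Electron app is an assumption):

// General Node.js diagnostics; nothing here is Crawlee-specific.

// 1) The capture callback replaces the 'uncaughtException' event entirely
//    and may fire where the event listener did not.
process.setUncaughtExceptionCaptureCallback((err) => {
  console.error('Captured fatal error:', err);
  console.error(err.stack);
});

// 2) Log the exit code just before the process dies; this at least confirms
//    which process is crashing (the Electron main process vs. a child).
process.on('exit', (code) => {
  console.error(`Process exiting with code ${code}`);
});

// 3) If the crash reproduces under plain Node (outside Electron), starting it
//    with `node --trace-uncaught app.js` also prints the stack of the throw
//    site for uncaught exceptions.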

I updated to 3.11.3 and this issue is still present.

[Error: EPERM: operation not permitted, stat]
{
  errno: -4048,
  code: 'EPERM',
  syscall: 'stat',
  path: 'C:\\Users\\{username}\\Repositories\\crawler-app\\storage\\request_queues\\nasa3d.arc.nasa.gov\\4fC3CInttKDsieR.json.lock'
}

I can see that the path in the error is not where I told Crawlee to store the local data. Here is my configuration object:

const config = new Configuration({
  storageClientOptions: {
    localDataDirectory: path.join(app.getPath('userData'), 'crawlerStorage'),
  },
});

What Crawlee uses
C:\Users\{username}\Repositories\crawler-app\storage\

What I told it to use
C:\Users\{username}\AppData\Roaming\Electron\crawlerStorage\

The datasets are stored in the right place, but the request_queues are being stored in the incorrect directory.

Also, the .lock files are showing up as directories in Windows 10.
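
Coming back to the path mismatch: one possible explanation, though it's an assumption and not confirmed anywhere in this thread, is that the custom Configuration is only passed to the CheerioCrawler constructor, while RequestQueue.open() is called without it, so the queue may be created against the default global configuration and its default ./storage directory. If I read the API right, RequestQueue.open() also accepts the configuration in its options, so something along these lines might route the queue files to the configured directory as well:

import path from 'node:path';
import { randomUUID } from 'node:crypto';
import { app } from 'electron';
import { CheerioCrawler, Configuration, RequestQueue } from 'crawlee';

// Sketch, assuming RequestQueue.open() accepts the Configuration in its
// options and that the path mismatch comes from the queue being opened
// against the default global configuration instead of this one.
const config = new Configuration({
  storageClientOptions: {
    localDataDirectory: path.join(app.getPath('userData'), 'crawlerStorage'),
  },
});

const alias = randomUUID();

// Pass the same config the crawler receives, so the queue's files end up
// under the configured localDataDirectory too.
const requestQueue = await RequestQueue.open(alias, { config });

const crawler = new CheerioCrawler({ requestQueue }, config);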