thecodrr/fdir

Stream API / Async Iterator API

papb opened this issue · 10 comments

papb commented

Hello, first of all thanks for this awesome package!!

The documentation says that

Stream API will be added soon.

Is it already a work in progress? I would really like to see this.

Also, I want to suggest providing an Async Iterator API instead of a stream API. The reason is that an Async Iterator can be easily converted into a readable stream using into-stream without loss of performance (into-stream automatically handles backpressure and all the stream quirks), while the opposite conversion is very nontrivial (actually I think it's impossible, since readable streams start filling their internal buffers as soon as they begin flowing and therefore can't be converted to a one-step-at-a-time async iterator).
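For illustration, a rough sketch of that direction of conversion, assuming a hypothetical crawl() async generator (not fdir's real API) and into-stream's support for (async) iterables:

const intoStream = require("into-stream");

// Hypothetical async generator over crawl results -- not fdir's actual API.
async function* crawl(dir) {
  yield `${dir}/a.txt`;
  yield `${dir}/b.txt`;
}

// into-stream wraps the iterable in an object-mode readable stream and only
// pulls the next value when the stream asks for it, so backpressure is handled.
const readable = intoStream.object(crawl("/path/to/dir"));
readable.on("data", (filePath) => console.log(filePath));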

thecodrr commented

Hey!

Thank you for the suggestion. Yes, that is the plan: to implement an Async Iterator API. It will require a bit of a redesign, I think, but I will have a version up and ready soon.

papb commented

@thecodrr Awesome! If you need some help let me know (although I am not sure I will have time to help 😅)

thecodrr commented

After some "alpha" testing of the sync iterator API, there is a little bit of performance degradation. However, I don't think Iterator APIs should be used to get a final result; their biggest strength is streamed reading where you can pause or kill at any moment.

In any case, it is doable of course and the changes are minor as well. The only issue is to reduce code redundancy without affecting the async or sync implementations. I do not want to move the async/sync APIs over to use iterators because that's a very, very real performance tradeoff.

Another thought is that perhaps the current implementation isn't best suited for iterators? I mean, sure, it is fast, but maybe iterators could perform better (or the same) if they were implemented completely differently? Right now there is a huge back & forth which makes us add multiple yield*. If we could re-implement it so that there were as few yield* as possible, that would improve performance by a lot.
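As a purely illustrative comparison (neither of these is fdir's actual walker), the difference between delegating with yield* at every directory level and flattening the recursion into an explicit stack looks roughly like this:

const fs = require("fs");
const path = require("path");

// Recursive: one yield* delegation frame per directory on the path to each file.
function* walkRecursive(dir) {
  for (const dirent of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, dirent.name);
    if (dirent.isDirectory()) yield* walkRecursive(full);
    else if (dirent.isFile()) yield full;
  }
}

// Stack-based: a single generator, no yield* delegation at all.
function* walkStack(root) {
  const stack = [root];
  while (stack.length) {
    const dir = stack.pop();
    for (const dirent of fs.readdirSync(dir, { withFileTypes: true })) {
      const full = path.join(dir, dirent.name);
      if (dirent.isDirectory()) stack.push(full);
      else if (dirent.isFile()) yield full;
    }
  }
}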

vjpr commented

I'm interested to test out any developments of this.

A use case might be searching with early exit. Say I have a monorepo full of package.json files and I'm looking for one by name (a very common use case with pnpm --filter). I want to glob, read each package.json as soon as it arrives, and potentially stop the search when I find a match.
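A sketch of that use case against a hypothetical async-iterator API (crawlGlob is a made-up placeholder, not fdir's API):

const fs = require("fs/promises");

async function findPackage(root, packageName) {
  // crawlGlob() stands in for whatever fdir's future iterator API ends up being.
  for await (const file of crawlGlob(root, "**/package.json")) {
    const pkg = JSON.parse(await fs.readFile(file, "utf8"));
    if (pkg.name === packageName) return file; // early exit: stop crawling here
  }
  return null;
}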

I don't think back-pressure is a concern with globbing - most glob operation results can easily fit in memory because it's just a bunch of file paths; 1MM file paths would be on the order of 100 MB. Early exit would be useful to reduce unnecessary memory consumption, but I can't imagine the use in pausing the globbing process for memory concerns while doing some processing down the line.

So just buffer everything at full speed, and then when the first result is found, expose an async iterator to pull from available results.
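A minimal sketch of that pattern (all names here are illustrative): the crawler pushes paths into a buffer at full speed, and consumers pull them through an async iterator as results become available.

class ResultBuffer {
  constructor() {
    this.results = [];   // crawled paths not yet consumed
    this.pending = [];   // consumers waiting for the next result
    this.finished = false;
  }

  push(path) {
    const waiter = this.pending.shift();
    if (waiter) waiter({ value: path, done: false });
    else this.results.push(path);
  }

  end() {
    this.finished = true;
    for (const waiter of this.pending.splice(0)) waiter({ done: true });
  }

  async *[Symbol.asyncIterator]() {
    for (;;) {
      if (this.results.length) yield this.results.shift();
      else if (this.finished) return;
      else {
        const next = await new Promise((resolve) => this.pending.push(resolve));
        if (next.done) return;
        yield next.value;
      }
    }
  }
}

// crawler side: buffer.push(path) for every file, buffer.end() when done.
// consumer side: for await (const path of buffer) { ... break for early exit }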

And emit events for the stream API. I'm not sure exactly what the implications of streams vs. iterators are.

I read an article though about Async Iterator being bad for perf: https://medium.com/netscape/async-iterators-these-promises-are-killing-my-performance-4767df03d85b. Says batching is needed to optimize performance.

vjpr commented

into-stream

There is a built-in method for going from iterators to streams: https://nodejs.org/api/stream.html#streamreadablefromiterable-options

But maybe there is not as much control over chunking...
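For comparison, the built-in helper in use (crawl() is again a hypothetical async iterator): each value the iterable yields becomes exactly one chunk, which is the limited control over chunking mentioned above.

const { Readable } = require("stream");

// Readable.from() wraps any (async) iterable in an object-mode readable stream.
const stream = Readable.from(crawl("/path/to/dir"));
stream.on("data", (filePath) => {
  // one chunk per yielded path
});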


There is also ReadableStream

The WHATWG Streams Standard (or "web streams") defines an API for handling streaming data. It is similar to the Node.js Streams API but emerged later and has become the "standard" API for streaming data across many JavaScript environments.

Looks like the standard way moving forward. But it's experimental at present...
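A rough sketch of bridging an async iterable to a web ReadableStream using that experimental built-in (crawl() is still hypothetical):

const { ReadableStream } = require("stream/web");

function toWebStream(asyncIterable) {
  const iterator = asyncIterable[Symbol.asyncIterator]();
  return new ReadableStream({
    // pull() is called whenever the stream's internal queue has room,
    // so backpressure falls out of the iterator protocol for free.
    async pull(controller) {
      const { value, done } = await iterator.next();
      if (done) controller.close();
      else controller.enqueue(value);
    },
  });
}

const webStream = toWebStream(crawl("/path/to/dir"));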

vjpr commented

For basic streaming support we just need to add an emitter here:

fdir/src/api/walker.js

Lines 144 to 160 in 3598e83

Walker.prototype.buildPushFile = function buildPushFile(
  filters,
  onlyCountsVar,
  excludeFiles
) {
  if (excludeFiles) return;
  if (filters.length) {
    this.pushFile = onlyCountsVar
      ? fns.pushFileFilterAndCount
      : fns.pushFileFilter;
  } else if (onlyCountsVar) {
    this.pushFile = fns.pushFileCount;
  } else {
    this.pushFile = fns.pushFile;
  }
};
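A rough sketch of what "adding an emitter here" could look like, wrapping whichever pushFile implementation was selected (withEmitter and the "file" event are made up for illustration, not part of fdir):

const EventEmitter = require("events");

function withEmitter(walker) {
  const emitter = new EventEmitter();
  const originalPushFile = walker.pushFile;
  // match the existing call site: this.pushFile(this, filename, files)
  walker.pushFile = function (walkerRef, filename, files) {
    emitter.emit("file", filename);
    return originalPushFile(walkerRef, filename, files);
  };
  return emitter;
}

// usage: withEmitter(walker).on("file", (p) => console.log("found", p));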

Or here might be better:

fdir/src/api/walker.js

Lines 65 to 100 in 3598e83

Walker.prototype.processDirents = function processDirents(
  dirents,
  directoryPath,
  currentDepth
) {
  this.pushDir(this, directoryPath, this.state.paths);
  const files = this.getArray(this.state);
  for (var i = 0; i < dirents.length; ++i) {
    const dirent = dirents[i];
    if (dirent.isFile()) {
      const filename = this.joinPath(dirent.name, directoryPath);
      this.pushFile(this, filename, files);
    } else if (dirent.isDirectory()) {
      let path = fns.joinPathWithBasePath(dirent.name, directoryPath);
      this.walkDir(this, path, dirent.name, currentDepth - 1);
    }
    // perf: we can avoid entering the condition block if .withSymlinks is not set
    // by using symlinkResolver !== fns.empty; this helps us avoid wasted allocations -
    // which are probably very minor but still.
    else if (dirent.isSymbolicLink() && this.symlinkResolver !== fns.empty) {
      let path = fns.joinPathWithBasePath(dirent.name, directoryPath);
      this.symlinkResolver(path, this.state, (stat, resolvedPath) => {
        if (stat.isFile()) {
          this.pushFile(this, resolvedPath, files);
        } else if (stat.isDirectory()) {
          this.walkDir(this, resolvedPath, dirent.name, currentDepth - 1);
        }
      });
    }
  }
  this.groupFiles(directoryPath, files, this.state);
};

This would allow us to add debugging information about symlink resolution, which could be important for performance debugging. I would like the ability to log if a file was skipped because it was a symlink and symlink was disabled.

thecodrr commented

I would like the ability to log if a file was skipped because it was a symlink and symlink was disabled.

This doesn't "sound" good for performance, though I haven't tested it. The way the Builder API is structured allows for extensions like these, but I am fearful of cluttering the code to the point where it starts to make less sense. That is why I am planning to refactor the code-base a little bit to allow "plugins". The core crawling will act as a base for all these extra features like relative paths, filtering, globbing, etc. Each plugin would be fairly independent and won't clutter the code, not to mention that it would allow 3rd-party plugins into the fdir crawling process.

I read an article though about Async Iterator being bad for perf

The testing I have done implies the same. In a very crude sense, you could say that the more you yield, the slower it'll become. However, I think performance can improve a little bit if we don't push into 1 global array of strings for every file and instead have an array of paths for the current directory only. That array could be yielded, acting essentially as a batch of paths.

However, the main benefit of Iterators is the ability to work on a single item at a time and bail out when that work is finished (or continue if not). I will have to do more benchmarking to see which approach is the best. I do know, however, that pushing lots of strings into an array takes some time and if that can be swapped for yield without a performance penalty, fdir might become faster than jwalk.
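A sketch of that per-directory batching idea (again illustrative, not fdir's walker): instead of pushing every path into one global array, each directory's files are yielded as a single batch.

const fs = require("fs");
const path = require("path");

function* walkBatched(root) {
  const stack = [root];
  while (stack.length) {
    const dir = stack.pop();
    const batch = [];
    for (const dirent of fs.readdirSync(dir, { withFileTypes: true })) {
      const full = path.join(dir, dirent.name);
      if (dirent.isDirectory()) stack.push(full);
      else if (dirent.isFile()) batch.push(full);
    }
    if (batch.length) yield batch; // one yield per directory, not per file
  }
}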

An async iterator is probably "bad" for CPU performance - but good for memory usage. This might matter if you're scanning millions of files, although most systems have memory in abundance these days.

Presumably, the goal with this feature is to have scripts provide feedback as quickly as possible, as well as maybe saving some memory.

What if we were to collect some number of results before feeding them through? Or collecting in time intervals? Or even both?

const files = new fdir()
  .crawl("/path/to/dir")
  .stream(1000, 100); // 1000 entries or 100 milliseconds

for await (const chunk of files) {
  for (const file of chunk) {
    // ...
  }
}

The files would be an iterator that yields every 1000 results or every 100 milliseconds, keeping scripts interactive - and each chunk would be an array, for speed.

If the iterator yields 1000 chunks for a million files, that's unlikely to create either substantial memory or CPU overhead? 🙂
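A rough sketch of how such count-or-time batching could wrap any async iterable (the batch helper and its defaults are made up; fdir has no .stream() yet):

async function* batch(source, maxCount = 1000, maxMs = 100) {
  let chunk = [];
  let deadline = Date.now() + maxMs;
  for await (const item of source) {
    chunk.push(item);
    // note: the time check only runs when an item arrives; a real
    // implementation would race a timer against the iterator instead.
    if (chunk.length >= maxCount || Date.now() >= deadline) {
      yield chunk;
      chunk = [];
      deadline = Date.now() + maxMs;
    }
  }
  if (chunk.length) yield chunk; // flush the final partial batch
}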