Compare to native options
vjpr opened this issue · 22 comments
I am interested in how much faster it could be using native code, via child_process or N-API.
Someone already mentioned Rust's https://github.com/jessegrosjean/jwalk. Some other Rust discussion here. Would be cool to have a napi-rs wrapper on it and see the speed diff.
Rust

What’s the fastest way to read a lot of files?

- jwalk
- ripgrep
  - Grepping util, not limited to walking dirs.
  - Comparisons
  - Uses crossbeam for parallelism and ignore for the recursive dir walker.
- ignore (docs)
  - Is actually inside the ripgrep repo.
  - Speed comes from building on Rust's regex engine. (Is it really faster than everything else?)
- walkdir
- fd
  - Uses ignore and a lot of ripgrep util crates.
- fzf (Golang)
- globwalk
  - Based on ignore and walkdir.

C

- gitstatusd
  - This is a great read about how gitstatusd optimized its dir listing code: Fast directory listing
  - https://github.com/romkatv/gitstatus#benchmarks
- the_silver_searcher
JS Glob Libs

- https://github.com/terkelg/tiny-glob
  - Uses globrex.
  - ~350% faster than node-glob and ~230% faster than fast-glob.
- https://github.com/mrmlnc/fast-glob
  - Uses micromatch, can also use picomatch.
JS Matcher Libs
- isaacs/minimatch
- micromatch/micromatch
- micromatch/picomatch
- globrex
WASI
No implementation exists yet.
Would be good to try with AssemblyScript.
I considered going native before but the benefit of a zero-dependency, purely JavaScript library is far greater: I don't have to deal with platform differences or worry about building the module on all platforms. N-API is a nightmare. Right now, fdir just works and it is fast. Aside from that, I am also not proficient enough in Rust or C or C++ to write high-performance code. I am sure native would be faster.
I would love to see a benchmark of fdir against other native options. That'd actually be pretty awesome. PRs are welcome 😄
Alright, curiosity got the better of me and I benchmarked jwalk against fdir using napi-rs.
Code
I am no expert in Rust so this code is probably sub-par.
#![deny(clippy::all)]

#[macro_use]
extern crate napi_derive;

use jwalk::{Parallelism, WalkDir};
use napi::{CallContext, JsObject, JsUndefined, Result};

#[module_exports]
fn init(mut exports: JsObject) -> Result<()> {
    exports.create_named_method("jwalk", jwalk)?;
    Ok(())
}

#[js_function(1)]
fn jwalk(ctx: CallContext) -> Result<JsUndefined> {
    // let dir = ctx.get::<JsString>(0)?.into_utf8()?;
    let mut entries: Vec<String> = vec![];
    for entry in WalkDir::new("/")
        .parallelism(Parallelism::RayonNewPool(8))
        .skip_hidden(false)
    {
        match entry {
            Ok(entry) => entries.push(entry.path().display().to_string()),
            Err(_err) => {
                println!("failed to access entry");
            }
        }
    }
    ctx.env.get_undefined()
}
I tried to be as fair as possible but my limited knowledge of Rust probably doesn't help. Maybe @Brooooooklyn can take a look and advise some improvements to the Rust code.
The benchmark code:
const { fdir } = require("./index.js");
const { jwalk } = require("fdir-bench");

(async () => {
  console.time("fdir");
  for (let i = 0; i < 5; ++i) {
    await new fdir()
      .withBasePath()
      .withDirs()
      .crawl("/")
      .withPromise();
  }
  console.timeEnd("fdir");

  console.time("jwalk");
  for (let i = 0; i < 5; ++i) {
    jwalk("/");
  }
  console.timeEnd("jwalk");
})();
Of course, this is very simplistic but it does the job.
Results
fdir: 7.795s
jwalk: 6.958s
Number of entries found:
fdir: 1685964 (dirs & files)
jwalk: 1689508 (dirs & files)
Thoughts
Well, fdir is pretty fast. Only about 0.8s slower than jwalk. I am certain that after a few optimizations fdir can beat jwalk.
Try this code to avoid re-allocation in Rust:
let result: Vec<String> = WalkDir::new("/")
    .parallelism(Parallelism::RayonNewPool(8))
    .skip_hidden(false)
    .filter_map(|dir_entry| dir_entry.ok().map(|entry| entry.path().display().to_string()))
    .collect();
Unfortunately getting this:
error[E0599]: the method `filter_map` exists for struct `WalkDirGeneric<((), ())>`, but its trait bounds were not satisfied
--> src/lib.rs:25:10
|
25 | .filter_map(|dir_entry| {
| ^^^^^^^^^^ method cannot be called on `WalkDirGeneric<((), ())>` due to unsatisfied trait bounds
|
::: /home/thecodrr/.cargo/registry/src/github.com-1ecc6299db9ec823/jwalk-0.5.1/src/lib.rs:158:1
|
158 | pub struct WalkDirGeneric<C: ClientState> {
| ----------------------------------------- doesn't satisfy `WalkDirGeneric<((), ())>: Iterator`
|
= note: the following trait bounds were not satisfied:
`WalkDirGeneric<((), ())>: Iterator`
which is required by `&mut WalkDirGeneric<((), ())>: Iterator`
Ah, had to do into_iter.
Any difference in results?
Also might be useful to add a pattern filter such as '**/package.json'.
Did some testing of my own.
MacBook Pro 16 - 2.4 GHz 8-Core Intel Core i9
jwalk - cargo run (unoptimized + debuginfo)
16 threads = 3738ms
jwalk - cargo run --release
8 threads = 2363ms
16 threads = 1732ms
jwalk - napi (returning results)
817,100 files
16 threads = 2369ms
fdir
773,467 files
2340ms
jwalk - cargo run - filter node_modules
300ms
jwalk - cargo run - filter node_modules + only package.json
...
fdir - filter node_modules
968ms
fdir - filter node_modules + only package.json
1020ms
It's actually very impressive how close they are considering one is running natively...for a large number of files. Although when globbing is involved, jwalk is winning by quite a bit.
Tried using both .glob and .filter to search for package.json and I couldn't get it below 900ms.
I also found that using the Rust tool fd I can get 180ms, which is impressive. Seems like all these libraries are adding a lot of overhead somewhere.
time fd package.json > /dev/null
fd package.json > /dev/null 0.17s user 0.18s system 824% cpu 0.043 total
fd seems to use the ignore package to walk dirs (which is actually used in ripgrep), which jwalk is supposed to be faster than.
My use case is to speed up the pnpm package manager reading all package.json files on startup. Currently they use fast-glob, which uses micromatch. Because this has to run on startup, 200ms makes a big difference compared to 1s.
Going to try out ignore as a native lib. Seeing if there is a good API to use: BurntSushi/ripgrep#2028
Could also re-use the glob libs from ignore/ripgrep with jwalk and it should be fast enough. jwalk has a simple parallel API.
Maybe worth looking at using tiny-glob (globrex) for fdir as it claims to be a lot faster.
@vjpr Can you share your benchmark code? I'd like to take a look and see what is causing that huge bottleneck.
After a couple "optimizations", here are the results:
jwalk: 2.6862s
ignore: 5.747s
fdir: 3.2936s
These results are the average of 5 iterations on the root directory containing 1.6 million files & about 300k directories.
jwalk is very fast. fdir gets really close but not quite there. I haven't tried using worker_threads yet for directory traversal, so perhaps there is opportunity for performance there. ignore is slow. I have no idea how fd is getting such performance. They say they are doing it in parallel, which is also being done by jwalk (via rayon) & fdir (via libuv).
@vjpr So I went ahead and did a benchmark with fd, which you claimed to do everything within 200ms. Here are the results:
time fdfind --hidden --no-ignore package.json / > /dev/null
real 0m1.203s
user 0m4.639s
sys 0m4.378s
And I did the same thing with fdir:
time UV_THREADPOOL_SIZE=2 NODE_ENV=production node test.js
real 0m1.226s
user 0m2.097s
sys 0m1.464s
Both returned around the same number (~26k) of results.
Now, maybe I did it wrong but clearly fdir is faster than or about the same as fd in terms of speed. Of course, fd is doing plenty of other things so it isn't exactly fair, but still...
Here's the code for the above fdir script:
await new fdir()
  .withBasePath()
  .filter((path) => path.indexOf("package.json") > -1)
  .crawl("/")
  .withPromise();
One thing I noticed: using UV_THREADPOOL_SIZE=2 gave the best performance. It wasn't much worse with the default threadpool size either, but if I increased it to 8, the crawling speed went down 2x.
@thecodrr What repo were you using to test?
Also just to note, the default is UV_THREADPOOL_SIZE=4.
I didn't use any repo. I ran it in my root. Yes, that is 1.6 million files & 300k directories, bigger than 99% of repositories, which means more pattern failures. If you run it inside the fdir repo, it completes in 12-13 ms.
Which repo did you test on?
Also just to note, default UV_THREADPOOL_SIZE=4.
Yes, I know. That's what surprised me. No idea why size of 2 gave a slightly better perf. The difference is too negligible on small directories though.
I'm using a large internal monorepo. We should find a nice and large public Node.js monorepo to use for tests.
So for me, increasing UV_THREADPOOL_SIZE=8, fdir starts beating jwalk, which is cool!
What machine are you using?
I'm still yet to test out ignore with crossbeam parallelism, which seems to get 200ms.
I'm using a large internal monorepo. We should find a nice and large public Node.js monorepo to use for tests.
What do you suggest? React? Jest?
So for me, increasing UV_THREADPOOL_SIZE=8, fdir starts beating jwalk, which is cool!
Very interesting.
What machine are you using?
Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz with 16 GB RAM and of course an SSD (a mediocre one). And I am on Linux.
So ignore with parallel is about 200ms faster than fdir. 12 threads.
fn get_overrides(source: String) -> Override {
    let mut override_builder = OverrideBuilder::new(source.clone());
    let exclude_patterns = vec!["**/package.json", "!**/node_modules/**"];
    for pattern in exclude_patterns {
        override_builder
            .add(pattern)
            .map_err(|e| anyhow!("Malformed exclude pattern: {}", e))
            .unwrap();
    }
    let overrides = override_builder
        .build()
        .map_err(|_| anyhow!("Mismatch in exclude patterns"));
    overrides.unwrap()
}

pub fn walk_parallel(source: String) -> Vec<String> {
    let overrides = get_overrides(source.clone());
    let (tx, rx) = mpsc::channel();
    WalkBuilder::new(source)
        .hidden(false)
        .overrides(overrides)
        .standard_filters(false)
        .threads(cmp::min(12, num_cpus::get()))
        .build_parallel()
        .run(move || {
            let tx = tx.clone();
            Box::new(move |dir_entry_result| {
                if let Ok(dir_entry) = dir_entry_result {
                    tx.send(dir_entry.file_name().to_owned()).unwrap();
                }
                ignore::WalkState::Continue
            })
        });
    let metadatas: Vec<_> = rx.into_iter().map(|dir| dir.into_string().unwrap()).collect();
    // metadatas.sort_by(|a, b| a.len().cmp(&b.len()))
    metadatas
}
I'm putting together a repo now.
Linux
There was some discussion here about macOS APFS being slower:
https://users.rust-lang.org/t/whats-the-fastest-way-to-read-a-lot-of-files/39743/4?u=vjpr
...leads me to the conclusion that APFS obtains a global kernel lock during read-only operations such as readdir(). In other words, APFS slows down when attempting to perform parallel read-only I/O.
...
It is apparent that macOS 10.14 Mojave has received performance work relative to macOS 10.13
https://gregoryszorc.com/blog/2018/10/29/global-kernel-locks-in-apfs/
https://news.ycombinator.com/item?id=18332260
Although it could be just a per-directory lock for readdir, as someone mentions.
So ignore with parallel is about 200ms faster than fdir. 12 threads.
That's a lot of effort 😆. I put together a very, very rough parallel directory crawl in Node using clusters to get 1.4s. Without clusters, I was getting around 6 seconds. fdir can touch 1.4s without clusters. Guess where fdir would be after proper parallelism...
Nice. Cluster seems like the way to go to reach Node.js's max perf. Although for CLI cold-start use cases, it would depend on how long it takes the process pool to spin up, which might negate the benefit.
My particular use case is the pnpm package manager. Many commands need to read the entire tree on startup, so if I could get that down to <500ms for large monorepos it would be great.
FYI, not sure workers will help.
Workers (threads) are useful for performing CPU-intensive JavaScript operations. They do not help much with I/O-intensive work. The Node.js built-in asynchronous I/O operations are more efficient than Workers can be.
https://nodejs.org/api/worker_threads.html#worker-threads
so if i could get that down to <500ms for large monorepos it would be great.
You'll have to give an example of a large monorepo so I can test it. To the best of my knowledge, you can easily traverse a fairly large monorepo in under 500ms. That's excluding symlink resolution; I haven't tested with symlink resolution but it would be much, much slower.
FYI, not sure workers will help.
Yes, but child_process might. Have to check and see. I don't expect a huge difference mostly because libuv has all the threads maxed out most of the time. But doesn't hurt to try.
Although for CLI cold-start use cases, would depend on how long it takes the process pool to spin up
This is a real concern.
The main bottleneck is readdir and as far as I know, there are no faster alternatives in Node yet. At the end of the day, everything is limited by IO, so there is a point after which the quest for performance becomes foolish.
The main bottleneck is readdir and as far as I know, there are no faster alternatives in node yet.
fs.readdir is the most direct mapping to the underlying syscall readdir(3), so this will always be a limit in a Node env.
The only potential alternative would be WASI, but the Node.js implementation just uses libuv under the hood anyway. However, when using WASI the regex logic could be moved into WASM, which might improve things. AssemblyScript could be used to write the WASM in TypeScript rather than needing a syntax change.
If we don't restrict ourselves to cold-start, we could use a persistent daemon that watches the file system for changes. This is a way to get instant speed.

- a. On file change, re-walk the entire tree. CLI queries daemon for instant response.
  - High background CPU load.
- b. On file change, run matchers on only the changed files. CLI queries daemon for instant response.
  - Less background CPU load.
- c. On CLI start, load a cache, query daemon for file changes, and apply. I use FB's watchman package for this at the moment.
  - No background CPU load.

I think that covers the entire solution space for reading dir trees :D
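Option (b)'s bookkeeping can be sketched independently of any particular watcher: the daemon keeps the matched set in memory and re-runs the matcher only on paths the watcher reports as changed. `MatchCache` is a hypothetical name and `matches` stands in for a real glob matcher; the fs.watch/watchman plumbing is omitted.

```javascript
// Sketch of option (b): keep the matched set in memory; on each change
// notification, apply the matcher only to the changed path. The CLI then
// queries `query()` for an instant answer instead of re-walking the tree.
class MatchCache {
  constructor(matches) {
    this.matches = matches;   // (path) => boolean, e.g. a compiled glob
    this.matched = new Set(); // current result set
  }

  // Called by the watcher for every created/modified/deleted path.
  onChange(path, removed = false) {
    if (removed || !this.matches(path)) this.matched.delete(path);
    else this.matched.add(path);
  }

  query() {
    return [...this.matched];
  }
}
```

The cost per file event is one matcher call, so background CPU load scales with change rate rather than tree size.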
My benchmarks:
robocopy.exe with /log:output.txt - 7s
fdir - 2 minutes
On a volume with around 1,000,000 files.
@PlkMarudny what's your code for fdir and what are you benchmarking through?