Thoughts on performance
ev-dev opened this issue · 3 comments
I'll preface this comment with a disclaimer that I have a very high-level understanding of filesystems (compared to those who work on projects like these anyway), and that I'm very impressed with how much faster erdtree is on my machine (win11, pwsh) compared to similar projects like gdu/pdu/dua/etc in most cases.
There is however one project out there that in my experience, continues to trump all others in terms of performance for simple disk usage stats, and it's baffled me for long enough now that I really want to understand more and maybe publicize whatever tricks might be going on in this little-known gem.
It's called `byenow` (https://github.com/apankrat/byenow) and yes: it's not actually a disk usage tool, it's mostly closed source, and worst of all it's a Windows-only project...
But hear me out because it's about an order of magnitude faster than anything else I've tried (caveats below).
It's a CLI tool for parallelized, rapid file deletion that happens to include optional flags: `--preview` (prevents any actual deleting) and `--show-bytes`, which reports disk usage stats in its output. The caveat is that there is no breakdown of the size distribution per file/dir, but as far as I can tell that is merely a side effect of the tool being targeted at a different purpose. My point is that the underlying techniques being used for traversal/block-size calculation seem to be the real magic.
Here's a sample of what I'm talking about: a cold run on a dir with >650k subdirectories and >1.8M files finishes in just over 11 sec, while the same dir takes >2 min to complete with erdtree and the others (given options that would produce similar output as below, i.e. depth of 1, max thread count, etc.):
I know that given the OS-specificity and incompleteness of the project's source code, my comment may not be of interest to anyone here, so feel free to close right away. Maybe it's just my envy of systems-level engineers, but I figured I'd throw it out here and see if any I/O wizards find it interesting.
Thanks to those working on this project btw. It's very useful & pretty too!
Thanks for sharing! I'll leave this issue open for the time being to facilitate discussion. I'll share some little nuggets below to give some insight into the internal workings of `erdtree`, how they ultimately relate to performance, and how that compares to `byenow`.
The algorithm
On a high level, `erdtree` starts by spinning up a threadpool that is dedicated to disk I/O, which includes fetching a file's data, its metadata, and its extended attributes, as well as following symbolic links. Once that data is in memory it is transformed into `erdtree`'s virtual representation of these files (i.e. a `Node`), still in the threadpool.
Once all of the files have been read into memory and transformed into `Node`s, we then assemble the tree data structure to create a virtual representation of the filesystem without any further I/O; any granular filtering, such as filtering filenames via regular expressions, happens here as well. Once the tree is assembled we traverse it to compute the size of each directory, doing a bit of extra work to avoid double-counting hard links.
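To make that shape concrete, here's a minimal sketch of the two phases, with a simplified stand-in for `Node` and a hand-rolled pool of scoped threads. This is illustrative only, not `erdtree`'s actual code:

```rust
use std::fs;
use std::path::PathBuf;
use std::sync::mpsc;
use std::thread;

// Hypothetical, stripped-down stand-in for erdtree's Node.
#[allow(dead_code)]
struct Node {
    path: PathBuf,
    size: u64,
    is_dir: bool,
}

fn main() -> std::io::Result<()> {
    let paths: Vec<PathBuf> = fs::read_dir(".")?
        .filter_map(|entry| entry.ok().map(|e| e.path()))
        .collect();

    let (tx, rx) = mpsc::channel::<Node>();

    // Phase 1: a small pool of I/O threads turns paths into Nodes and
    // streams them over a channel. (A real implementation would use a
    // work-stealing pool and recurse into subdirectories.)
    let chunk_size = (paths.len() / 4).max(1);
    thread::scope(|scope| {
        for chunk in paths.chunks(chunk_size) {
            let tx = tx.clone();
            scope.spawn(move || {
                for path in chunk {
                    if let Ok(md) = fs::symlink_metadata(path) {
                        let _ = tx.send(Node {
                            path: path.clone(),
                            size: md.len(),
                            is_dir: md.is_dir(),
                        });
                    }
                }
            });
        }
    });
    drop(tx); // close the channel so the collection below terminates

    // Phase 2: purely in-memory from here on: assemble the tree,
    // filter, compute directory sizes, render output.
    let nodes: Vec<Node> = rx.into_iter().collect();
    println!("collected {} nodes", nodes.len());
    Ok(())
}
```

Everything after the channel drains (tree assembly, filtering, size computation, rendering) is CPU-bound work on data that is already in memory.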
Why a threadpool for disk-I/O
The reason we leverage a threadpool, despite the disk processing requests serially, is that we're aiming to saturate the disk's queue depth. That keeps the disk busy, allowing it to process requests in aggregate rather than sitting idle between `erdtree`'s CPU-bound processing and the sending of the next request. This has been shown empirically to result in better throughput.
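The practical consequence is that an I/O-bound pool can be sized well past the core count, since its threads spend most of their time parked in syscalls rather than using the CPU. Here's a hedged sketch of that idea using the `rayon` crate; it is not necessarily how `erdtree` configures its own pool:

```rust
// Sketch only: an I/O pool sized independently of the CPU count so that
// several metadata requests stay in flight at once.
use std::fs;
use std::path::PathBuf;

use rayon::prelude::*;

fn total_len(paths: &[PathBuf]) -> u64 {
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(16) // more threads than cores: they block on I/O, not CPU
        .build()
        .expect("failed to build I/O pool");

    pool.install(|| {
        paths
            .par_iter()
            .filter_map(|path| fs::symlink_metadata(path).ok())
            .map(|md| md.len())
            .sum()
    })
}

fn main() {
    let paths: Vec<PathBuf> = fs::read_dir(".")
        .unwrap()
        .filter_map(|entry| entry.ok().map(|e| e.path()))
        .collect();
    println!("{} bytes across {} entries", total_len(&paths), paths.len());
}
```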
The tree data structure
`erdtree` leverages a single-vector, arena-based tree as opposed to a heap-based tree. A heap-based tree would have `Node`s holding pointers to a vector of `Node`s (a directory and its children), which would result in a lot of indirection and cache misses while traversing the tree to assemble the output. An arena-based tree, because of its single-vector nature, is theoretically better for performance because `Node`s are tightly packed, all living on adjacent cache lines (i.e. better cache locality), meaning we'd likely make fewer round-trips all the way out to main memory from CPU-land.
edit: Important to note that better utilization of the CPU cache is only postulation. I'm pretty confident that this is the case, but indexing in a non-linear manner isn't a very predictable data-access pattern, which hurts the CPU's ability to do speculative prefetching, i.e. eager loading of data from main memory into the CPU caches.
edit II: Continuing from the last edit: still, due to the way memory is actually loaded into the CPU caches via cache lines, we do get contiguous `Node`s loaded in together whenever one is queried, since they all live in a single vector.
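To make the contrast concrete, here's a minimal sketch of the two layouts; the field names are illustrative stand-ins, not `erdtree`'s actual types:

```rust
use std::path::PathBuf;

// Heap-based layout: each directory owns its children behind separate
// allocations, so traversal chases pointers scattered across the heap.
#[allow(dead_code)]
struct HeapNode {
    path: PathBuf,
    size: u64,
    children: Vec<Box<HeapNode>>,
}

// Arena-based layout: every node lives in one contiguous Vec and refers
// to its children by index into that same Vec.
#[allow(dead_code)]
struct ArenaNode {
    path: PathBuf,
    size: u64,
    children: Vec<usize>, // indices into Tree::nodes
}

struct Tree {
    nodes: Vec<ArenaNode>, // tightly packed; neighbouring nodes share cache lines
}

impl Tree {
    // Sum the sizes reachable from `root` by chasing indices within one vector.
    fn subtree_size(&self, root: usize) -> u64 {
        let node = &self.nodes[root];
        node.size
            + node
                .children
                .iter()
                .map(|&child| self.subtree_size(child))
                .sum::<u64>()
    }
}

fn main() {
    let tree = Tree {
        nodes: vec![
            ArenaNode { path: PathBuf::from("dir"), size: 0, children: vec![1] },
            ArenaNode { path: PathBuf::from("dir/file"), size: 1024, children: vec![] },
        ],
    };
    println!("{} bytes", tree.subtree_size(0)); // 1024 bytes
}
```

Traversal of the arena still jumps between indices rather than scanning linearly, which is exactly the caveat the edits above are hedging on.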
The downsides of the algorithm
The detriment of parallelization is that we lose information about the structure of the filesystem and have to reconstruct it ourselves once we have all of the data from disk. It's a bit of extra work, but we weigh it against the cost of a single-threaded linear traversal that is constantly blocked on disk I/O in between the CPU-bound construction of the `Node`s. Parallelization ultimately resulted in better performance.
With regards to directories: because the filesystem doesn't store a directory's size, `erdtree` has to compute it manually with a full-depth tree traversal; on top of that, it has to do another full-depth traversal in order to generate the output. We're essentially traversing the tree twice over the lifetime of the program.
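For illustration, here's a rough sketch of what that size pass could look like over the in-memory tree, including the hard-link bookkeeping mentioned above. The types are simplified stand-ins, not `erdtree`'s actual implementation:

```rust
use std::collections::HashSet;

// Simplified stand-in for an arena node.
struct Node {
    size: u64,
    // (device, inode) pair for regular files, so a hard-linked file is
    // only counted once per run.
    inode: Option<(u64, u64)>,
    children: Vec<usize>, // arena indices
}

// Full-depth traversal that sums sizes under `idx`, skipping inodes that
// have already been accounted for elsewhere in the tree.
fn dir_size(nodes: &[Node], idx: usize, seen: &mut HashSet<(u64, u64)>) -> u64 {
    let node = &nodes[idx];
    let own = match node.inode {
        Some(key) => {
            if seen.insert(key) { node.size } else { 0 } // extra hard link counts as 0
        }
        None => node.size, // directories, or entries without inode info
    };
    own + node
        .children
        .iter()
        .map(|&child| dir_size(nodes, child, seen))
        .sum::<u64>()
}

fn main() {
    // A directory containing two entries that are hard links to the same 4 KiB file.
    let nodes = vec![
        Node { size: 0, inode: None, children: vec![1, 2] },
        Node { size: 4096, inode: Some((1, 42)), children: vec![] },
        Node { size: 4096, inode: Some((1, 42)), children: vec![] },
    ];
    let mut seen = HashSet::new();
    println!("{} bytes", dir_size(&nodes, 0, &mut seen)); // 4096, not 8192
}
```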
Fundamental differences with byenow
`byenow` doesn't have to worry about computing directory sizes. It just needs to get data from disk, record its size, increment the total, and move on to the next entry. It's a very straightforward, linear task. So notwithstanding all of the additional processing `erdtree` needs to do in order to, say, filter, look pretty, and generate an output, I'd be willing to bet that at the I/O level the two programs are pretty comparable in terms of performance. Happy to be convinced otherwise.
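For contrast, the whole of a `byenow`-style size pass boils down to something like the following (single-threaded here for brevity, and a guess at the shape of the work rather than `byenow`'s actual code, which isn't fully public):

```rust
use std::fs;
use std::io;
use std::path::Path;

// Walk a directory and accumulate a single running total: no tree, no
// per-directory breakdown, nothing retained but one u64.
fn total_size(dir: &Path) -> io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let md = entry.metadata()?;
        total += if md.is_dir() {
            total_size(&entry.path())?
        } else {
            md.len()
        };
    }
    Ok(total)
}

fn main() -> io::Result<()> {
    println!("{} bytes", total_size(Path::new("."))?);
    Ok(())
}
```

`byenow` parallelizes the same idea, but the key point stands: nothing needs to survive the walk beyond a running total.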
How `erdtree` performance can be improved
I'm not sure at the moment but I'm constantly thinking through several ideas in my free-time :D
Thanks for writing this up btw! Let me know if you have any further thoughts or questions.
Hey, what an awesome response, thank you! It makes much more sense to me now; you definitely made it clear how much less complex the work is for what byenow does during its size calc. Really appreciate that.
I guess I'm realizing that my use case of wanting to fetch the total size of a directory (while not needing to know anything about the hierarchy of its contents or any other details) is a case where it's hard to avoid doing more work than needed without a special implementation, at least for disk usage tools like erdtree which primarily focus on providing more file/filetree detail.
Also thanks for giving me a few more rabbit holes to learn about with links and all 👍
I'm curious whether there are optimizations for cases where a max depth limit is specified. Does a simpler, more linear approach to the size calculation kick in for the subdirectories below that limit, since their contents' details don't need to be stored or recalled?
Maybe that wouldn't make sense to do or wouldn't provide any efficiency, and maybe it's already a thing. I definitely don't have enough depth (no pun intended) of understanding of the matter and the perf tradeoffs involved, but it came to mind while thinking abstractly about the problem. I'm sure that if an optimization like that were easy and sensible, it would probably already be incorporated.
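Something like this is roughly what I'm imagining, purely as a sketch of the idea; I have no clue whether it fits erdtree's actual architecture, and the helper names here are made up:

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Entries that are actually kept for display (only down to the depth limit).
struct Entry {
    path: PathBuf,
    size: u64,
}

// Recurse normally down to `max_depth`, collecting displayable entries;
// below that, fall back to `sum_only`, which retains nothing but a total.
fn collect(dir: &Path, depth: usize, max_depth: usize, out: &mut Vec<Entry>) -> io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let md = entry.metadata()?;
        let size = if md.is_dir() {
            if depth < max_depth {
                collect(&entry.path(), depth + 1, max_depth, out)?
            } else {
                sum_only(&entry.path())?
            }
        } else {
            md.len()
        };
        if depth < max_depth {
            out.push(Entry { path: entry.path(), size });
        }
        total += size;
    }
    Ok(total)
}

// The "simpler, more linear" path: just accumulate, no bookkeeping.
fn sum_only(dir: &Path) -> io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let md = entry.metadata()?;
        total += if md.is_dir() { sum_only(&entry.path())? } else { md.len() };
    }
    Ok(total)
}

fn main() -> io::Result<()> {
    let mut shown = Vec::new();
    let total = collect(Path::new("."), 0, 1, &mut shown)?;
    for e in &shown {
        println!("{:>12}  {}", e.size, e.path.display());
    }
    println!("{:>12}  total", total);
    Ok(())
}
```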
It just seemed a bit odd to me at first that most disk usage tools don't seem to implement some kind of more optimized technique specifically tailored to summarizing a single directory's total size (i.e. specifying `--level 0` or similar), except maybe gdu (which at least anecdotally seems faster in this specific case compared to related tools, especially for directories with huge file counts or nestedness).
But now I at least understand more about the difference in motivations/possibilities, so that mostly satisfies my curiosity on the matter for now. Sorry for the rants; I just really wanted to stop using a tool designed to permanently delete my files just to find out how big a folder is. I know I'm often just one missing command flag away (byenow's `--preview`) from accidentally destroying my filesystem, but I just can't resist the speed :P