Consider buffered algorithm to handle recursion more gracefully
DavidJFelix opened this issue · 10 comments
Currently, exa does a depth first tree traversal when operating in recursive/tree mode and waits until full recursion is completed before printing anything. This algorithm can have a negative impact on performance both in terms of runtime and memory use. For larger trees (exa / -lTL4
) there can be a substantial wait and memory impact will be proportional to total files.
I suggest instead that exa use a hybrid of breadth and depth that will give it enough information so that it can begin printing, then buffer that information to output before clearing the memory. For example: suppose we have a standard linux tree and we run exa / -lTL4
. The first thing we should do is create the root of a tree, /
then gather the /
level so that we can begin adding those nodes. We'll get bin boot cdrom dev etc home lib media mnt ...
. We sort this level (by either time or name) then place it in the tree. We then move to the most highly sorted item in that level. For alphabetic sort, this is /bin
. As soon as we move the recursion down to /bin
we will also format a line for bin and pass it to a print queue.
The print queue should be checked and printed to asynchronously within processing to ensure that as soon as output is available at some interval, we initiate printing it. This will improve performance for remote clients, reduce in-memory footprint of exa, and improve the perceived performance by the user. This asynchronous printing could be done by another thread if you wanted, but I think that just convolutes the solution. As long as the printing is initiated somewhere before the program completes it's good. In all honesty, a queue is a very high level abstraction for this. You could literally print each line as it becomes available to print if you wanted.
So, onto /bin
- /bin
has no nested directories, when we gather this level, we end up with bash bunzip busyybox ...
etc. We begin walking this directory and discover bash
is not directory, so we add it to the print queue, which I mentioned above. Since we're now done with bash
we can pop it from the tree -- this is where we begin to see our memory savings. We then do the same for bunzip
, busybox
etc until all of /bin
is done. Then we start this process all over again on /boot
.
The idea behind the tree-traversal model is that you can know the depth (and thus, how to print it) so if you visit nodes in the order that they will be printed, you can immediately print them and then allow the program to forget about them.
Let me know what you think about this. I've been trying to work my way around the code and see how this might be implemented, but I'm still pretty green when it comes to rust. :)
So for some reason I had it in my head that I'd already replied to this when I assigned it to myself. Looking back, that is not the case! So sorry for the late reply to this.
For larger trees (
exa / -lTL4
) there can be a substantial wait and memory impact will be proportional to total files.
Heh, believe me, I'm aware! When I was testing the --tree
feature I had it tree my entire home directory, and it usually just sat there for a nail-biting twenty seconds or so before I saw any output.
Unfortunately, there's one thing that's standing in the way of having nice fluent output, and that's that exa has to format its output into a table before it's able to output anything. You're right that this could be fixed for just the --tree
option, but the --tree --long
combination requires a lot more processing.
The --long
output format lists by default the permission bits, the size, the user, and the date modified. It then scans through each row and column of the output in order to find each column's maximum width, and only then can it actually start printing any output with each column padded with spaces to make sure it's all in a grid.
For example, when I run exa -l
, almost all the files I see have the user ben
. But it's technically possible for me to create a user called aoesnudaoeugiaeonduia
or something, give them a file deep inside /usr
, and try to tree through that directory recursively. It'll only find out about that user once it's traversed many directories already, which means it wouldn't be able to print anything before that.
My (probably incomplete) list of partial solutions to this is, so far:
- Print out everything as normal, and expand the column width to the maximum every time it's hit. This might look weird, though...
- Try to guess what the maximum size will be for each column. Certain ones (such as permissions, or dates, or file sizes) can have a clearly-defined maximum. Others (such as usernames or group names) less so.
- Stop trying to pad anything at all, and instead use tab, or something even stranger like the ASCII group separator character (!)
- Truncate too-long entries with an ellipsis?
In any case, you have correctly identified that the --tree
output format uses the same mechanism as --long
, even though it doesn't need to. So that can definitely use an improvement such as this, with no drawbacks.
Also, a specific reply as to your implementation:
As long as the printing is initiated somewhere before the program completes it's good. In all honesty, a queue is a very high level abstraction for this. You could literally print each line as it becomes available to print if you wanted.
Right now each file is queried in parallel (at least with --long
it is), and the results get sent over a channel only to be sorted afterwards. So part of this queue is implemented already!
The reason it has to be sorted afterwards is that reading the data about the files isn't guaranteed to be done sequentially, and there's a chance that certain calls will take more time than others, and the results will be sent down the channel out-of-order. So sorting has to be done afterwards, which means there's no point doing it beforehand as well. The joys of parallel programming...
Hmm some good points, I might respond a bit asynchronously as I come up with solutions, so sorry in advanced...
By using human readable sizes you could force the file size to always fit reasonably in a fixed width size. With 6 characters, you could do this.
1B
1KB
1.1KB
10.1KB
100KB
1.10MB
I think 6 is the minimum, but bumping up characters you could get more granularity (a flag perhaps!?)
I think the next peculiarity is the username/group. You can get a listing of all groups and users before-hand by looking at the administrative database with getent
. By getting these ahead of time, that column could be fixed to the max size. This might look weird for people with edge case usernames. It might make sense here to collect some getent
results from real people and test the average length and variance in length to determine how much space will be wasted. I'm fairly against the idea of truncating, but it might be possible to do something clever like determine the minimum unique length and cut off everything 3 longer than that (to allow for ellipses). This conversion might take this:
jim
bobob
bobocobnob
trollolololol
to:
jim
bobob
boboc...
troll...
Which would still allow them to be uniquely identified, but might keep them shorter. Again, I think this solution may be in need of real data to test how often it triggers.
I think file name is the easiest one. When wrapping, simply include in indent to the next line so the line easily stands out as a continuation rather than the next entry.
Some other considerations
It may be possible to emulate the docker cli, which has far too many columns to fit in normal console windows, so it intentionally staggers them onto the next line for easy parsing.
Another option might be to keep all fields that you can control at the left while moving the variable length fields to the right. You'll end up with 2-3 fields on the right floating around a bit, but with proper coloration it may still be readable without being aligned perfectly. If a user needs to see a better alignment, it might be wise to jump out of tree mode.
Just a thought. Thanks for the response!
Hmm some good points, I might respond a bit asynchronously as I come up with solutions, so sorry in advanced...
Don't worry, I do this, like, all the time.
I'm fairly against the idea of truncating, but it might be possible to do something clever like determine the minimum unique length and cut off everything 3 longer than that (to allow for ellipses).
I wasn't sold on truncating too, but I get what you mean. The annoying thing is that the only times you'll want to specifically see the names are when you're trying to track down some kind of "why can't I write to this file" issue, and it'd be essential to not hide any file information from the user in times like that. My computer has a bunch of users with names like _nsurlstoraged
or _nsurlsessiond
that I have absolutely no idea what they do. Currently the rust-users crate doesn't allow you to get a list of all users, but that's my crate so I only have myself to blame for that! I'm also not sure if there's a standard convention of "users that begin with underscores are system users" that we could follow.
I think file name is the easiest one.
It is, because the filename column doesn't have any sizing/padding done to it: it'll always be on the right, and never gets padded!
It may be possible to emulate the docker cli, which has far too many columns to fit in normal console windows, so it intentionally staggers them onto the next line for easy parsing.
Ooooh, could you find a screenshot of this? I've never used Docker so I'm not sure what I'm meant to be looking for! exa already allows you to jump out of tree mode and list every folder with separate column withs (--long --recurse
).
Looking at the docker output now. Its an interesting case study, but I'm not sure how helpful it is. All of their columns have defined row sizes. They truncate the variable rows. I'll get a screen shot in a bit... gotta scrub the image first.
exa -lRTL2
drwxrwxr-x - 12 Oct 14:53 .
drwxrwxr-x - 12 Oct 14:53 ├─[ bob ]─ build
.rw-rw-r-- 1.6k 12 Oct 14:53 │ ├─[davidjfelix]─ checkstyle.cachefile
drwxrwxr-x - 12 Oct 14:53 │ ├─[ bob ]─ classes
drwxrwxr-x - 12 Oct 14:53 │ ├─[ tom ]─ dependency-cache
drwxrwxr-x - 12 Oct 14:53 │ └─[davidjfelix]─ libs
.rw-rw-r-- 2.4k 12 Oct 14:51 └─[davidjfelixwww ]─ checkstyle-suppressions.xml
Here's a potential mockup. The idea behind it is that the user is specified as part of the tree "wire". The width of the block is determined per-level. Each level is completed before beginning to print it.
I think I see how things are aligned there: every time we go into a new sublevel, we get a new set of column widths, and when we go out of that level back to the original, we revert back to the previous set?
Yes. It's a little odd to follow from the right, since the user/group name length fluctuates by level, but I think it should still be fairly easy to follow from the left where the tree is drawn. It's also important to note that the tree is (assuming black background) white text while the user or group is orange, so it should be easy to tell when nesting begins.
/etc/passwd
does not contain all users on OSX, per discussion about pre-allocating space.
Hi again! It’s been a while, but I did end up working on exa some more. The branch I just merged in doesn’t give exa streaming capabilities, but it does land two nice features:
- The link between the table and the
--long
view has been reversed. Previously, if you just wanted a tree, it still displayed a “table” of zero columns on the left, asked the computer for the time zone and locale information, ignore dthem, and then followed a bunch of special-case rules to only display the tree. Now, if you don’t want a table, you won’t pay for one. - I removed a bunch of places in the code where output was being buffered for no reason. exa used to collect each line in the table into a vector, and only start printing when all the lines had been rendered. Now it just prints them one-by-one, without the intermediate step.
So while it’s not streaming yet, it’s certainly faster, and the code improvements mean that implementing the hybrid breadth-depth search that you originally talked about is now a possibility.
Awesome. I'll update my provisioning scripts and check it out with my coworkers who are loving exa. Thanks!