shundhammer/qdirstat

Unreasonably high RAM usage

FluffyDiscord opened this issue · 26 comments

I left it scanning my 28 TB RAID 0 (2x 14 TB drives) for over two days. It still had not finished calculating, and by that time it was taking 9 GB of my system RAM, so I had to kill it. This usage seems unreasonable to me because it was eating 1/3 of my total system RAM (32 GB).

Is this intentional or a bug? What are my other options?

What memory usage is reasonable depends on the number of files on those 28 TB. Every file and every directory consumes a little bit of RAM in QDirStat, but it's far from "unreasonable". If you have a gazillion files, you'll need n gazillion bytes of RAM, n being the number of bytes needed per filesystem object.

A plain file is represented in QDirStat by this:

https://github.com/shundhammer/qdirstat/blob/master/src/FileInfo.h#L948-L967

sizeof( FileInfo ) says 128 bytes. The only non-primitive data type in that class is the _name member, which is a QString; it stores each character of the file's name in 2 bytes (Unicode). So a file named /usr/something/foobar consumes strlen( "foobar" ) * 2 = 12 bytes plus 8 bytes for string management (current string length, currently allocated string length).

So that /usr/something/foobar file would consume 128 + 20 = 148 bytes.

A directory is represented by this:

https://github.com/shundhammer/qdirstat/blob/master/src/DirInfo.h#L593-L628

That's 264 bytes + the string length of the name. But on a typical Linux system, there are 10 times as many files as there are directories, so the impact of directories on the overall memory footprint isn't as bad as this may appear.

On average, each filesystem object will consume some 160 bytes in QDirStat. That's not at all unreasonable.
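For anyone who wants to redo that arithmetic with their own numbers, here is a minimal, self-contained sketch (plain C++, not QDirStat code; the constants are simply the figures quoted above):

    // Back-of-the-envelope estimate of the per-file memory footprint,
    // using the figures from this discussion (not actual QDirStat code).
    #include <cstddef>
    #include <cstdio>
    #include <cstring>

    const size_t FILE_NODE_SIZE = 128;  // sizeof( FileInfo ) as quoted above
    const size_t QSTRING_MGMT   =   8;  // current length + allocated length

    // Estimated bytes for one plain file with the given name (not the full path).
    size_t fileFootprint( const char * name )
    {
        return FILE_NODE_SIZE + QSTRING_MGMT + 2 * strlen( name );  // 2 bytes per character
    }

    int main()
    {
        printf( "foobar: %zu bytes\n", fileFootprint( "foobar" ) );  // 128 + 8 + 12 = 148
        return 0;
    }

A directory would use the 264-byte DirInfo size instead of 128 for the fixed part, as mentioned above.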

Everything else that QDirStat uses in terms of memory footprint is negligible in comparison to the vast number of filesystem objects that it needs to keep in memory.

A scan of my entire openSUSE Leap 15.4 root directory with roughly half a million files and directories (including 47,000 directories) makes QDirStat consume 197.6 MB of RSS. An idle QDirStat (not reading any directory) consumes 74.6 MB of RSS, so we can safely assume that those 74.6 MB are the constant memory footprint of all the Qt GUI stuff that QDirStat uses.

That leaves 123 MB for those ~500,000 filesystem objects in the root directory plus assorted management data structures; roughly 258 bytes per filesystem object. That's not exactly the predicted 160 bytes, but it's in the same ballpark. I'd expect the fixed cost of those other data structures to diminish as the file tree gets larger.

Thank you for the in-depth explanation. Would it be possible to add an option to store the already indexed files/directories in a temp file/database (SQLite perhaps) and not hold them in RAM? I don't have the luxury of buying 64 or 128 GB of RAM just to see what is eating my HDD space.

So, that does not seem unreasonable at all. Maybe it's your expectations that are unreasonable. QDirStat needs to keep the information about all filesystem objects in memory to do any meaningful operations.

28 TB can contain a lot of stuff. It really depends on your usage pattern: Do you have vast numbers of tiny files? Or do they tend to be fewer, but large files?

It is also entirely possible that there is an endless loop in your directory tree. That shouldn't really happen; Linux depends on a filesystem tree being a tree, not a generic graph. Also, it is not completely impossible (albeit unlikely) that some weird directory name causes an endless loop (an endless recursion) while reading directories; that has happened before (but it should be fixed now). Since the low-level system calls that read directories only know about bytes, not characters in any encoding, some conversion from (assumed) UTF-8 to Unicode and/or back might inadvertently cut off part of a directory name, resulting in a name that was already read, so that subtree gets read again and again and again. But that is hard to track down, especially in very deep directory trees.

A temp file or an SQLite database is not an option here. The data need to be immediately accessible for various things, starting with the two main visualizations, the tree view and the treemap, including the coloring of the MIME categories. It continues with the more advanced visualizations like the file size distribution histogram, the file type view, and the file age view.

QDirStat lives and breathes with having the data immediately accessible in memory. It's not just a report tool that keeps adding totals and subtotals where the detailed data are no longer needed after doing the addition.

I have one directory with a distributed file system of around 5.5 million files with UUID v4 as their names, and that should take a known amount of around 11 TB. My current total usage is 26 TB though; that is why I would like to check what is going on there - which other applications are eating my storage.

So, please observe what QDirStat does while it scans directories. If you see the number of pending read jobs in any subdirectory always increasing, never decreasing, you might have one of those rare endless recursion problems.

Since the number of toplevel directories on that filesystem probably isn't that huge, you might try reading only one of them at a time, and observe the memory usage.

Oh, the read jobs were ever increasing; they started low, but the last count I noticed was around 200k.

That sounds like just this kind of problem.

Sometimes, bind mounts can cause an endless recursion. That would only matter if QDirStat didn't detect them as mounts, or if you intentionally made it keep reading on mounted filesystems. Can you exclude that possibility?

Right now, I can't deny that there might be a loop - I can't tell. I will run another scan, picking a different folder. I am running an Unraid server, and I previously picked the /storage folder, which listed all my drives (as I wanted), so I assumed there would be no issue.

Just to be sure - how exactly should the "read jobs" behave? For example, should they ramp up to 1000 and stay there for the rest of the scan, slowly add up towards infinity, or stay low?

As the directory tree is traversed, the directory that is currently being read is read in its entirety on that level, and for every subdirectory encountered, a read job is added to the read job queue; it does not descend immediately into any subdirectory. When the directory is finished, the next job is picked from the queue, and the process repeats: That directory is read completely, and its subdirectories are added to the queue.

So it's kind of hard to say how it should behave; it looks erratic from the outside, but it really is not. If there is a very deeply nested directory tree, it might legitimately happen that many jobs keep being added to the read queue for a long while.
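For illustration, here is a simplified, standalone sketch of such a breadth-first read queue. It uses std::filesystem rather than QDirStat's own read jobs, so it only shows the principle, not the actual implementation (which also handles mount points, exclude rules, and so on):

    // Simplified illustration of the read-job queue described above;
    // NOT the actual QDirStat code.
    #include <filesystem>
    #include <queue>
    #include <system_error>

    namespace fs = std::filesystem;

    void scan( const fs::path & toplevel )
    {
        std::queue<fs::path> readJobs;   // the "pending read jobs"
        readJobs.push( toplevel );

        while ( ! readJobs.empty() )
        {
            fs::path dir = readJobs.front();
            readJobs.pop();

            // Read this directory in its entirety; each subdirectory becomes a
            // new read job instead of being descended into immediately.
            std::error_code ec;

            for ( const fs::directory_entry & entry : fs::directory_iterator( dir, ec ) )
            {
                if ( entry.is_directory( ec ) && ! entry.is_symlink( ec ) )
                    readJobs.push( entry.path() );
                // A plain file would be accounted for here (size, mtime, ...).
            }
        }
    }

    int main( int argc, char ** argv )
    {
        scan( argc > 1 ? argv[ 1 ] : "." );
        return 0;
    }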

But there are probably some commented-out logging lines in the code that can give more insight into what's going on, so following QDirStat's log might help to narrow down the problem. Let me check.

If you uncomment this line, you will get logging about every directory that it starts reading:

https://github.com/shundhammer/qdirstat/blob/master/src/DirReadJob.cpp#L224C31-L224C31

    // logDebug() << _dir << endl;

to

    logDebug() << _dir << endl;

(and then of course recompile and install and run that version)

You could also check if you have a very deeply nested subdirectory tree with

tree -d

You might want to redirect the output to a file for easier inspection; the output may scroll by very quickly.

tree -d >/tmp/dir-tree

(you might have to install the package first; on openSUSE Leap 15.4, it's package tree).

If that also doesn't stop, you clearly have a loop in your directory tree. Which should not happen, of course, but who knows.

When I was about to write that Linux limits the maximum depth of directory nesting, and that there is a predefined constant PATH_MAX for how long the total path of any filesystem object can be, I came across this (again):

https://insanecoding.blogspot.com/2007/11/pathmax-simply-isnt.html

...which basically says that there really is no limit. I didn't test that myself, but it sounds plausible.

I am really interested in the result of that nesting depth check.

Maybe it's time to add a maximum tree depth to QDirStat as a kind of "emergency brake" when a directory hierarchy goes wild.

It could open a warning pop-up while reading if the depth is beyond a certain threshold value (200? 1000?) and ask the user if it should continue to go even deeper, and maybe stop regardless at an absolute maximum.

The problem will be what to use as a reasonable default. Of course those values need to be configurable; not sure if they should be in the GUI config dialog or only in the config file for hard-core nerds.
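As a rough sketch of what such an emergency brake could look like (purely hypothetical code, nothing like this exists in QDirStat yet; the thresholds are the values mentioned above and would come from the configuration):

    // Hypothetical "emergency brake" for runaway directory depth.
    constexpr int WARN_DEPTH = 200;   // ask the user whether to continue deeper
    constexpr int MAX_DEPTH  = 1000;  // absolute maximum: stop regardless

    // 'depth' is the nesting level below the toplevel directory of this scan;
    // 'userConfirmedDeepScan' is the answer from the warning pop-up.
    bool shouldDescend( int depth, bool userConfirmedDeepScan )
    {
        if ( depth >= MAX_DEPTH )
            return false;                   // emergency brake: stop regardless

        if ( depth >= WARN_DEPTH )
            return userConfirmedDeepScan;   // continue only if the user said so

        return true;
    }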

But first I'd like to hear if that actually is the problem in your case.

I let the

tree -d >/tmp/dir-tree

run for roughly one day, then killed it since it still had not finished. I could not spot any duplicates. The average folder depth was also around 5-7 levels.

I also had QDirStat scan a different part of the storage (a single mounted Unraid "share" - two HDDs in a RAID 0-like scenario - instead of the whole /storage that QDirStat showed last time). It's been running for 29 hours, and the read jobs seem steady. The RAM usage is now at 6 GB. Here is a screenshot of the current state. It has yet to finish scanning, as there should be around twice the currently reported total amount - 16 TB.

Do not be alarmed by the subdirectory count of the 200k one; it should be around 350k in total.
[screenshot of the current scan state]

After 60 hours it got to 11 TB scanned and all my RAM got eaten; unfortunately I have to use a different tool. Thank you for your time and the insights into this package.

In your screenshot I see a grand total of about 26 million items (files and directories). With a RAM usage of about 260 bytes for each one, as we saw earlier in this thread, that should consume a total of about 6.2 GB, which is in line with what you observed.

But we can also see that there is one directory that has the vast majority of items (18 million) yet consumes just 4.3 TB, and it has 229,653 subdirectories, which sounds excessive. That might be the start of a runaway directory hierarchy: It might read the same subtree recursively again and again.

That one can indeed have that many subdirectories. The files there are evenly distributed using a hash-based directory structure. There are no hard links that I am aware of (I made that storage/system myself).

That sounds like a BackupPC archive. It stores the real backup files in an evenly distributed hash-based pool and then maintains a separate tree for each backup client that references the hashes. This is how it accomplishes de-duplication to optimize the storage.
https://github.com/backuppc/backuppc/

Nothing fancy as that, the one I made is real stupid and simple :)

I added some debugging code to check if maybe the QString used for the name in the FileInfo object reserves more capacity than is actually needed, but it turned out that it doesn't: QString has a very smart handling for that.
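The check itself is easy to reproduce; something along these lines (not the actual debugging code that was used, just the idea):

    // Compare how many characters a QString reserves vs. what it actually stores.
    #include <QString>
    #include <QDebug>

    int main()
    {
        QString name = QString::fromUtf8( "20230702-1801-dsc4711.jpg" );
        qDebug() << "size:" << name.size()           // characters actually stored
                 << "capacity:" << name.capacity();  // characters allocated for
        return 0;
    }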

Some more math...

In that 4.3 TB directory with 18.3e6 items, we get an average size of 250k.
In the 1.4 TB directory with 7.15e6 items, it's an average size of 210k.

Extrapolating this to that whole 28 TB RAID0, with an average 210k size, we can expect around 143e6 items. With an expected RAM usage of 260 bytes per item, the total memory footprint should be around 36 GB.
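The same extrapolation as a tiny standalone program, for anyone who wants to plug in their own numbers (plain C++, not QDirStat code; depending on rounding and on whether TB means 10^12 or 2^40 bytes, the result lands in the 35-37 GB range):

    // Back-of-the-envelope extrapolation from this thread's rough figures.
    #include <cstdio>

    int main()
    {
        const double raidSize    = 28e12;   // 28 TB RAID 0
        const double avgFileSize = 210e3;   // ~210k average file size (see above)
        const double ramPerItem  = 260.0;   // ~260 bytes of RAM per filesystem object

        const double expectedItems = raidSize / avgFileSize;      // ~133 million
        const double expectedRam   = expectedItems * ramPerItem;  // ~35 GB

        printf( "Expected items: ~%.0f million\n", expectedItems / 1e6 );
        printf( "Expected RAM:   ~%.1f GB\n",      expectedRam   / 1e9 );

        return 0;
    }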

While that is more than your 32 GB physical RAM, it's not worlds apart. Allowing for some more RAM for the OS and other processes, a quick solution could be adding 8 GB swap space; either as a swap partition or as a swap file (or several swap files).

Idea: Skeleton Mode

Having said that, I have a vague idea in my head: From a certain number of total files on, QDirStat could change to a "skeleton" mode where it doesn't keep detailed information for each individual file in memory, only a summary below each directory node. As long as there are considerably more files than there are directories, that should support much larger directory trees:

Rather than

  • work
    • photos
      • trip-2023-07-02
        • 20230702-1801-dsc4711.jpg
        • 20230702-1803-dsc4712.jpg
        • 20230702-1807-dsc4713.jpg
        • ...
        • ...
        • 20230704-1850-dsc4787.jpg

it would be

  • work
    • photos
      • trip-2023-07-02
        • 53 files

That would save 52 of those 53 nodes in this case.

Why an additional node, and not simply add this to the directory node directly?

For one thing, that would clash in a gazillion places with the existing code; it would require a lot of changes and introduce a ton of new bugs. For another, it would allow for the concept of "fat children": still keeping dedicated nodes for a few extraordinarily large files, because those might be especially interesting. For example, that DVD ISO that was downloaded a long time ago and then forgotten.

  • work
    • photos
      • trip-2023-07-02
        • some-huge-download.iso
        • another-huge-blob.iso
        • 51 files

For the treemap, normal files would no longer be visible, because there is no detailed information for them anymore. It would just be a gray gradient, just like for directories with lots of tiny files. The small files would vanish in that gray anonymity.

But if the information is still available, those "fat children" would still stick out: You'd still see that ISO as a large blob in the mass of gray.

That leaves details such as the threshold from which a file is considered large enough to deserve its own node in RAM, but that is a solvable problem.
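To make the idea a bit more concrete, here is a minimal, purely hypothetical sketch of what such a summary node and the "fat child" decision might look like (invented names; nothing like this exists in QDirStat today):

    // Hypothetical sketch only; not QDirStat code. In "skeleton mode", all small
    // files of a directory would be folded into one summary object, while
    // extraordinarily large files ("fat children") keep their own node.

    typedef long long FileSize;   // stand-in for QDirStat's file size type

    class FileSummary             // invented name: the "53 files" node
    {
    public:
        void add( FileSize size ) { _totalSize += size; ++_count; }

        int      count()     const { return _count;     }
        FileSize totalSize() const { return _totalSize; }

    private:
        int      _count     = 0;  // number of files folded into this summary
        FileSize _totalSize = 0;  // their combined size for the subtree totals
    };

    // Hypothetical decision while reading a directory:
    //
    //     if ( fileSize >= fatChildThreshold )     // e.g. 50 MB, configurable
    //         createDedicatedNode( parent, name, fileSize );
    //     else
    //         parent->summary()->add( fileSize );  // vanishes into the summary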

Poll

But for starters, I am interested how common this problem really is. I'd hate to invest a lot of work for a one-trick pony.

As disk sizes keep growing (and RAM sizes have probably reached a saturation point around 16 or 32 GB), is this a thing? Did many users find themselves limited by RAM when using QDirStat? If you also have this problem, please add a thumbs-up to this comment or leave a comment describing your scenario.

Edit: Closed. Nobody was interested.

Two weeks later: Not a single response. Okay, if nobody is interested, then neither am I.

Closing.

I just downloaded QDirStat and I like it a lot! I figured I'd chip in on this issue.

I have 14 million files under my home directory which I've been building up for around 10 years. I suspect I have significantly more files than the average person, since I use my computer a lot, I have several large code repos cloned into folders, and I have some large websites archived as loose files (5.5 million files just from the website archives).

QDirStat is using 4.0 GB of RAM, which is more than the other programs on my computer, but looking at the big picture, I have 16 GB of memory installed in my computer, so it's only a quarter. Considering what it's doing, it looks like it's very memory efficient - thank you!!

Running out of RAM from the file tree is probably a really rare situation, so I wouldn't be too worried about it :)

Glad to hear at least some feedback. Thank you!

14 million files using about 4 GB, that's about 300 bytes of RAM per file (taking ~73 MB of idle usage into account); that sounds about right. So you could go as high as about 3.5 times as many files without having to resort to swap space, i.e. about 50 million.