shundhammer/qdirstat

Multi-hardlink file sizes look incorrect on CentOS 6.5

DennisFlanagan opened this issue · 29 comments

I built QDirStat on CentOS 6.5 successfully and it seems to run properly. But the sizes reported are far too low. I'm analyzing a massive directory tree: 1/4 TB over 1 M files. The behavior suggests integer overflow. Has anybody experienced such behavior?

No reports about anything similar so far. QDirStat uses long long for file sizes, so an int overflow should really not happen anywhere ever. And 1/4 TB isn't really that much anyway these days.

What do other tools report? Did you try du -hs on that same directory tree?

When you generate a cache file with the qdirstat-cache-writer script and load that cache file in QDirStat, do you get the same result?

Finally, you are aware that QDirStat uses 1024-based units, right?
So 1 kB = 1024 Bytes; 1 MB = 1024 kB; 1 GB = 1024 MB; 1 TB = 1024 GB;
See also the exact byte size that is shown upon mouse click with the latest version 1.6.1 or git master.

https://github.com/shundhammer/qdirstat/blob/master/src/FileInfo.h#L29

So it's up to 9 223 372 036 854 775 807 Bytes = 8 388 608 TB.
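As a quick check of the arithmetic above, here is a minimal Python sketch. It is not QDirStat code: `format_size` is a hypothetical helper, and the only assumptions are a signed 64-bit size field and the 1024-based units listed earlier.

```python
# 1024-based units, as described above (1 kB = 1024 bytes, and so on).
KB = 1024
MB = 1024 * KB
GB = 1024 * MB
TB = 1024 * GB

# Largest value of a signed 64-bit integer (a typical "long long").
MAX_SIZE = 2**63 - 1


def format_size(num_bytes: int) -> str:
    """Format a byte count using 1024-based units (hypothetical helper)."""
    for unit, factor in (("TB", TB), ("GB", GB), ("MB", MB), ("kB", KB)):
        if num_bytes >= factor:
            return f"{num_bytes / factor:.1f} {unit}"
    return f"{num_bytes} Bytes"


print(MAX_SIZE)                # 9223372036854775807
print((MAX_SIZE + 1) // TB)    # 8388608, i.e. 2**23 TB
print(format_size(256 * GB))   # 256.0 GB
```

A quarter of a terabyte is many orders of magnitude below that limit, so an overflow of a 64-bit size would not explain totals that are too low.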

I'm running WinDirStat on the same directories, which is how I noticed the discrepancy. I ran du, ncdu, and even an ancient Perl script of my own to confirm the correct values. I am aware of the 2^10 units. The totals are not close -- too low by an order of magnitude. If I poke at individual files, the values are correct. I looked at a smaller directory tree, and the results were fine.

I'll try the cache file suggestion.

Any mount points in that directory tree? QDirStat does not scan mounted filesystems by default. You can easily check that in the log file.
What filesystem type is that? If it's Btrfs: Subvolumes should be scanned.

Also, please notice that QDirStat has a default exclude rule for directories named .snapshot because some very common backup tools use that (for hourly / daily / weekly etc. backups, so they really get in the way).

The directories I'm analyzing are actually on a NetApp filer. Excluding .snapshot is critical, and is certainly working correctly. Note that the item and file counts are correct -- only the total size is wrong.

The behavior gets even more bizarre with cache files. The values are much lower when I read the cache file. Even a smaller directory tree that shows a correct total shows a lower total when I read a cache file.

Perhaps the problem is the file system. I'll try a different file system.

The weird behavior happens on a local file system as well. I'll assume that something went wrong with my build. I'll try to get a pre-built download to work.

Please also notice that QDirStat takes into account

  • sparse files (it only adds up the really allocated size, not any "holes" in the file)

  • hard links (each hard link is accounted for size / number_of_links)
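The two rules above can be sketched in a few lines of Python. This is an illustration, not QDirStat's implementation: the function names are hypothetical, and the test for "sparse" (allocated bytes below the apparent size, with at least one block allocated) is an assumption. `st_blocks` is taken to be in 512-byte units, as Python documents it.

```python
import os


def accounted_size(st_size: int, st_blocks: int, st_nlink: int) -> float:
    """Size that one directory entry contributes under the two rules above.

    Assumption: a file is treated as sparse when its allocated bytes
    (st_blocks * 512) are non-zero but smaller than its apparent size.
    """
    allocated = st_blocks * 512
    is_sparse = 0 < allocated < st_size
    size = allocated if is_sparse else st_size
    # Each hard link accounts for an equal share of the file.
    return size / st_nlink


def accounted_size_of(path: str) -> float:
    """Apply accounted_size() to a real file via lstat()."""
    st = os.lstat(path)
    return accounted_size(st.st_size, st.st_blocks, st.st_nlink)


print(accounted_size(1000, 8, 1))        # ordinary file: 1000.0
print(accounted_size(1_000_000, 8, 1))   # sparse file, 4096 bytes allocated: 4096.0
print(accounted_size(3000, 8, 3))        # three hard links: 1000.0 per entry
```

With this accounting, a tree that contains only one of a file's ten links sees only a tenth of that file's size.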

Please have a look at the log for anything suspicious. Excluded or unreadable directories should show up there, as well as any sparse files encountered while reading the directory tree.

For hard links, you could check the generated cache file; any entry with a field links: xxx has multiple hard links.

If you can (if it doesn't contain anything confidential), attaching the log file here might help.

I am aware of the subtle effects of sparse files and hard links. Neither is at issue here.

I'm embarrassed to admit that I had not yet looked at the log file. I've attached what I could find. Could the missing commands cause these symptoms? It seems unlikely to me. This does not contain a full history. I suppose it is only for the last week or so. In any case, if you see any clue, I appreciate any guidance.

qdirstat.log

Until proven otherwise, I assume that my symptoms are the result of local system problems. For various painful reasons, we're forced to work with old OS versions. The process of building did not go smoothly, and I had to rely on others to install prerequisite packages for me. The behavior suggests that there is an incompatibility somewhere such that 64-bit values are getting chopped into a 32-bit space. I don't want to chew up your time with my configuration problems. I opened this issue in the hope that somebody might have run into similar trouble.

Never mind the "command not found" messages in the log: that is just QDirStat testing which package manager your system uses (since QDirStat can show you which package, if any, a file belongs to). It found "rpm" (which is expected for CentOS, of course) and not "dpkg" or "pacman". That's perfectly okay.

That program run in that log shows normal startup and shutdown behavior. I don't see anything problematic while reading that directory; but then it probably was only a small tree where this would be expected.

So I can only recommend that you have a close look at the log file for suspicious messages when reading larger directory trees. You may be able to drill down into subtrees with both QDirStat and a similar program side by side. You will find that a surprising number of programs do not take sparse files or hard links into account (if you have any of those), and many programs don't stop at filesystem boundaries.

Some of those newer package formats like FlatPak or Snap may also do creative things with bind-mounts which may confuse other programs. Be on the lookout for messages about mount points in the log.

If you find anything more concrete, feel free to reopen this.

I found the problem. I misinterpreted how QDirStat accounts for multiple hard links. But it seems that QDirStat does not write sizes to a cache file properly. I'll explain both issues, and then seek your advice on how to proceed.

I should preface my remarks by emphasizing that we use hard links heavily. A common pattern is a sequence of snapshots of a directory tree where the next snapshot just links to files that have not changed. When analyzing a directory, we care about the number of links to an inode within a snapshot, but we wish to ignore links from outside. We expect the behavior of du.

As you say, QDirStat simply divides the file size by the number of links. The number of links to a typical file in the tree that I am analyzing is 10-20, but each file is unique within a snapshot. This explains why the values are so far from what we expect.
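The gap between the two accounting styles can be reproduced with a small Python sketch. The `Entry` records and the numbers are made up for illustration. The "du-style" total counts each inode once within the scanned tree, which is how du treats hard links within a single run.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """One directory entry (hypothetical record, not a QDirStat type)."""
    path: str
    dev: int
    ino: int
    nlink: int
    size: int


def total_share_per_link(entries):
    """Sum size / nlink for every entry found in the tree."""
    return sum(e.size / e.nlink for e in entries)


def total_du_style(entries):
    """Count each (device, inode) pair once, ignoring links outside the tree."""
    seen = set()
    total = 0
    for e in entries:
        key = (e.dev, e.ino)
        if key not in seen:
            seen.add(key)
            total += e.size
    return total


# One snapshot: five files of 1 MB each, every file linked from ten
# snapshots, but unique within this snapshot.
snapshot = [
    Entry(f"snap-07/file{i}", dev=1, ino=100 + i, nlink=10, size=1_000_000)
    for i in range(5)
]

print(total_share_per_link(snapshot))  # 500000.0
print(total_du_style(snapshot))        # 5000000
```

With 10 to 20 links per file, the per-link total ends up roughly an order of magnitude below the du-style total, which matches the symptom reported here.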

The discrepancy finally sank in while I was reviewing the cache format in detail. I also discovered that QDirStat was writing the averaged size to the cache file. But when it reads the cache file, it appears to sum the average sizes. This explains why I see another order of magnitude drop in the file size sums.

Note that the qdirstat-cache-writer script writes the actual file size. I confirmed that loading the cache produced by the script produces the same result as opening the directory directly. But writing the cache and reading it back produces incorrect results.

If you concur that there is a bug in the write cache behavior, I'm willing to help address it. Please let me know how you want to proceed on this issue.

How do you advise that I proceed to get the behavior with hard links that we want? I could attempt to produce a derivative tool with different behavior. That seems imprudent. It seems preferable to derive a version of the cache writer script. But I want the multithreaded performance of QDirStat. So, my preference would be to produce a multithreaded scanner that generates a cache file. If I do that, would it make sense to integrate with the QDirStat code base?

I'll start by writing a cache generator. I already have an efficient, albeit single-threaded, scanner. I need to write a formatter. I will wait for any advice you are willing to offer before proceeding further.

I appreciate your patience and the assistance that you have generously offered.

So, we might have two different problems here:

  • Cache file writing might write the wrong size for files with multiple hard links. I haven't checked that yet. It is also important to notice that there are two cache file writers: The qdirstat-cache-writer Perl script and the CacheWriter C++ class that can be used from the File -> Write to Cache File menu item.

  • Files with multiple hard links where some of those hard links are outside the scanned directory tree might report their size in a way that is surprising / confusing / unexpected to some users.

    This is where the simplistic strategy of simply adding size / number_of_hard_links falls apart. That was always the limitation of that strategy; this might or might not be the expected thing, depending on the exact use case.

Related issue explaining it in more detail: #25

Since nobody ever asked for this, I never got myself to invest more effort to improve that hard link handling. But now that there appears to be a real use case, it might be time to take care of it.

How to report accumulated sizes for files with multiple hard links depends on the specific use case: To get a rough overview of a directory tree, the simplistic size / nlinks approach serves quite well. It is also fair if you want to get an impression of how much disk space your directory /usr/share/locale/foo adds to your system's overall footprint; if it shares many files with another similar directory, any disk space gained by deleting this subtree might be negligible or even close to zero.

If you want to make a backup of that directory, however, you need to know the exact size disregarding the other hard links: On the backup medium, each file will consume its full size regardless of hard links outside its subtree.

So here is what I suggest:

While scanning the directory tree, keep track of files with multiple hard links in a separate place; add a new special class inode that holds (unsurprisingly) the ino and the device major and minor numbers (and the device name and preferably the mount point for readability).

When a file with multiple hard links is found, check the inodes to see if it is already there, and if not, create one. ino, dev_major, dev_minor and of course nlinks are reported by the underlying lstat() syscall anyway. Add a pointer to the newly created FileInfo to that inode.

After scanning the directory tree is done, iterate over the inodes. Delete each one where the number of FileInfo pointers (i.e. files with that ino found while scanning the tree) is equal to nlinks: Those inodes are all completely inside the directory tree with all their hard links accounted for, so simply adding up size / nlinks is perfectly adequate for them.

If there are any inodes left over, those are the ones that have hard links outside of the scanned directory tree; those need special treatment.
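The bookkeeping proposed above could look roughly like the following Python sketch. It is only an illustration of the idea: `FileInfo` here is a hypothetical stand-in for the real C++ class, a single `dev` number replaces the major / minor pair, and the input list stands in for a real scan.

```python
from collections import defaultdict
from typing import NamedTuple


class FileInfo(NamedTuple):
    """Hypothetical stand-in for a scanned directory entry."""
    path: str
    dev: int
    ino: int
    nlink: int
    size: int


def find_out_of_tree_inodes(files):
    """Return {(dev, ino): [FileInfo, ...]} for inodes that still have
    hard links outside the scanned tree."""
    inodes = defaultdict(list)
    for f in files:
        if f.nlink > 1:                      # only multi-link files are tracked
            inodes[(f.dev, f.ino)].append(f)
    # Drop every inode whose links were all found inside the tree.
    return {
        key: found
        for key, found in inodes.items()
        if len(found) != found[0].nlink
    }


files = [
    FileInfo("foo/a", 8, 4711, 2, 1000),   # both links of inode 4711 are in the tree
    FileInfo("foo/b", 8, 4711, 2, 1000),
    FileInfo("foo/c", 8, 5822, 5, 2000),   # inode 5822: 2 of 5 links found
    FileInfo("foo/d", 8, 5822, 5, 2000),
    FileInfo("foo/e", 8, 6000, 1, 300),    # single link, never tracked
]

leftover = find_out_of_tree_inodes(files)
for (dev, ino), found in leftover.items():
    print(ino, len(found), found[0].nlink - len(found))  # 5822 2 3
```

Only inode 5822 remains, with three links that were not seen during the scan.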

Now there is the challenge of how to visualize those leftover / out of tree hardlinks.

Since they are out of the scanned and visualized directory tree, there is very little information available about them: We know the number of unaccounted hard links, the ino (which is not very useful for the user) and the other hard links (which might be useful for the user).

At the very minimum, we can display the unaccounted size: size * leftover_nlinks / nlinks. We can add up all those sizes. We can display a pseudo-item in the directory tree for it.

/usr/share/locale/foo       42 MB
    + <Files>               22 MB
    + bar/                  16 MB
    + baz/                   4 MB
<out of tree hardlinks>    220 MB

Notice that this pseudo item is outside that tree on the same tree level, so it is clear that it is not added to the sum by default. But you can simply select both the tree root /usr/share/locale/foo and <out of tree hardlinks> to get the complete sum.

The latter sum would be the one you want for making a backup of that tree, the former would be the relevant one if you consider what benefit you would get from deleting that subtree.
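The relation between the two sums can be illustrated with a small Python sketch. It rests on one reading of the proposal, which is an assumption: under size / nlinks accounting, the part of a file not yet counted in the tree is the share of the links that were not found there. The function names and numbers are hypothetical.

```python
def in_tree_share(size: int, nlink: int, found: int) -> float:
    """Part of the file counted inside the tree: one share per link found."""
    return size * found / nlink


def out_of_tree_share(size: int, nlink: int, found: int) -> float:
    """Part of the file attributed to links outside the scanned tree."""
    return size * (nlink - found) / nlink


# Hypothetical inode: a 10 MB file with 5 hard links, 2 of them in the tree.
size, nlink, found = 10_000_000, 5, 2

inside = in_tree_share(size, nlink, found)       # counted in the tree total
outside = out_of_tree_share(size, nlink, found)  # goes to the pseudo-item

print(inside)            # 4000000.0
print(outside)           # 6000000.0
print(inside + outside)  # 10000000.0
```

Adding both parts gives back the full file size once, which is the number that matters for a backup of the tree.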

This could go even further: That pseudo-item could be expandable to show more information. A user might be interested to see details about the hardlinked files.

<out of tree hardlinks>
+ <inode 4711>
   + /usr/share/locale/foo/LC_COLLATE/somestuff
   + /usr/share/locale/foo/LC_COLLATE/morestuff
   + (3 more outside of /usr/share/locale/foo)
+ <inode 5822>
  + ...

Let me think about this. I am beginning to warm up to the idea. 😃

I will explain my perspective and experience with this issue in greater detail soon. For now, please correct the related issue that you intended to reference. The link you set just points back to this issue.

> For now, please correct the related issue that you intended to reference. The link you set just points back to this issue.

Duh; copy & paste broken as usual on Linux desktops. 😢 Fixed.

I'll elaborate on my use case. The snapshot directories that I am analyzing belong to a continual distribution system. States are posted to a central repository that retains a broad, floating range of snapshots. There are a small number of producers and a large number of independent consumers. Each consumer refreshes its state on demand. The posting/distribution servers always run Linux, but the producers and consumers often run Windows. It is important that the hard link structure passes from producer to consumer accurately. This means that the actual link count in the inode is nearly moot to us. It changes continually and the value is effectively a metric of duration. A value of one is rare: it indicates either that the file was recently modified, or has reached the end of its shelf life. It is critical for us that only links within the directory tree scanned are considered.

I can advise you from experience that you are best served by refactoring your object model to match the Unix file system: a file is identified by index, and a link is a named reference to a file via its index. I tried to dance around this for years, but it only led to frustration. If you extend the file object to hold a visit count, you can visualize the effect of the hard links from a variety of perspectives, each of which has value. You can optimize processing for the typical case where the link and visit count are equal, although this is moot for my use case.

I now have a clear plan on how to proceed. I want to make QDirStat my principal static analysis tool. I hope that you will extend its capabilities to handle hard links better. But I can work with it as is.

My principal interface to QDirStat will be your cache format. The analogous format for our core distribution tool aligns well with your long format. It will be straightforward to produce either format. I'm going to take the opportunity to write a multi-thread scanner (a couple decades overdue), that runs on Windows as well as Linux (I'll do MacOS as well, since it is what I actually type on).

First, a small fix: The internal cache writer indeed wrote the wrong size to the cache file; it was the calculated result of size / links , not size itself. Since it also writes a links: field, it would do the division again when loading that file, thus displaying a size that was much too small.

This affected only the C++ class, not the qdirstat-cache-writer Perl script.
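The effect of that double division can be reproduced with a toy model in Python. The dictionaries are a made-up stand-in for a cache entry, not the real cache file format, and the function names are hypothetical. The only behavior taken from the description above is that the loader divides the stored size by the stored link count.

```python
def write_entry_buggy(size: int, links: int) -> dict:
    """Old behavior: store the already divided size together with links."""
    return {"size": size // links, "links": links}


def write_entry_fixed(size: int, links: int) -> dict:
    """Fixed behavior: store the real size together with links."""
    return {"size": size, "links": links}


def load_accounted_size(entry: dict) -> float:
    """The loader divides the stored size by the stored link count."""
    return entry["size"] / entry["links"]


size, links = 1_000_000, 10

print(load_accounted_size(write_entry_fixed(size, links)))  # 100000.0
print(load_accounted_size(write_entry_buggy(size, links)))  # 10000.0
```

With ten links per file, the round trip through the buggy writer loses another factor of ten, which is consistent with the extra order-of-magnitude drop reported earlier in this thread.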

> I'm going to take the opportunity to write a multi-thread scanner (a couple decades overdue), that runs on Windows as well as Linux (I'll do MacOS as well, since it is what I actually type on).

Contributions are welcome.

This would be an alternative to the existing very simple qdirstat-cache-writer which however would need to stay around:

  • As a matter of trust for admins who plan to use a script that everybody can understand on their production servers with sensitive data

  • With minimalistic dependencies for admins who don't want to install a number of CPAN modules or anything else on their servers for the sake of running such a script.

I am also not sure how much multiple threads will actually help; this is largely I/O bound, not CPU bound. But it is worth a try, of course.

Please check out and build this branch:

https://github.com/shundhammer/qdirstat/tree/huha-hard-links

This has a new config option IgnoreHardLinks in ~/.config/QDirStat/QDirStat.conf:

[DirectoryTree]
...
IgnoreHardLinks=false

Edit: A previous version had this in the settings dialog; you now have to edit this config file manually.

Normally, for files with multiple hard links, each directory entry found for that file accounts for its proportion of that size, i.e. size / number_of_links:

[Screenshot: normal-hard-links]

With that option "Ignore hard links" set, each directory entry accounts for the full size; if multiple entries for that file are found within the same directory, the size for that file is added up multiple times:

[Screenshot: ignore-hard-links]

Notice the different directory sum, the different size in the status line, and also the different sort order caused by the different result of each item's different size that it now reports.
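The difference between the two modes can be sketched in Python. This only models the accounting rule described above; `accounted_size` and its `ignore_hard_links` parameter are hypothetical names that mirror the config option, and the directory contents are made up.

```python
def accounted_size(size: int, nlink: int, ignore_hard_links: bool = False):
    """Size one directory entry contributes in either mode."""
    if ignore_hard_links:
        return size          # every entry counts with its full size
    return size / nlink      # default: proportional share per link


# Hypothetical directory: two entries for the same 1000-byte file
# (nlink = 2) plus one ordinary 500-byte file, given as (size, nlink).
directory = [(1000, 2), (1000, 2), (500, 1)]

normal = sum(accounted_size(s, n) for s, n in directory)
ignoring = sum(accounted_size(s, n, ignore_hard_links=True) for s, n in directory)

print(normal)    # 1500.0
print(ignoring)  # 2500
```

When both links of a file sit in the same directory, the second mode counts the file twice. When all other links are outside the scanned tree, it yields the du-like total.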

While that would be pretty counterproductive for the directory in this example (which I found on my Ubuntu 18.04 LTS at home), it might be just what you need for your use case.

Please build that branch and experiment with it. Is this useful for you? Or is your use case even more complex with even more requirements?

This version is not working as I expect, based on your description. I see the new config option, but I observe no difference in behavior when I select it.

I do observe that this version corrects the C++ cache writer. The file sizes reported are accurate and match the Perl cache write output.

I am going to tweak your cache writer script such that it reports ids, rather than names. I can then send you my examples without compromising the confidentiality of the content.

> This version is not working as I expect, based on your description. I see the new config option, but I observe no difference in behavior when I select it.

Not even after restarting the program?

Indeed, I did not grasp that I needed to restart. I can now confirm that the tool works in a comprehensible manner.

I now have two alternatives to obtain results from QDirStat that I understand. If I know that all but one hard link is external to the directory tree, I can select the "Ignore hard links" option.

I have been using a more flexible alternative. I generate a cache file where secondary hard links are converted to symbolic links. Identifying the "primary" hard link is subjective, and I can pick any heuristic that suits the situation.
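A rough Python sketch of that conversion step follows. It is not the tool used in this thread: the entry dictionaries are a hypothetical in-memory form rather than the cache file format, `demote_secondary_links` is an invented name, and giving the converted entries a size of 0 is a simplifying assumption. The default heuristic picks the lexicographically first path as the primary link; any other key function can be passed in.

```python
from collections import defaultdict


def demote_secondary_links(entries, primary_key=lambda e: e["path"]):
    """Keep one "primary" entry per inode and turn the others into symlinks.

    primary_key is the heuristic: the entry with the smallest key wins.
    """
    groups = defaultdict(list)
    for e in entries:
        groups[(e["dev"], e["ino"])].append(e)

    result = []
    for e in entries:
        primary = min(groups[(e["dev"], e["ino"])], key=primary_key)
        if e is primary:
            # The primary keeps its full size and is written as a single link.
            result.append({**e, "type": "file", "links": 1})
        else:
            # Assumption: the converted entry carries no size of its own.
            result.append({"path": e["path"], "type": "symlink",
                           "target": primary["path"], "size": 0})
    return result


entries = [
    {"path": "snap-01/x", "dev": 1, "ino": 10, "size": 1000},
    {"path": "snap-02/x", "dev": 1, "ino": 10, "size": 1000},
    {"path": "snap-02/y", "dev": 1, "ino": 11, "size": 400},
]

converted = demote_secondary_links(entries)
print([c["type"] for c in converted])      # ['file', 'symlink', 'file']
print(sum(c["size"] for c in converted))   # 1400
```

Each inode then contributes its full size exactly once, at whichever path the heuristic selects.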

I propose that we close this issue, since you have fixed the bug in the cache write and provided a mechanism to ignore out-of-scope hard links. But I would like to open a new issue where I hope to entice you to provide additional enhancements. Lastly, once I made a reasonable initial stab at my multi-threaded scanner, I'll propose a way that we can collaborate. Is this reasonable?

I have a related question. I did manage to modify your qdirstat-cache-writer script so that it displays ids rather than names. Are you interested in this? It provides a way to exchange examples while maintaining privacy.

Merged the branch to master.

> Lastly, once I made a reasonable initial stab at my multi-threaded scanner, I'll propose a way that we can collaborate. Is this reasonable?

Sure. I propose you fork this repo and keep those changes in a separate branch so we can do the usual drill with a pull request and the necessary discussions.

See also

https://github.com/shundhammer/qdirstat/blob/master/doc/GitHub-Workflow.md
https://github.com/shundhammer/qdirstat/blob/master/doc/Contributing.md

As mentioned here before, I want to keep a very minimalistic version of the qdirstat-cache-writer script around as the default so people can easily evaluate what it does to see that there are no backdoors or other evil things.

But we can add a number of alternative versions for improved performance. I still doubt very much, though, that multiple threads would really speed this up very much: This is completely I/O bound since there are a gazillion of system calls which all require context switches between user and kernel mode. There is very little computing that could benefit from using multiple CPU cores in multiple threads.

But feel free to try and experiment, of course.

> I have a related question. I did manage to modify your qdirstat-cache-writer script so that it displays ids rather than names. Are you interested in this? It provides a way to exchange examples while maintaining privacy.

This seems to be a very specialized tool. While it might come in handy in a few very select cases, I am not sure if it's worth the extra maintenance work over time. As mentioned above, I try to keep the qdirstat-cache-writer minimalistic.

Sadly, all the cache writing / reading is already quite a fringe case that does not appear to be used that much, so problems remain unnoticed for a long time (as we saw with the hard links accounting earlier in this issue). I fear the more code is added in that general area, the more poorly tested / maintained code will accumulate.

Maybe this would be a good candidate for adding to a debug branch (explicitly without support, debug only) so users could be pointed to it when the need arises.