pauldreik/rdfind

Fails to create correct equivalence classes over hard links

albcorp opened this issue · 1 comment

When run with the -makehardlinks option, rdfind does not merge all the hard links and copies of a single file onto one inode.

WHAT I DID

I took the following steps to demonstrate the fault.

  1. Distribute a single file across a sequence of directories using a
    mix of cp and ln
    mkdir A B C D E F G H I J
    cp ~/Documents/README.rst A
    ln A/README.rst B/README.rst
    ln B/README.rst C/README.rst
    ln C/README.rst D/README.rst
    cp D/README.rst E/
    ln E/README.rst F/README.rst
    cp F/README.rst G/
    ln G/README.rst H/README.rst
    ln H/README.rst I/README.rst
    cp I/README.rst J/
    
  2. Verify the sequence of inodes of the copies
    ls -i [A-J]/README.rst
    
    Receive output of the form:
    2720488 A/README.rst  2720488 B/README.rst  2720488 C/README.rst
    2720488 D/README.rst  2720624 E/README.rst  2720624 F/README.rst
    2720650 G/README.rst  2720650 H/README.rst  2720650 I/README.rst
    2720708 J/README.rst
    
  3. Run rdfind across the sequence of directories
    rdfind -dryrun false -makehardlinks true [A-J]
    
    Receive output of the form:
    Now scanning "A", found 1 files.
    Now scanning "B", found 1 files.
    Now scanning "C", found 1 files.
    Now scanning "D", found 1 files.
    Now scanning "E", found 1 files.
    Now scanning "F", found 1 files.
    Now scanning "G", found 1 files.
    Now scanning "H", found 1 files.
    Now scanning "I", found 1 files.
    Now scanning "J", found 1 files.
    Now have 10 files in total.
    Removed 6 files due to nonunique device and inode.
    Total size is 76148 bytes or 74 KiB
    Removed 0 files due to unique sizes from list.4 files left.
    Now eliminating candidates based on first bytes:removed 0 files from list.4 files left.
    Now eliminating candidates based on last bytes:removed 0 files from list.4 files left.
    Now eliminating candidates based on sha1 checksum:removed 0 files from list.4 files left.
    It seems like you have 4 files that are not unique
    Totally, 56 KiB can be reduced.
    Now making results file results.txt
    Now making hard links.
    Making 3 links.
    
  4. Verify the sequence of inodes after relinking
    ls -i [A-J]/README.rst
    
    Receive output of the form:
    2720488 A/README.rst  2720488 B/README.rst  2720488 C/README.rst
    2720488 D/README.rst  2720488 E/README.rst  2720624 F/README.rst
    2720488 G/README.rst  2720650 H/README.rst  2720650 I/README.rst
    2720488 J/README.rst
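
The groups left over in step 4 can be made explicit by listing paths per inode. The helper below is a minimal sketch, not part of rdfind: the name `inode_classes` is hypothetical, and it assumes GNU `stat` (as shipped on Fedora) and paths without whitespace.

```shell
# Hypothetical helper: print one line per (device, inode) pair, followed by
# every path that shares it. Assumes GNU stat and no whitespace in paths.
inode_classes() {
    stat -c '%d:%i %n' -- "$@" |
        awk '{ cls[$1] = cls[$1] " " $2 }
             END { for (k in cls) print k cls[k] }' |
        sort
}

# Usage: one output line per distinct inode.
# inode_classes [A-J]/README.rst
```

For the listing in step 4 this would print three lines (inodes 2720488, 2720624 and 2720650), where a complete unification would print one.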
    

WHAT I EXPECTED

I expected all the filenames to be hard-linked to the same inode, in
which case the output would have been of the form:

2720488 A/README.rst  2720488 B/README.rst  2720488 C/README.rst
2720488 D/README.rst  2720488 E/README.rst  2720488 F/README.rst
2720488 G/README.rst  2720488 H/README.rst  2720488 I/README.rst
2720488 J/README.rst

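I note that the estimate of disk usage reduction supports my
expectation: the 4 remaining files total 76148 bytes (19037 bytes
each), and the reported 56 KiB saving corresponds to removing three of
those four copies.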

I found this bug when attempting to compress sequences of backups made
using rsync with the --link-dest option. rdfind reported very
substantial disk space savings, but achieved none.

VERSIONS

This is rdfind version 1.4.1

On Fedora 36

Never mind, I see this is covered under Caveats on the GitHub repository. I found the explanation confusing. I expected the algorithm to be based on disjoint-set forests.
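
For illustration, a disjoint-set approach could look roughly like the bash sketch below. It is not rdfind's algorithm and makes several assumptions: bash 4 or later (associative arrays), GNU `stat`, `sha1sum` and `ln`, and all paths on one filesystem. The function names `find_root`, `union` and `link_classes` are hypothetical. Paths are first merged when they share a device and inode, then one representative per inode class is hashed and classes with equal checksums are merged, and finally every path is hard-linked to the root of its class.

```shell
#!/usr/bin/env bash
# Sketch only: union-find over paths. Requires bash 4+, GNU stat, sha1sum, ln.

# Set ROOT to the representative of $1, shortening the chain as we walk it.
find_root() {
    local x=$1 px
    while px=${parent[$x]}; [[ $px != "$x" ]]; do
        parent[$x]=${parent[$px]}   # path halving: point at the grandparent
        x=${parent[$x]}
    done
    ROOT=$x
}

# Merge the classes containing $1 and $2; the root of $1 stays the root.
union() {
    local a b
    find_root "$1"; a=$ROOT
    find_root "$2"; b=$ROOT
    [[ $a == "$b" ]] || parent[$b]=$a
}

# Merge $2 (a path) into the class already registered under key $1, if any.
merge_by_key() {
    if [[ -n ${first[$1]+set} ]]; then
        union "${first[$1]}" "$2"
    else
        first[$1]=$2
    fi
}

link_classes() {
    local -A parent first   # visible to the helpers via dynamic scoping
    local p sum ROOT
    for p in "$@"; do parent[$p]=$p; done

    # Pass 1: paths that already share a device and inode form one class.
    for p in "$@"; do
        merge_by_key "i:$(stat -c '%d:%i' -- "$p")" "$p"
    done

    # Pass 2: hash one representative per class; equal checksums merge classes.
    for p in "$@"; do
        find_root "$p"
        [[ $p == "$ROOT" ]] || continue
        sum=$(sha1sum < "$p")
        merge_by_key "h:${sum%% *}" "$p"
    done

    # Pass 3: link every path to its class root, skipping existing links
    # (GNU ln refuses to link a file onto itself).
    for p in "$@"; do
        find_root "$p"
        [[ $p == "$ROOT" || $p -ef $ROOT ]] || ln -f -- "$ROOT" "$p"
    done
}

# Usage (this rewrites links in place, so try it on a scratch copy first):
# link_classes [A-J]/README.rst
```

Because the classes are merged before any link is made, the result does not depend on which paths were already hard-linked to each other when the run started.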