duperemove quadratic slowdown on incremental runs
trofi opened this issue · 22 comments
I use a btrfs filesystem (compression enabled). I have about 30M files in 75K directories. It's a growing cache of build artifacts: it grows a directory at a time and contains quite a lot of duplicate data (50% to 70%). Most of the similar files are only a partial match: roughly 20 bytes out of ~1MB differ (different embedded identifiers).
I wondered if I could run duperemove incrementally, one directory at a time. But it looks like, at least by default, duperemove does a lot of work when asked to deduplicate a single file: it reads (and maybe extracts extents for?) all known files.
The following seems to be enough to illustrate the slowdown: create N files and add them one by one to the database (real use would be a directory at a time):
$ rm -rf t hf.db; time { mkdir t; for i in `seq 1 1000`; do echo $i > t/$i; duperemove --hashfile=hf.db t/$i; done; }
I'll scale the 1000 up linearly to see how long each run takes:
# 1k
real 0m8,429s
# 2k
real 0m21,543s (+13s)
# 3k
real 0m39,847s (+18s)
# 4k
real 1m4,735s (+25s)
# 5k
real 1m34,389s (+30s)
It's not a clean test, as it also measures the time to create the files, but file creation is minor: it takes 9ms to create 5k files.
I think we are seeing a quadratic slowdown here. Is this intended (required by design) behaviour, or is it possibly an omission somewhere, so that incremental mode could be faster?
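A minimal sketch to measure the file-creation baseline on its own (not part of the original runs above), so it can be subtracted from the timings:
# Same loop as above, minus duperemove; confirms file creation is negligible.
$ rm -rf t; time { mkdir t; for i in `seq 1 5000`; do echo $i > t/$i; done; }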
Thanks!
Hello @trofi
When you run a command like duperemove -hr --hashfile=/tmp/hashes some-directory, duperemove will list and process the contents of some-directory, but it will also check the files already recorded in the hashfile (do they still exist, have they changed, etc.).
For your use case, which is, as I understand it, some sort of append-only filesystem, all of that work is useless and expensive.
Could you check the quick code I pushed in this branch? It adds an --ignore-existing-files option for you to try.
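A sketch of the intended incremental usage (the flag name is from the branch above; the directory name is a placeholder):
# Hash and dedupe only the newly added directory; files already recorded in
# the hashfile are not re-checked.
$ duperemove -hr --hashfile=/tmp/hashes --ignore-existing-files new-directory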
As a test, I created a directory that contains 100 sub-directories with 1000 files each (100000 files in total). The hashfile has been built:
0.49 [jack:/tmp/mnt] hashstats /tmp/hashes
Raw header info for "/tmp/hashes":
version: 3.0 block_size: 131072
num_files: 100000 num_hashes: 0
Print top 10 hashes (this may take some time)
Running it from master, with just one directory as a target:
0.42 [jack:/tmp/mnt] /usr/bin/time -v duperemove -hr --hashfile=/tmp/hashes 1 > /dev/null
Command being timed: "duperemove -hr --hashfile=/tmp/hashes 1"
User time (seconds): 0.22
System time (seconds): 0.78
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.02
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 40576
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 9575
Voluntary context switches: 8
Involuntary context switches: 151
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
0.57 [jack:/tmp/mnt] /usr/bin/time -v strace -fty duperemove -hr --hashfile=/tmp/hashes 1 |& wc -l
508305
And with the option:
0.24 [jack:/tmp/mnt] /usr/bin/time -v duperemove -hr --hashfile=/tmp/hashes 1 --ignore-existing-files >/dev/null
Command being timed: "duperemove -hr --hashfile=/tmp/hashes 1 --ignore-existing-files"
User time (seconds): 0.08
System time (seconds): 0.03
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.12
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 22656
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 5086
Voluntary context switches: 8
Involuntary context switches: 14
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
0.77 [jack:/tmp/mnt] /usr/bin/time -v strace -fty duperemove -hr --hashfile=/tmp/hashes 1 --ignore-existing-files |& wc -l
13165
Running your test on my system, from master:
0.02 [jack:/tmp/test] 130 rm -rf t hf.db; time { mkdir t; for i in `seq 1 1000`; do echo $i > t/$i; duperemove --hashfile=hf.db t/$i >/dev/null; done; }
real 0m23.545s
user 0m6.188s
sys 0m17.464s
0.42 [jack:/tmp/test] rm -rf t hf.db; time { mkdir t; for i in `seq 1 2000`; do echo $i > t/$i; duperemove --hashfile=hf.db t/$i >/dev/null; done; }
real 1m0.945s
user 0m17.297s
sys 0m43.896s
1.06 [jack:/tmp/test] rm -rf t hf.db; time { mkdir t; for i in `seq 1 3000`; do echo $i > t/$i; duperemove --hashfile=hf.db t/$i >/dev/null; done; }
real 1m48.824s
user 0m32.310s
sys 1m16.854s
With the option:
1.39 [jack:/tmp/test] rm -rf t hf.db; time { mkdir t; for i in `seq 1 1000`; do echo $i > t/$i; duperemove --hashfile=hf.db t/$i --ignore-existing-files >/dev/null; done; }
real 0m16.674s
user 0m5.141s
sys 0m11.602s
1.11 [jack:/tmp/test] rm -rf t hf.db; time { mkdir t; for i in `seq 1 2000`; do echo $i > t/$i; duperemove --hashfile=hf.db t/$i --ignore-existing-files >/dev/null; done; }
real 0m37.582s
user 0m14.209s
sys 0m23.538s
1.25 [jack:/tmp/test] rm -rf t hf.db; time { mkdir t; for i in `seq 1 3000`; do echo $i > t/$i; duperemove --hashfile=hf.db t/$i --ignore-existing-files >/dev/null; done; }
real 1m2.863s
user 0m26.583s
sys 0m36.609s
Thank you for such a quick response! --ignore-existing-files does indeed speed up incremental runs, with a few caveats.
You are correct, it's mostly append-only storage: less-used directories are garbage-collected once every few months and new directories are added every few minutes.
I tried to run the change on real storage (the /nix/store of the nix package manager) as:
$ cat ingest.bash
#!/usr/bin/env bash
find /nix/store -maxdepth 1 -type d |
while read -r d; do
    case $d in
        /nix/store) continue;;
        /nix/store/.links) continue;;
    esac
    echo "Ingesting $d"
    ./duperemove -r --hashfile=h.db --ignore-existing-files "$d" "$@"
done
Visually scanning the output, it looks like some files keep being reported as candidates on every run. If I pick 2 tiny directories and add them one after the other, each run reports seemingly unrelated duplication candidates:
# it has 4 small files
./duperemove -r --hashfile=h.db --ignore-existing-files /nix/store/6ghc5jz83f7r4cd4cnpzagbk6jn1w92y-lv-4.51
...
15728640 "/nix/store/pbay9rq3hjapfix5d51acxkplyhrnil4-blas-3/lib/libblas.so.3"
15728640 "/nix/store/pbay9rq3hjapfix5d51acxkplyhrnil4-blas-3/lib/libcblas.so.3"
15728640 "/nix/store/7d0pxnfwsjk9wynnyg0bajscx6n17gmy-lapack-3/lib/liblapacke.so.3"
Showing 1 identical extents of length 1173158 with id 9ed5bac9
Start Filename
0 "/nix/store/lmq1lqzh91knsc2i4a4yf3f58k1qg3s6-libmysofa-1.3.1/share/libmysofa/MIT_KEMAR_normal_pinna.sofa"
Showing 1 identical extents of length 1232695 with id 29000b05
Start Filename
0 "/nix/store/wp6wyf46p68j7c7iy7i0l187bgkr727s-python3-3.10.12/lib/python3.10/ensurepip/_bundled/setuptools-65.5.0-py3-none-any.whl"
Showing 2 identical extents of length 1441792 with id bf04ff05
Start Filename
10616832 "/nix/store/w26n65yi1a2n9ziwna6a5dknanc8xfnc-icu4c-73.2/lib/libicudata.so.73.2"
10616832 "/nix/store/7wl5vy21r1r0kh5mhliqwj1rplzaf8kl-icu4c-73.2/lib/libicudata.so.73.2"
Showing 1 identical extents of length 2055563 with id 80a4ef42
Start Filename
0 "/nix/store/wp6wyf46p68j7c7iy7i0l187bgkr727s-python3-3.10.12/lib/python3.10/ensurepip/_bundled/pip-23.0.1-py3-none-any.whl"
Showing 2 identical extents of length 2490368 with id a3daa133
Start Filename
655360 "/nix/store/w26n65yi1a2n9ziwna6a5dknanc8xfnc-icu4c-73.2/lib/libicudata.so.73.2"
655360 "/nix/store/7wl5vy21r1r0kh5mhliqwj1rplzaf8kl-icu4c-73.2/lib/libicudata.so.73.2"
Total files scanned: 2
Total extent hashes scanned: 3
# also has 4 small files, built against older libc version
$ ./duperemove -r --hashfile=h.db --ignore-existing-files /nix/store/7bzkrs0vgvr4fim4ayy0i8zqrf0z84zm-lv-4.51
...
15728640 "/nix/store/pbay9rq3hjapfix5d51acxkplyhrnil4-blas-3/lib/libblas.so.3"
15728640 "/nix/store/pbay9rq3hjapfix5d51acxkplyhrnil4-blas-3/lib/libcblas.so.3"
15728640 "/nix/store/7d0pxnfwsjk9wynnyg0bajscx6n17gmy-lapack-3/lib/liblapacke.so.3"
Showing 1 identical extents of length 1173158 with id 9ed5bac9
Start Filename
0 "/nix/store/lmq1lqzh91knsc2i4a4yf3f58k1qg3s6-libmysofa-1.3.1/share/libmysofa/MIT_KEMAR_normal_pinna.sofa"
Showing 1 identical extents of length 1232695 with id 29000b05
Start Filename
0 "/nix/store/wp6wyf46p68j7c7iy7i0l187bgkr727s-python3-3.10.12/lib/python3.10/ensurepip/_bundled/setuptools-65.5.0-py3-none-any.whl"
Showing 2 identical extents of length 1441792 with id bf04ff05
Start Filename
10616832 "/nix/store/w26n65yi1a2n9ziwna6a5dknanc8xfnc-icu4c-73.2/lib/libicudata.so.73.2"
10616832 "/nix/store/7wl5vy21r1r0kh5mhliqwj1rplzaf8kl-icu4c-73.2/lib/libicudata.so.73.2"
Showing 1 identical extents of length 2055563 with id 80a4ef42
Start Filename
0 "/nix/store/wp6wyf46p68j7c7iy7i0l187bgkr727s-python3-3.10.12/lib/python3.10/ensurepip/_bundled/pip-23.0.1-py3-none-any.whl"
Showing 2 identical extents of length 2490368 with id a3daa133
Start Filename
655360 "/nix/store/w26n65yi1a2n9ziwna6a5dknanc8xfnc-icu4c-73.2/lib/libicudata.so.73.2"
655360 "/nix/store/7wl5vy21r1r0kh5mhliqwj1rplzaf8kl-icu4c-73.2/lib/libicudata.so.73.2"
Total files scanned: 2
Total extent hashes scanned: 3
It looks like libicudata.so.73.2 (and many others) gets brought up every time a new run is started.
Do I understand correctly that duperemove keeps re-considering the same extents, unrelated to the incremental input, for deduplication?
If I run it with --quiet and strip the header, here is how the progress looks:
Ingesting /nix/store/hxnwgjqv2ghx2ms9mqz4cz5w4mhgbc6y-file-5.45-dev
Found 18649 identical extents.
Simple read and compare of file data found 16938 instances of extents that might benefit from deduplication.
Total files scanned: 3
Ingesting /nix/store/7pynr1g9si15n72l0wzp8nz4qf5xb36x-libdbusmenu-qt
Found 18649 identical extents.
Simple read and compare of file data found 16938 instances of extents that might benefit from deduplication.
Total files scanned: 67
Ingesting /nix/store/n604q5jmihjc3qfhgbfwknrlsqbwj19x-libkate-0.4.1
Found 18523 identical extents.
Simple read and compare of file data found 16812 instances of extents that might benefit from deduplication.
Total files scanned: 154
Ingesting /nix/store/bdds9wax3qsnhjxfkinkxzdav3fdxriq-python3.10-pyudev-0.24.1
Found 18523 identical extents.
Simple read and compare of file data found 16812 instances of extents that might benefit from deduplication.
Total files scanned: 57
I suspect this means duperemove keeps reconsidering roughly the same ~18K extents for deduplication on each run.
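A quick way to confirm that suspicion would be to diff the candidate lists of two consecutive runs (the log file names here are hypothetical; the 'Showing ... identical extents' lines are from the output format above):
$ diff <(grep '^Showing' run1.log) <(grep '^Showing' run2.log)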
Hello
You are correct, existing files are always considered for deduplication. To be more specific, any new extent is matched against the whole dataset.
There is a remaining optimisation: deduplicating the new extent against a single old extent (and not the whole list). All old extents are already deduplicated, so they live at the same physical position. That is a more difficult change, I'll think about it.
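A rough illustration of that observation (assuming filefrag from e2fsprogs is available; the file names are placeholders): copies that were already deduplicated report the same physical offsets.
# Matching physical_offset columns mean the two files share extents on disk.
$ filefrag -v old-copy-a old-copy-b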
--dedupe-option=norescan_files test: all good
Tested on a real system as:
$ find /nix/store -maxdepth 1 -type d | fgrep -v '/nix/store/.links' | grep -Pv /nix/store$ | xargs -t -n 1000 ./duperemove -q -r --hashfile=h.db --dedupe-option=norescan_files
The drop-v2 branch behaves quite nicely! Thank you!
Running directory-at-a-time is a bit slow (each directory takes at least about a second to process), but batching directories 100-at-a-time or 1000-at-a-time (as above) makes it very snappy!
My final database is 3.5G (looks decent).
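For a quick sanity check of what ended up in it, the hashstats tool used earlier in this thread can be pointed at the hashfile:
$ hashstats h.db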
--dedupe-option=norescan_files,partial test: seems to hang
A possibly related question: incremental mode is also useful for seeing whether duperemove gets stuck or keeps making progress when run against a small directory at a time. I tried to increase deduplication precision by throwing --dedupe-option=norescan_files,partial at my 3.5G database:
$ find /nix/store -maxdepth 1 -type d | fgrep -v '/nix/store/.links' | grep -Pv /nix/store$ | xargs -t -n 1 ./duperemove -q -r --hashfile=h.db --dedupe-option=norescan_files,partial
./duperemove -q -r '--hashfile=h.db' '--dedupe-option=norescan_files,partial' /nix/store/57904rflsfh46y7fxi62rzc7sidc9vmq-user-environment
No dedupe candidates found.
./duperemove -q -r '--hashfile=h.db' '--dedupe-option=norescan_files,partial' /nix/store/kpadhaqz7vx83rrdvnwrjndkxq9s5iil-roboto-mono-2.002-20190125
Found 1552412 identical extents.
And the output is stuck on it (30+ minutes without progress). The directory does not look big:
$ du -hs /nix/store/kpadhaqz7vx83rrdvnwrjndkxq9s5iil-roboto-mono-2.002-20190125
1,2M /nix/store/kpadhaqz7vx83rrdvnwrjndkxq9s5iil-roboto-mono-2.002-20190125
$ find /nix/store/kpadhaqz7vx83rrdvnwrjndkxq9s5iil-roboto-mono-2.002-20190125 | wc -l
14
Did duperemove get stuck indefinitely, or is it just super slow?
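A hedged way to tell the two apart, using strace as earlier in this thread ($PID stands for the duperemove process id, e.g. from pgrep duperemove):
# Continuous read/ioctl traffic means it is slow but progressing; long silence
# after attaching suggests a genuine hang.
$ strace -f -p "$PID" -e trace=read,pread64,ioctl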
@trofi Could you check out that branch again? I have just pushed some more code.
Also note that you will need to recreate the hashfile. The partial mode is only used for new (or modified) files.
Deleted hash file, checked out latest branch state. Getting the following crash:
$ find /nix/store -maxdepth 1 -type d | fgrep -v '/nix/store/.links' | grep -Pv /nix/store$ | LANG=C xargs -t -n 1 ./duperemove -r --hashfile=h.db --dedupe-option=norescan_files,partial
./duperemove -r '--hashfile=h.db' '--dedupe-option=norescan_files,partial' /nix/store/57904rflsfh46y7fxi62rzc7sidc9vmq-user-environment
Gathering file list...
Adding files from database for hashing.
No dedupe candidates found.
./duperemove -r '--hashfile=h.db' '--dedupe-option=norescan_files,partial' /nix/store/kpadhaqz7vx83rrdvnwrjndkxq9s5iil-roboto-mono-2.002-20190125
Gathering file list...
Adding files from database for hashing.
Loading only duplicated hashes from hashfile.
Found 1 identical extents.
ERROR: hash-tree.c:68
[stack trace follows]
./duperemove(print_stack_trace+0x2e) [0x40a48e]
./duperemove(insert_hashed_block+0x3f1) [0x406fe1]
./duperemove(dbfile_load_block_hashes+0x86) [0x410e06]
./duperemove(process_duplicates+0x1a7) [0x406777]
./duperemove(populate_tree+0x25a) [0x40c32a]
./duperemove(main+0x8bc) [0x405f7c]
/nix/store/905gkx2q1pswixwmi1qfhfl6mik3f22l-glibc-2.37-8/lib/libc.so.6(+0x23ace) [0x7ffff7b63ace]
/nix/store/905gkx2q1pswixwmi1qfhfl6mik3f22l-glibc-2.37-8/lib/libc.so.6(__libc_start_main+0x89) [0x7ffff7b63b89]
./duperemove(_start+0x25) [0x406505]
xargs: ./duperemove: terminated by signal 6
gdb's backtrace, if it's of any help:
Program received signal SIGABRT, Aborted.
__pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
44 pthread_kill.c: No such file or directory.
(gdb) bt
#0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1 0x00007ffff7bc7af3 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#2 0x00007ffff7b78c86 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3 0x00007ffff7b628ba in __GI_abort () at abort.c:79
#4 0x0000000000406fe6 in insert_file_hash_head (head=<optimized out>, dups=<optimized out>) at hash-tree.c:128
#5 add_file_hash_head (block=<optimized out>, dups=<optimized out>) at hash-tree.c:151
#6 insert_hashed_block (tree=tree@entry=0x7fffffffaa30, digest=digest@entry=0x430348 "[\203\371\210\272\344UV+@G\035\022\344:\240", file=0x6206e0, loff=loff@entry=0, flags=flags@entry=8) at hash-tree.c:290
#7 0x0000000000410e06 in dbfile_load_block_hashes (hash_tree=hash_tree@entry=0x7fffffffaa30) at dbfile.c:1357
#8 0x0000000000406777 in process_duplicates () at duperemove.c:578
#9 0x000000000040c32a in populate_tree (cfg=cfg@entry=0x41b300 <dbfile_cfg>, batch_size=0, callback=callback@entry=0x4065d0 <process_duplicates>, scanned_filerec=scanned_filerec@entry=0x41b340 <scanned_filerec>)
at file_scan.c:1034
#10 0x0000000000405f7c in create_update_hashfile (filelist_idx=<optimized out>, argv=<optimized out>, argc=<optimized out>) at duperemove.c:652
#11 main (argc=<optimized out>, argv=<optimized out>) at duperemove.c:741
@trofi I have struggled to reproduce your issue :/
On the other hand, after checking a bit deeper, that commit is buggy, so I have reworked it.
Tried commit 4a48838. It is able to run against a few directories successfully and then fails:
$ rm h.db
$ find /nix/store -maxdepth 1 -type d | fgrep -v '/nix/store/.links' | grep -Pv /nix/store$ | LANG=C xargs -t -n 1 ./duperemove -r --hashfile=h.db --dedupe-option=norescan_files,partial
...
[19/19] (100.00%) csum: /nix/store/hhcnwljvhlg4ib1k7zc27pg23611sh64-glibc-2.37-8-bin/bin/getconf
Loading only duplicated hashes from hashfile.
Found 1 identical extents.
ERROR: hash-tree.c:68
[stack trace follows]
./duperemove(print_stack_trace+0x2e) [0x40a40e]
./duperemove(insert_hashed_block+0x3f1) [0x406f61]
./duperemove(dbfile_load_block_hashes+0x86) [0x410d76]
./duperemove(process_duplicates+0x138) [0x4066f8]
./duperemove(populate_tree+0x1b6) [0x40c206]
./duperemove(main+0x8b4) [0x405f74]
/nix/store/aw2fw9ag10wr9pf0qk4nk5sxi0q0bn56-glibc-2.37-8/lib/libc.so.6(+0x23ace) [0x7ffff7b63ace]
/nix/store/aw2fw9ag10wr9pf0qk4nk5sxi0q0bn56-glibc-2.37-8/lib/libc.so.6(__libc_start_main+0x89) [0x7ffff7b63b89]
./duperemove(_start+0x25) [0x4064f5]
xargs: ./duperemove: terminated by signal 6
gdb backtrace:
Program received signal SIGABRT, Aborted.
0x00007ffff7bc7a8c in __pthread_kill_implementation () from /nix/store/aw2fw9ag10wr9pf0qk4nk5sxi0q0bn56-glibc-2.37-8/lib/libc.so.6
(gdb) bt
#0 0x00007ffff7bc7a8c in __pthread_kill_implementation () from /nix/store/aw2fw9ag10wr9pf0qk4nk5sxi0q0bn56-glibc-2.37-8/lib/libc.so.6
#1 0x00007ffff7b78c86 in raise () from /nix/store/aw2fw9ag10wr9pf0qk4nk5sxi0q0bn56-glibc-2.37-8/lib/libc.so.6
#2 0x00007ffff7b628ba in abort () from /nix/store/aw2fw9ag10wr9pf0qk4nk5sxi0q0bn56-glibc-2.37-8/lib/libc.so.6
#3 0x0000000000406f66 in insert_file_hash_head (head=<optimized out>, dups=<optimized out>) at hash-tree.c:128
#4 add_file_hash_head (block=<optimized out>, dups=<optimized out>) at hash-tree.c:151
#5 insert_hashed_block (tree=tree@entry=0x7fffffffaaa0, digest=digest@entry=0x4307c8 "[\203\371\210\272\344UV+@G\035\022\344:\240", file=0x45b4e0,
loff=loff@entry=0, flags=flags@entry=8) at hash-tree.c:290
#6 0x0000000000410d76 in dbfile_load_block_hashes (hash_tree=hash_tree@entry=0x7fffffffaaa0) at dbfile.c:1357
#7 0x00000000004066f8 in process_duplicates () at duperemove.c:575
#8 0x000000000040c2d6 in populate_tree (cfg=cfg@entry=0x41b300 <dbfile_cfg>, batch_size=0, callback=callback@entry=0x4065c0 <process_duplicates>)
at file_scan.c:1039
#9 0x0000000000405f74 in create_update_hashfile (filelist_idx=<optimized out>, argv=<optimized out>, argc=<optimized out>) at duperemove.c:644
#10 main (argc=<optimized out>, argv=<optimized out>) at duperemove.c:732
There is another bug report that seems similar: #285
@trofi Could you send me your hashfile (the one that produces the crash)?
Sure! https://slyfox.uni.cx/b/2023-09-09-h.db is the 3MB h.db I got. It should not contain any sensitive data.
Thank you!
sqlite> SELECT * FROM hashes JOIN (SELECT DISTINCT digest FROM hashes WHERE digest IN (SELECT DISTINCT digest FROM hashes WHERE ino IN (SELECT DISTINCT ino FROM files WHERE dedupe_seq > (SELECT keyval FROM config WHERE keyname = 'dedupe_sequence'))) GROUP BY digest HAVING count(*) > 1) AS duplicate_hashes on hashes.digest = duplicate_hashes.digest;
[���UV+@G�:�|3730010316|848|0|8|[���UV+@G�:�
[���UV+@G�:�|3730010316|848|0|8|[���UV+@G�:�
So there are indeed duplicate records in the hashfile.
I have added unique constraints to prevent such records from being inserted. This way, you will crash earlier :)
I will dig deeper.
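A minimal sketch of the kind of constraint described above, with column names guessed from the query output (the actual schema and the constraint added upstream may differ):
$ sqlite3 h.db "CREATE UNIQUE INDEX IF NOT EXISTS hashes_unique ON hashes(ino, subvol, loff);"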
I think it detects the constraint violation now. But it looks like that does not cause duperemove to stop; instead it loops over and over:
$ ./duperemove -q -r '--hashfile=h.db' '--dedupe-option=norescan_files,partial' /nix/store/hhcnwljvhlg4ib1k7zc27pg23611sh64-glibc-2.37-8-bin
Warning: The hash file format in Duperemove master branch is under development and may change.
If the changes are not backwards compatible, you will have to re-create your hash file.
dbfile_store_block_hashes()/2739581: Database error 19 while executing statement: constraint failed
Skipping file due to error 19 from function csum_by_extent (No such device), /nix/store/hhcnwljvhlg4ib1k7zc27pg23611sh64-glibc-2.37-8-bin/bin/getconf
dbfile_begin_trans()/2739578: Database error 1 while starting transaction: SQL logic error
Skipping file due to error 1 from function csum_by_extent (Operation not permitted), /nix/store/hhcnwljvhlg4ib1k7zc27pg23611sh64-glibc-2.37-8-bin/bin/tzselect
dbfile_begin_trans()/2739586: Database error 1 while starting transaction: SQL logic error
Skipping file due to error 1 from function csum_by_extent (Operation not permitted), /nix/store/hhcnwljvhlg4ib1k7zc27pg23611sh64-glibc-2.37-8-bin/bin/xtrace
dbfile_begin_trans()/2739579: Database error 1 while starting transaction: SQL logic error
Skipping file due to error 1 from function csum_by_extent (Operation not permitted), /nix/store/hhcnwljvhlg4ib1k7zc27pg23611sh64-glibc-2.37-8-bin/bin/localedef
dbfile_begin_trans()/2739570: Database error 1 while starting transaction: SQL logic error
Skipping file due to error 1 from function csum_by_extent (Operation not permitted), /nix/store/hhcnwljvhlg4ib1k7zc27pg23611sh64-glibc-2.37-8-bin/bin/nscd
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Found 0 identical extents.
...
Nothing fancy on the subvolumes side, but yes on hardlinks: all identical files are already hardlinked to the same inodes:
$ LANG=C stat /nix/store/3n58xw4373jp0ljirf06d8077j15pc4j-glibc-2.37-8/libexec/getconf/POSIX_V6_LP64_OFF64 /nix/store/hhcnwljvhlg4ib1k7zc27pg23611sh64-glibc-2.37-8-bin/bin/getconf
File: /nix/store/3n58xw4373jp0ljirf06d8077j15pc4j-glibc-2.37-8/libexec/getconf/POSIX_V6_LP64_OFF64
Size: 29248 Blocks: 64 IO Block: 4096 regular file
Device: 0,30 Inode: 3730010316 Links: 5
Access: (0555/-r-xr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2023-08-05 11:00:51.000000000 +0100
Modify: 1970-01-01 01:00:01.000000000 +0100
Change: 2023-08-05 11:00:52.321137476 +0100
Birth: 2023-08-05 11:00:51.598124397 +0100
File: /nix/store/hhcnwljvhlg4ib1k7zc27pg23611sh64-glibc-2.37-8-bin/bin/getconf
Size: 29248 Blocks: 64 IO Block: 4096 regular file
Device: 0,30 Inode: 3730010316 Links: 5
Access: (0555/-r-xr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2023-08-05 11:00:51.000000000 +0100
Modify: 1970-01-01 01:00:01.000000000 +0100
Change: 2023-08-05 11:00:52.321137476 +0100
Birth: 2023-08-05 11:00:51.598124397 +0100
I have 4 names for it:
$ find /nix/store/ -inum 3730010316
/nix/store/.links/0miarylrg34m2gn2hji7a8q6rvf96zicp93q7rf87w7dwfmc41l5
/nix/store/3n58xw4373jp0ljirf06d8077j15pc4j-glibc-2.37-8/libexec/getconf/POSIX_V6_LP64_OFF64
/nix/store/3n58xw4373jp0ljirf06d8077j15pc4j-glibc-2.37-8/libexec/getconf/POSIX_V7_LP64_OFF64
/nix/store/3n58xw4373jp0ljirf06d8077j15pc4j-glibc-2.37-8/libexec/getconf/XBS5_LP64_OFF64
/nix/store/hhcnwljvhlg4ib1k7zc27pg23611sh64-glibc-2.37-8-bin/bin/getconf
OK, more bugs were squashed.
I have added more constraints to the hashfile database and, hopefully, fixed your issue (c89a771).
Also, the partial lookup is still a bit slow on my system; it appears that the lookaside allocator is disabled at compile time in libsqlite3 on Debian. An issue has been raised.
edit: more bugs were found (and fixed with 1657ab8)
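One way to check how the system SQLite was built (PRAGMA compile_options is standard SQLite; whether a lookaside-related flag shows up depends on how it was disabled):
# Lists compile-time options; OMIT_LOOKASIDE, if present, would confirm the
# allocator was compiled out. Other mechanisms may not show up here.
$ sqlite3 :memory: 'PRAGMA compile_options;' | grep -i lookaside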
Somehow I am seeing SIGSEGVs in multithreaded mode with ./duperemove -q -r '--hashfile=h.db' '--dedupe-options=norescan_files' -d ... (commit 5ad3361).
gdb backtrace:
(gdb) bt
#0 0x00007f4b16fbcded in sqlite3DbMallocRawNN ()
from /nix/store/d9n3qjjnzwna2igf8pr4z896ssxmyxfp-sqlite-3.42.0/lib/libsqlite3.so.0
#1 0x00007f4b16fc04fb in sqlite3DbStrDup ()
from /nix/store/d9n3qjjnzwna2igf8pr4z896ssxmyxfp-sqlite-3.42.0/lib/libsqlite3.so.0
#2 0x00007f4b16fce527 in sqlite3VdbeScanStatus ()
from /nix/store/d9n3qjjnzwna2igf8pr4z896ssxmyxfp-sqlite-3.42.0/lib/libsqlite3.so.0
#3 0x00007f4b16fce5db in sqlite3WhereAddScanStatus ()
from /nix/store/d9n3qjjnzwna2igf8pr4z896ssxmyxfp-sqlite-3.42.0/lib/libsqlite3.so.0
#4 0x00007f4b1703a4db in sqlite3WhereBegin ()
from /nix/store/d9n3qjjnzwna2igf8pr4z896ssxmyxfp-sqlite-3.42.0/lib/libsqlite3.so.0
#5 0x00007f4b17061484 in sqlite3Update.constprop.0 ()
from /nix/store/d9n3qjjnzwna2igf8pr4z896ssxmyxfp-sqlite-3.42.0/lib/libsqlite3.so.0
#6 0x00007f4b170445a4 in yy_reduce.constprop ()
from /nix/store/d9n3qjjnzwna2igf8pr4z896ssxmyxfp-sqlite-3.42.0/lib/libsqlite3.so.0
#7 0x00007f4b17045de7 in sqlite3RunParser ()
from /nix/store/d9n3qjjnzwna2igf8pr4z896ssxmyxfp-sqlite-3.42.0/lib/libsqlite3.so.0
#8 0x00007f4b1704672f in sqlite3Prepare ()
from /nix/store/d9n3qjjnzwna2igf8pr4z896ssxmyxfp-sqlite-3.42.0/lib/libsqlite3.so.0
#9 0x00007f4b17046ba3 in sqlite3LockAndPrepare ()
from /nix/store/d9n3qjjnzwna2igf8pr4z896ssxmyxfp-sqlite-3.42.0/lib/libsqlite3.so.0
#10 0x00007f4b17046d06 in sqlite3_prepare_v2 ()
from /nix/store/d9n3qjjnzwna2igf8pr4z896ssxmyxfp-sqlite-3.42.0/lib/libsqlite3.so.0
#11 0x000000000040f28e in dbfile_update_extent_poff (db=db@entry=0x11c8778,
ino=4451283826, subvol=848, loff=3276800, poff=2009557819392) at dbfile.c:343
#12 0x000000000040d6dc in extent_dedupe_worker (kern_bytes=0x7f4b067fbd80,
fiemap_bytes=<synthetic pointer>, dext=<optimized out>) at run_dedupe.c:518
#13 dedupe_worker (priv=<optimized out>, counts=0x7ffee50daf20) at run_dedupe.c:535
#14 0x00007f4b1716511a in g_thread_pool_thread_proxy ()
from /nix/store/hl36mh6qdb8a7ymhql4qmfza3cdr24in-glib-2.76.4/lib/libglib-2.0.so.0
#15 0x00007f4b171647cd in g_thread_proxy ()
from /nix/store/hl36mh6qdb8a7ymhql4qmfza3cdr24in-glib-2.76.4/lib/libglib-2.0.so.0
#16 0x00007f4b16e25dd4 in start_thread () from /nix/store/aw2fw9ag10wr9pf0qk4nk5sxi0q0bn56-glibc-2.37-8/lib/libc.so.6
#17 0x00007f4b16ea79b0 in clone3 () from /nix/store/aw2fw9ag10wr9pf0qk4nk5sxi0q0bn56-glibc-2.37-8/lib/libc.so.6
Adding --io-threads=1 --cpu-threads=1 makes the crash go away.
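Until that is sorted out, the workaround mentioned above would look roughly like this (the target directory is a placeholder):
# Force single-threaded hashing and dedupe to avoid the SQLite crash.
$ ./duperemove -q -r --hashfile=h.db --dedupe-options=norescan_files -d --io-threads=1 --cpu-threads=1 /nix/store/some-directory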
Hello @trofi
All the code has been merged into master.
I am currently running it against my live system; don't hesitate to report any issues you find.
Thank you
Sounds good. Let's close this issue then.