duperemove quadratic slowdown on incremental runs
trofi opened this issue · 22 comments
I use a btrfs filesystem (compression enabled). I have about 30M files in 75K directories. It's a growing cache of build artifacts: it grows a directory at a time and contains quite a lot of duplicate data (50% to 70%). Most of the similar files are only a partial match: roughly 20 bytes out of ~1MB differ (different embedded identifiers).
I wondered if I could run duperemove incrementally, one directory at a time. But it looks like, at least by default, duperemove does a lot of work when asked to deduplicate a single file: it reads (and maybe extracts extents for?) all known files.
The following seems to be enough to illustrate the slowdown: create N files and add them one by one to the database (real use would be a directory at a time):
$ rm -rf t hf.db; time { mkdir t; for i in `seq 1 1000`; do echo $i > t/$i; duperemove --hashfile=hf.db t/$i; done; }
I'll scale the 1000 up linearly to see how long each run takes:
# 1k
real 0m8,429s
# 2k
real 0m21,543s (+13s)
# 3k
real 0m39,847s (+18s)
# 4k
real 1m4,735s (+25s)
# 5k
real 1m34,389s (+30s)
It's not a clean test, as it also measures the time to create the files, but file creation is minor: it takes 9ms to create 5k files.
I think we are seeing a quadratic slowdown here. Is this intended (required by design) behaviour, or is it possibly an omission somewhere, so that incremental mode could be faster?
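A minimal sketch to measure the file-creation baseline on its own (not part of the original runs above), so it can be subtracted from the timings:
# Same loop as above, minus duperemove; confirms file creation is negligible.
$ rm -rf t; time { mkdir t; for i in `seq 1 5000`; do echo $i > t/$i; done; }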
Thanks!
Hello @trofi
When you run a command like duperemove -hr --hashfile=/tmp/hashes some-directory, duperemove will list and process the contents of some-directory, but it will also check the files already recorded in the hashfile (do they still exist, have they changed, etc.).
For your use case, which is, as I understand it, some sort of append-only filesystem, all of that work is useless and expensive.
Could you check the quick code I pushed in this branch? It adds an --ignore-existing-files option for you to try.
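A sketch of the intended incremental usage (the flag name is from the branch above; the directory name is a placeholder):
# Hash and dedupe only the newly added directory; files already recorded in
# the hashfile are not re-checked.
$ duperemove -hr --hashfile=/tmp/hashes --ignore-existing-files new-directory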
As a test, I created a directory that contains 100 sub-directories with 1000 files each (100000 files in total). The hashfile has been built:
0.49 [jack:/tmp/mnt] hashstats /tmp/hashes
Raw header info for "/tmp/hashes":
version: 3.0 block_size: 131072
num_files: 100000 num_hashes: 0
Print top 10 hashes (this may take some time)
Running it from master, with just one directory as a target:
0.42 [jack:/tmp/mnt] /usr/bin/time -v duperemove -hr --hashfile=/tmp/hashes 1 > /dev/null
Command being timed: "duperemove -hr --hashfile=/tmp/hashes 1"
User time (seconds): 0.22
System time (seconds): 0.78
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.02
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 40576
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 9575
Voluntary context switches: 8
Involuntary context switches: 151
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
0.57 [jack:/tmp/mnt] /usr/bin/time -v strace -fty duperemove -hr --hashfile=/tmp/hashes 1 |& wc -l
508305
And with the option:
0.24 [jack:/tmp/mnt] /usr/bin/time -v duperemove -hr --hashfile=/tmp/hashes 1 --ignore-existing-files >/dev/null
Command being timed: "duperemove -hr --hashfile=/tmp/hashes 1 --ignore-existing-files"
User time (seconds): 0.08
System time (seconds): 0.03
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.12
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 22656
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 5086
Voluntary context switches: 8
Involuntary context switches: 14
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
0.77 [jack:/tmp/mnt] /usr/bin/time -v strace -fty duperemove -hr --hashfile=/tmp/hashes 1 --ignore-existing-files |& wc -l
13165
Running your test on my system, from master:
0.02 [jack:/tmp/test] 130 rm -rf t hf.db; time { mkdir t; for i in `seq 1 1000`; do echo $i > t/$i; duperemove --hashfile=hf.db t/$i >/dev/null; done; }
real 0m23.545s
user 0m6.188s
sys 0m17.464s
0.42 [jack:/tmp/test] rm -rf t hf.db; time { mkdir t; for i in `seq 1 2000`; do echo $i > t/$i; duperemove --hashfile=hf.db t/$i >/dev/null; done; }
real 1m0.945s
user 0m17.297s
sys 0m43.896s
1.06 [jack:/tmp/test] rm -rf t hf.db; time { mkdir t; for i in `seq 1 3000`; do echo $i > t/$i; duperemove --hashfile=hf.db t/$i >/dev/null; done; }
real 1m48.824s
user 0m32.310s
sys 1m16.854s
With the option:
1.39 [jack:/tmp/test] rm -rf t hf.db; time { mkdir t; for i in `seq 1 1000`; do echo $i > t/$i; duperemove --hashfile=hf.db t/$i --ignore-existing-files >/dev/null; done; }
real 0m16.674s
user 0m5.141s
sys 0m11.602s
1.11 [jack:/tmp/test] rm -rf t hf.db; time { mkdir t; for i in `seq 1 2000`; do echo $i > t/$i; duperemove --hashfile=hf.db t/$i --ignore-existing-files >/dev/null; done; }
real 0m37.582s
user 0m14.209s
sys 0m23.538s
1.25 [jack:/tmp/test] rm -rf t hf.db; time { mkdir t; for i in `seq 1 3000`; do echo $i > t/$i; duperemove --hashfile=hf.db t/$i --ignore-existing-files >/dev/null; done; }
real 1m2.863s
user 0m26.583s
sys 0m36.609s
Thank you for such a quick response! --ignore-existing-files does indeed speed up incremental runs, with a few caveats.
You are correct, it's mostly append-only storage: less-used directories are garbage-collected once every few months and new directories are added every few minutes.
I tried to run the change on real storage (the /nix/store of the nix package manager) as:
$ cat ingest.bash
#!/usr/bin/env bash
find /nix/store -maxdepth 1 -type d |
while read -r d; do
    case $d in
        /nix/store) continue;;
        /nix/store/.links) continue;;
    esac
    echo "Ingesting $d"
    ./duperemove -r --hashfile=h.db --ignore-existing-files "$d" "$@"
done
Visually scanning the output, it looks like some files keep being reported as candidates on every run. If I pick 2 tiny directories and add them one after the other, each run reports seemingly unrelated duplication candidates:
# it has 4 small files
./duperemove -r --hashfile=h.db --ignore-existing-files /nix/store/6ghc5jz83f7r4cd4cnpzagbk6jn1w92y-lv-4.51
...
15728640 "/nix/store/pbay9rq3hjapfix5d51acxkplyhrnil4-blas-3/lib/libblas.so.3"
15728640 "/nix/store/pbay9rq3hjapfix5d51acxkplyhrnil4-blas-3/lib/libcblas.so.3"
15728640 "/nix/store/7d0pxnfwsjk9wynnyg0bajscx6n17gmy-lapack-3/lib/liblapacke.so.3"
Showing 1 identical extents of length 1173158 with id 9ed5bac9
Start Filename
0 "/nix/store/lmq1lqzh91knsc2i4a4yf3f58k1qg3s6-libmysofa-1.3.1/share/libmysofa/MIT_KEMAR_normal_pinna.sofa"
Showing 1 identical extents of length 1232695 with id 29000b05
Start Filename
0 "/nix/store/wp6wyf46p68j7c7iy7i0l187bgkr727s-python3-3.10.12/lib/python3.10/ensurepip/_bundled/setuptools-65.5.0-py3-none-any.whl"
Showing 2 identical extents of length 1441792 with id bf04ff05
Start Filename
10616832 "/nix/store/w26n65yi1a2n9ziwna6a5dknanc8xfnc-icu4c-73.2/lib/libicudata.so.73.2"
10616832 "/nix/store/7wl5vy21r1r0kh5mhliqwj1rplzaf8kl-icu4c-73.2/lib/libicudata.so.73.2"
Showing 1 identical extents of length 2055563 with id 80a4ef42
Start Filename
0 "/nix/store/wp6wyf46p68j7c7iy7i0l187bgkr727s-python3-3.10.12/lib/python3.10/ensurepip/_bundled/pip-23.0.1-py3-none-any.whl"
Showing 2 identical extents of length 2490368 with id a3daa133
Start Filename
655360 "/nix/store/w26n65yi1a2n9ziwna6a5dknanc8xfnc-icu4c-73.2/lib/libicudata.so.73.2"
655360 "/nix/store/7wl5vy21r1r0kh5mhliqwj1rplzaf8kl-icu4c-73.2/lib/libicudata.so.73.2"
Total files scanned: 2
Total extent hashes scanned: 3
# also has 4 small files, built against older libc version
$ ./duperemove -r --hashfile=h.db --ignore-existing-files /nix/store/7bzkrs0vgvr4fim4ayy0i8zqrf0z84zm-lv-4.51
...
15728640 "/nix/store/pbay9rq3hjapfix5d51acxkplyhrnil4-blas-3/lib/libblas.so.3"
15728640 "/nix/store/pbay9rq3hjapfix5d51acxkplyhrnil4-blas-3/lib/libcblas.so.3"
15728640 "/nix/store/7d0pxnfwsjk9wynnyg0bajscx6n17gmy-lapack-3/lib/liblapacke.so.3"
Showing 1 identical extents of length 1173158 with id 9ed5bac9
Start Filename
0 "/nix/store/lmq1lqzh91knsc2i4a4yf3f58k1qg3s6-libmysofa-1.3.1/share/libmysofa/MIT_KEMAR_normal_pinna.sofa"
Showing 1 identical extents of length 1232695 with id 29000b05
Start Filename
0 "/nix/store/wp6wyf46p68j7c7iy7i0l187bgkr727s-python3-3.10.12/lib/python3.10/ensurepip/_bundled/setuptools-65.5.0-py3-none-any.whl"
Showing 2 identical extents of length 1441792 with id bf04ff05
Start Filename
10616832 "/nix/store/w26n65yi1a2n9ziwna6a5dknanc8xfnc-icu4c-73.2/lib/libicudata.so.73.2"
10616832 "/nix/store/7wl5vy21r1r0kh5mhliqwj1rplzaf8kl-icu4c-73.2/lib/libicudata.so.73.2"
Showing 1 identical extents of length 2055563 with id 80a4ef42
Start Filename
0 "/nix/store/wp6wyf46p68j7c7iy7i0l187bgkr727s-python3-3.10.12/lib/python3.10/ensurepip/_bundled/pip-23.0.1-py3-none-any.whl"
Showing 2 identical extents of length 2490368 with id a3daa133
Start Filename
655360 "/nix/store/w26n65yi1a2n9ziwna6a5dknanc8xfnc-icu4c-73.2/lib/libicudata.so.73.2"
655360 "/nix/store/7wl5vy21r1r0kh5mhliqwj1rplzaf8kl-icu4c-73.2/lib/libicudata.so.73.2"
Total files scanned: 2
Total extent hashes scanned: 3
It looks like libicudata.so.73.2 (and many others) gets brought up every time a new run is started.
Do I understand correctly that duperemove keeps re-considering the same extents, unrelated to the incremental input, for deduplication?
If I run it with --quiet and strip the header, here is how the progress looks:
Ingesting /nix/store/hxnwgjqv2ghx2ms9mqz4cz5w4mhgbc6y-file-5.45-dev
Found 18649 identical extents.
Simple read and compare of file data found 16938 instances of extents that might benefit from deduplication.
Total files scanned: 3
Ingesting /nix/store/7pynr1g9si15n72l0wzp8nz4qf5xb36x-libdbusmenu-qt
Found 18649 identical extents.
Simple read and compare of file data found 16938 instances of extents that might benefit from deduplication.
Total files scanned: 67
Ingesting /nix/store/n604q5jmihjc3qfhgbfwknrlsqbwj19x-libkate-0.4.1
Found 18523 identical extents.
Simple read and compare of file data found 16812 instances of extents that might benefit from deduplication.
Total files scanned: 154
Ingesting /nix/store/bdds9wax3qsnhjxfkinkxzdav3fdxriq-python3.10-pyudev-0.24.1
Found 18523 identical extents.
Simple read and compare of file data found 16812 instances of extents that might benefit from deduplication.
Total files scanned: 57
I suspect this means duperemove keeps reconsidering roughly the same ~18K extents for deduplication on each run.
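A quick way to confirm that suspicion would be to diff the candidate lists of two consecutive runs (the log file names here are hypothetical; the 'Showing ... identical extents' lines are from the output format above):
$ diff <(grep '^Showing' run1.log) <(grep '^Showing' run2.log)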
Hello
You are correct, existing files are always considered for deduplication. To be more specific, any new extent is matched against the whole dataset.
There is a remaining optimisation: deduplicating the new extent against a single old extent (and not the whole list). All old extents are already deduplicated, so they live at the same physical position. That is a more difficult change, I'll think about it.
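A rough illustration of that observation (assuming filefrag from e2fsprogs is available; the file names are placeholders): copies that were already deduplicated report the same physical offsets.
# Matching physical_offset columns mean the two files share extents on disk.
$ filefrag -v old-copy-a old-copy-b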
--dedupe-option=norescan_files test: all good
Tested on a real system as:
$ find /nix/store -maxdepth 1 -type d | fgrep -v '/nix/store/.links' | grep -Pv /nix/store$ | xargs -t -n 1000 ./duperemove -q -r --hashfile=h.db --dedupe-option=norescan_files
The drop-v2 branch behaves quite nicely! Thank you!
Running directory-at-a-time is a bit slow (each directory takes at least about a second to process), but batching directories 100-at-a-time or 1000-at-a-time (as above) makes it very snappy!
My final database is 3.5G (looks decent).
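For a quick sanity check of what ended up in it, the hashstats tool used earlier in this thread can be pointed at the hashfile:
$ hashstats h.db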
--dedupe-option=norescan_files,partial test: seems to hang
A possibly related question: incremental mode is also useful for seeing whether duperemove gets stuck or keeps making progress when run against a small directory at a time. I tried to increase deduplication precision by throwing --dedupe-option=norescan_files,partial at my 3.5G database:
$ find /nix/store -maxdepth 1 -type d | fgrep -v '/nix/store/.links' | grep -Pv /nix/store$ | xargs -t -n 1 ./duperemove -q -r --hashfile=h.db --dedupe-option=norescan_files,partial
./duperemove -q -r '--hashfile=h.db' '--dedupe-option=norescan_files,partial' /nix/store/57904rflsfh46y7fxi62rzc7sidc9vmq-user-environment
No dedupe candidates found.
./duperemove -q -r '--hashfile=h.db' '--dedupe-option=norescan_files,partial' /nix/store/kpadhaqz7vx83rrdvnwrjndkxq9s5iil-roboto-mono-2.002-20190125
Found 1552412 identical extents.
And the output is stuck on it (30+ minutes without progress). The directory does not look big:
$ du -hs /nix/store/kpadhaqz7vx83rrdvnwrjndkxq9s5iil-roboto-mono-2.002-20190125
1,2M /nix/store/kpadhaqz7vx83rrdvnwrjndkxq9s5iil-roboto-mono-2.002-20190125
$ find /nix/store/kpadhaqz7vx83rrdvnwrjndkxq9s5iil-roboto-mono-2.002-20190125 | wc -l
14
Did duperemove get stuck indefinitely, or is it just super slow?
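A hedged way to tell the two apart, using strace as earlier in this thread ($PID stands for the duperemove process id, e.g. from pgrep duperemove):
# Continuous read/ioctl traffic means it is slow but progressing; long silence
# after attaching suggests a genuine hang.
$ strace -f -p "$PID" -e trace=read,pread64,ioctl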
@trofi Could you check out that branch again? I have just pushed some more code.
Also note that you will need to recreate the hashfile. The partial mode is only used for new (or modified) files.
Deleted hash file, checked out latest branch state. Getting the following crash:
$ find /nix/store -maxdepth 1 -type d | fgrep -v '/nix/store/.links' | grep -Pv /nix/store$ | LANG=C xargs -t -n 1 ./duperemove -r --hashfile=h.db --dedupe-option=norescan_files,partial
./duperemove -r '--hashfile=h.db' '--dedupe-option=norescan_files,partial' /nix/store/57904rflsfh46y7fxi62rzc7sidc9vmq-user-environment
Gathering file list...
Adding files from database for hashing.
No dedupe candidates found.
./duperemove -r '--hashfile=h.db' '--dedupe-option=norescan_files,partial' /nix/store/kpadhaqz7vx83rrdvnwrjndkxq9s5iil-roboto-mono-2.002-20190125
Gathering file list...
Adding files from database for hashing.
Loading only duplicated hashes from hashfile.
Found 1 identical extents.
ERROR: hash-tree.c:68
[stack trace follows]
./duperemove(print_stack_trace+0x2e) [0x40a48e]
./duperemove(insert_hashed_block+0x3f1) [0x406fe1]
./duperemove(dbfile_load_block_hashes+0x86) [0x410e06]
./duperemove(process_duplicates+0x1a7) [0x406777]
./duperemove(populate_tree+0x25a) [0x40c32a]
./duperemove(main+0x8bc) [0x405f7c]
/nix/store/905gkx2q1pswixwmi1qfhfl6mik3f22l-glibc-2.37-8/lib/libc.so.6(+0x23ace) [0x7ffff7b63ace]
/nix/store/905gkx2q1pswixwmi1qfhfl6mik3f22l-glibc-2.37-8/lib/libc.so.6(__libc_start_main+0x89) [0x7ffff7b63b89]
./duperemove(_start+0x25) [0x406505]
xargs: ./duperemove: terminated by signal 6
gdb's backtrace, if it's of any help:
Program received signal SIGABRT, Aborted.
__pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
44 pthread_kill.c: No such file or directory.
(gdb) bt
#0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1 0x00007ffff7bc7af3 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#2 0x00007ffff7b78c86 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3 0x00007ffff7b628ba in __GI_abort () at abort.c:79
#4 0x0000000000406fe6 in insert_file_hash_head (head=<optimized out>, dups=<optimized out>) at hash-tree.c:128
#5 add_file_hash_head (block=<optimized out>, dups=<optimized out>) at hash-tree.c:151
#6 insert_hashed_block (tree=tree@entry=0x7fffffffaa30, digest=digest@entry=0x430348 "[\203\371\210\272\344UV+@G\035\022\344:\240", file=0x6206e0, loff=loff@entry=0, flags=flags@entry=8) at hash-tree.c:290
#7 0x0000000000410e06 in dbfile_load_block_hashes (hash_tree=hash_tree@entry=0x7fffffffaa30) at dbfile.c:1357
#8 0x0000000000406777 in process_duplicates () at duperemove.c:578
#9 0x000000000040c32a in populate_tree (cfg=cfg@entry=0x41b300 <dbfile_cfg>, batch_size=0, callback=callback@entry=0x4065d0 <process_duplicates>, scanned_filerec=scanned_filerec@entry=0x41b340 <scanned_filerec>)
at file_scan.c:1034
#10 0x0000000000405f7c in create_update_hashfile (filelist_idx=<optimized out>, argv=<optimized out>, argc=<optimized out>) at duperemove.c:652
#11 main (argc=<optimized out>, argv=<optimized out>) at duperemove.c:741
@trofi I have struggled to reproduce your issue :/
On the other hand, after checking a bit deeper, that commit is buggy, so I have reworked it.
Tried commit 4a48838. It is able to run against a few directories successfully and then fails:
$ rm h.db
$ find /nix/store -maxdepth 1 -type d | fgrep -v '/nix/store/.links' | grep -Pv /nix/store$ | LANG=C xargs -t -n 1 ./duperemove -r --hashfile=h.db --dedupe-option=norescan_files,partial
...
[19/19] (100.00%) csum: /nix/store/hhcnwljvhlg4ib1k7zc27pg23611sh64-glibc-2.37-8-bin/bin/getconf
Loading only duplicated hashes from hashfile.
Found 1 identical extents.
ERROR: hash-tree.c:68
[stack trace follows]
./duperemove(print_stack_trace+0x2e) [0x40a40e]
./duperemove(insert_hashed_block+0x3f1) [0x406f61]
./duperemove(dbfile_load_block_hashes+0x86) [0x410d76]
./duperemove(process_duplicates+0x138) [0x4066f8]
./duperemove(populate_tree+0x1b6) [0x40c206]
./duperemove(main+0x8b4) [0x405f74]
/nix/store/aw2fw9ag10wr9pf0qk4nk5sxi0q0bn56-glibc-2.37-8/lib/libc.so.6(+0x23ace) [0x7ffff7b63ace]
/nix/store/aw2fw9ag10wr9pf0qk4nk5sxi0q0bn56-glibc-2.37-8/lib/libc.so.6(__libc_start_main+0x89) [0x7ffff7b63b89]
./duperemove(_start+0x25) [0x4064f5]
xargs: ./duperemove: terminated by signal 6
gdb backtrace:
Program received signal SIGABRT, Aborted.
0x00007ffff7bc7a8c in __pthread_kill_implementation () from /nix/store/aw2fw9ag10wr9pf0qk4nk5sxi0q0bn56-glibc-2.37-8/lib/libc.so.6
(gdb) bt
#0 0x00007ffff7bc7a8c in __pthread_kill_implementation () from /nix/store/aw2fw9ag10wr9pf0qk4nk5sxi0q0bn56-glibc-2.37-8/lib/libc.so.6
#1 0x00007ffff7b78c86 in raise () from /nix/store/aw2fw9ag10wr9pf0qk4nk5sxi0q0bn56-glibc-2.37-8/lib/libc.so.6
#2 0x00007ffff7b628ba in abort () from /nix/store/aw2fw9ag10wr9pf0qk4nk5sxi0q0bn56-glibc-2.37-8/lib/libc.so.6
#3 0x0000000000406f66 in insert_file_hash_head (head=<optimized out>, dups=<optimized out>) at hash-tree.c:128
#4 add_file_hash_head (block=<optimized out>, dups=<optimized out>) at hash-tree.c:151
#5 insert_hashed_block (tree=tree@entry=0x7fffffffaaa0, digest=digest@entry=0x4307c8 "[\203\371\210\272\344UV+@G\035\022\344:\240", file=0x45b4e0,
loff=loff@entry=0, flags=flags@entry=8) at hash-tree.c:290
#6 0x0000000000410d76 in dbfile_load_block_hashes (hash_tree=hash_tree@entry=0x7fffffffaaa0) at dbfile.c:1357
#7 0x00000000004066f8 in process_duplicates () at duperemove.c:575
#8 0x000000000040c2d6 in populate_tree (cfg=cfg@entry=0x41b300 <dbfile_cfg>, batch_size=0, callback=callback@entry=0x4065c0 <process_duplicates>)
at file_scan.c:1039
#9 0x0000000000405f74 in create_update_hashfile (filelist_idx=<optimized out>, argv=<optimized out>, argc=<optimized out>) at duperemove.c:644
#10 main (argc=<optimized out>, argv=<optimized out>) at duperemove.c:732
There is another bug report that seems similar: #285
@trofi Could you send me your hashfile (the one that produces the crash)?
Sure! https://slyfox.uni.cx/b/2023-09-09-h.db is the 3MB h.db I got. It should not contain any sensitive data.
Thank you!
sqlite> SELECT * FROM hashes JOIN (SELECT DISTINCT digest FROM hashes WHERE digest IN (SELECT DISTINCT digest FROM hashes WHERE ino IN (SELECT DISTINCT ino FROM files WHERE dedupe_seq > (SELECT keyval FROM config WHERE keyname = 'dedupe_sequence'))) GROUP BY digest HAVING count(*) > 1) AS duplicate_hashes on hashes.digest = duplicate_hashes.digest;
[���UV+@G�:�|3730010316|848|0|8|[���UV+@G�:�
[���UV+@G�:�|3730010316|848|0|8|[���UV+@G�:�
So there are indeed duplicate records in the hashfile.
I have added unique constraints to prevent such records from being inserted. This way, you will crash earlier :)
I will dig deeper.
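A minimal sketch of the kind of constraint described above, with column names guessed from the query output (the actual schema and the constraint added upstream may differ):
$ sqlite3 h.db "CREATE UNIQUE INDEX IF NOT EXISTS hashes_unique ON hashes(ino, subvol, loff);"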
I think it detects the constraint violation now. But it looks like that does not cause duperemove to stop; instead it loops over and over:
$ ./duperemove -q -r '--hashfile=h.db' '--dedupe-option=norescan_files,partial' /nix/store/hhcnwljvhlg4ib1k7zc27pg23611sh64-glibc-2.37-8-bin
Warning: The hash file format in Duperemove master branch is under development and may change.
If the changes are not backwards compatible, you will have to re-create your hash file.
dbfile_store_block_hashes()/2739581: Database error 19 while executing statement: constraint failed
Skipping file due to error 19 from function csum_by_extent (No such device), /nix/store/hhcnwljvhlg4ib1k7zc27pg23611sh64-glibc-2.37-8-bin/bin/getconf
dbfile_begin_trans()/2739578: Database error 1 while starting transaction: SQL logic error
Skipping file due to error 1 from function csum_by_extent (Operation not permitted), /nix/store/hhcnwljvhlg4ib1k7zc27pg23611sh64-glibc-2.37-8-bin/bin/tzselect
dbfile_begin_trans()/2739586: Database error 1 while starting transaction: SQL logic error
Skipping file due to error 1 from function csum_by_extent (Operation not permitted), /nix/store/hhcnwljvhlg4ib1k7zc27pg23611sh64-glibc-2.37-8-bin/bin/xtrace
dbfile_begin_trans()/2739579: Database error 1 while starting transaction: SQL logic error
Skipping file due to error 1 from function csum_by_extent (Operation not permitted), /nix/store/hhcnwljvhlg4ib1k7zc27pg23611sh64-glibc-2.37-8-bin/bin/localedef
dbfile_begin_trans()/2739570: Database error 1 while starting transaction: SQL logic error
Skipping file due to error 1 from function csum_by_extent (Operation not permitted), /nix/store/hhcnwljvhlg4ib1k7zc27pg23611sh64-glibc-2.37-8-bin/bin/nscd
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Found 0 identical extents.
...
Nothing fancy on the subvolumes side, but yes on hardlinks: all identical files are already hardlinked to the same inodes:
$ LANG=C stat /nix/store/3n58xw4373jp0ljirf06d8077j15pc4j-glibc-2.37-8/libexec/getconf/POSIX_V6_LP64_OFF64 /nix/store/hhcnwljvhlg4ib1k7zc27pg23611sh64-glibc-2.37-8-bin/bin/getconf
File: /nix/store/3n58xw4373jp0ljirf06d8077j15pc4j-glibc-2.37-8/libexec/getconf/POSIX_V6_LP64_OFF64
Size: 29248 Blocks: 64 IO Block: 4096 regular file
Device: 0,30 Inode: 3730010316 Links: 5
Access: (0555/-r-xr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2023-08-05 11:00:51.000000000 +0100
Modify: 1970-01-01 01:00:01.000000000 +0100
Change: 2023-08-05 11:00:52.321137476 +0100
Birth: 2023-08-05 11:00:51.598124397 +0100
File: /nix/store/hhcnwljvhlg4ib1k7zc27pg23611sh64-glibc-2.37-8-bin/bin/getconf
Size: 29248 Blocks: 64 IO Block: 4096 regular file
Device: 0,30 Inode: 3730010316 Links: 5
Access: (0555/-r-xr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2023-08-05 11:00:51.000000000 +0100
Modify: 1970-01-01 01:00:01.000000000 +0100
Change: 2023-08-05 11:00:52.321137476 +0100
Birth: 2023-08-05 11:00:51.598124397 +0100
I have 4 names for it:
$ find /nix/store/ -inum 3730010316
/nix/store/.links/0miarylrg34m2gn2hji7a8q6rvf96zicp93q7rf87w7dwfmc41l5
/nix/store/3n58xw4373jp0ljirf06d8077j15pc4j-glibc-2.37-8/libexec/getconf/POSIX_V6_LP64_OFF64
/nix/store/3n58xw4373jp0ljirf06d8077j15pc4j-glibc-2.37-8/libexec/getconf/POSIX_V7_LP64_OFF64
/nix/store/3n58xw4373jp0ljirf06d8077j15pc4j-glibc-2.37-8/libexec/getconf/XBS5_LP64_OFF64
/nix/store/hhcnwljvhlg4ib1k7zc27pg23611sh64-glibc-2.37-8-bin/bin/getconf
OK, more bugs were squashed.
I have added more constraints to the hashfile database and, hopefully, fixed your issue (c89a771).
Also, the partial lookup is still a bit slow on my system; it appears that the lookaside allocator is disabled at compile time in libsqlite3 on Debian. An issue has been raised.
edit: more bugs were found (and fixed with 1657ab8)
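One way to check how the system SQLite was built (PRAGMA compile_options is standard SQLite; whether a lookaside-related flag shows up depends on how it was disabled):
# Lists compile-time options; OMIT_LOOKASIDE, if present, would confirm the
# allocator was compiled out. Other mechanisms may not show up here.
$ sqlite3 :memory: 'PRAGMA compile_options;' | grep -i lookaside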
Somehow I am seeing SIGSEGVs in multithreaded mode with ./duperemove -q -r '--hashfile=h.db' '--dedupe-options=norescan_files' -d ... (commit 5ad3361).
gdb backtrace:
(gdb) bt
#0 0x00007f4b16fbcded in sqlite3DbMallocRawNN ()
from /nix/store/d9n3qjjnzwna2igf8pr4z896ssxmyxfp-sqlite-3.42.0/lib/libsqlite3.so.0
#1 0x00007f4b16fc04fb in sqlite3DbStrDup ()
from /nix/store/d9n3qjjnzwna2igf8pr4z896ssxmyxfp-sqlite-3.42.0/lib/libsqlite3.so.0
#2 0x00007f4b16fce527 in sqlite3VdbeScanStatus ()
from /nix/store/d9n3qjjnzwna2igf8pr4z896ssxmyxfp-sqlite-3.42.0/lib/libsqlite3.so.0
#3 0x00007f4b16fce5db in sqlite3WhereAddScanStatus ()
from /nix/store/d9n3qjjnzwna2igf8pr4z896ssxmyxfp-sqlite-3.42.0/lib/libsqlite3.so.0
#4 0x00007f4b1703a4db in sqlite3WhereBegin ()
from /nix/store/d9n3qjjnzwna2igf8pr4z896ssxmyxfp-sqlite-3.42.0/lib/libsqlite3.so.0
#5 0x00007f4b17061484 in sqlite3Update.constprop.0 ()
from /nix/store/d9n3qjjnzwna2igf8pr4z896ssxmyxfp-sqlite-3.42.0/lib/libsqlite3.so.0
#6 0x00007f4b170445a4 in yy_reduce.constprop ()
from /nix/store/d9n3qjjnzwna2igf8pr4z896ssxmyxfp-sqlite-3.42.0/lib/libsqlite3.so.0
#7 0x00007f4b17045de7 in sqlite3RunParser ()
from /nix/store/d9n3qjjnzwna2igf8pr4z896ssxmyxfp-sqlite-3.42.0/lib/libsqlite3.so.0
#8 0x00007f4b1704672f in sqlite3Prepare ()
from /nix/store/d9n3qjjnzwna2igf8pr4z896ssxmyxfp-sqlite-3.42.0/lib/libsqlite3.so.0
#9 0x00007f4b17046ba3 in sqlite3LockAndPrepare ()
from /nix/store/d9n3qjjnzwna2igf8pr4z896ssxmyxfp-sqlite-3.42.0/lib/libsqlite3.so.0
#10 0x00007f4b17046d06 in sqlite3_prepare_v2 ()
from /nix/store/d9n3qjjnzwna2igf8pr4z896ssxmyxfp-sqlite-3.42.0/lib/libsqlite3.so.0
#11 0x000000000040f28e in dbfile_update_extent_poff (db=db@entry=0x11c8778,
ino=4451283826, subvol=848, loff=3276800, poff=2009557819392) at dbfile.c:343
#12 0x000000000040d6dc in extent_dedupe_worker (kern_bytes=0x7f4b067fbd80,
fiemap_bytes=<synthetic pointer>, dext=<optimized out>) at run_dedupe.c:518
#13 dedupe_worker (priv=<optimized out>, counts=0x7ffee50daf20) at run_dedupe.c:535
#14 0x00007f4b1716511a in g_thread_pool_thread_proxy ()
from /nix/store/hl36mh6qdb8a7ymhql4qmfza3cdr24in-glib-2.76.4/lib/libglib-2.0.so.0
#15 0x00007f4b171647cd in g_thread_proxy ()
from /nix/store/hl36mh6qdb8a7ymhql4qmfza3cdr24in-glib-2.76.4/lib/libglib-2.0.so.0
#16 0x00007f4b16e25dd4 in start_thread () from /nix/store/aw2fw9ag10wr9pf0qk4nk5sxi0q0bn56-glibc-2.37-8/lib/libc.so.6
#17 0x00007f4b16ea79b0 in clone3 () from /nix/store/aw2fw9ag10wr9pf0qk4nk5sxi0q0bn56-glibc-2.37-8/lib/libc.so.6
Adding --io-threads=1 --cpu-threads=1 makes the crash go away.
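Until that is sorted out, the workaround mentioned above would look roughly like this (the target directory is a placeholder):
# Force single-threaded hashing and dedupe to avoid the SQLite crash.
$ ./duperemove -q -r --hashfile=h.db --dedupe-options=norescan_files -d --io-threads=1 --cpu-threads=1 /nix/store/some-directory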
Hello @trofi
All the code has been merged into master.
I am currently running it against my live system; don't hesitate to report any issues you find.
Thank you
Sounds good. Let's close this issue then.