FAQ: memory requirements when using --hashfile
Closed this issue · 3 comments
How much memory does duperemove require when using a hashfile, based on hashfile size and/or the amount of data?
I am currently trying to deduplicate some 3.5 TB of data with the --hashfile option and am now several days into the "Loading only duplicated hashes from hashfile" phase.
The hashfile is 21.5 GB in size. For the last couple of days, memory usage by duperemove has been oscillating between 12 and 14 GB; my system has 15 GB of memory plus 8 GB of swap.
The behavior seems to indicate that duperemove tightens its belt as available memory decreases (memory usage is currently at 96%, swap at 74%), but that is hard to tell without examining the source code. I have no idea whether memory will be sufficient to proceed to the next phase.
I understand that all of this largely depends on how much of the data is duplicated (in this case, most of the data on the drive should be present in two physical copies). A progress indicator for this phase would possibly help me make a better estimate.
However, a FAQ entry would help shed some more light on this:
- How much memory is required per hash and per file when using --hashfile? Does the number of duplicates matter (e.g. 2 vs. 10 identical files)? Can I make any guesstimates based on hashfile size if I have a rough idea about the amount of dupes on my drive?
- Is duperemove capable of handling low-memory situations? (E.g. when memory runs out, deduplicate the dupes identified so far and free up some memory, then continue loading hashes.) Or will it just run out and crash?
- Does duperemove have other belt-tightening potential, such as cached data that speeds things up but can be deleted to free up memory if needed, at the expense of degraded performance?
It seems like the block size matters a lot with regard to memory consumption; see also #288.
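As a rough back-of-the-envelope illustration (not an official formula; it assumes the default 128K block size and ignores per-file and extent records), the number of block hashes scales inversely with the block size:

```sh
# Approximate block-hash counts for ~3.5 TiB of data (illustrative only)
echo $(( 3584 * 1024 * 1024 * 1024 / (128 * 1024) ))    # ~29.4 million hashes at 128K
echo $(( 3584 * 1024 * 1024 * 1024 / (1024 * 1024) ))   # ~3.7 million hashes at 1M
```

Fewer hashes should mean a smaller hashfile and less memory during the duplicate-loading phase, although the exact savings depend on duperemove's per-record overhead.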
Hello @mvglasow
Using --hashfile is always recommended: otherwise, the sqlite database is stored in memory.
Could you try the code from master with the --batchsize option? I understand that you would like to run duperemove against a large dataset, and this option was made to improve that situation.
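For example (paths and values here are placeholders, not a recommendation for your setup):

```sh
# Keep the hash database on disk with --hashfile; --batchsize (currently in master)
# lets duperemove dedupe the dataset in batches of files instead of all at once.
duperemove -rd --hashfile=/var/tmp/dupes.hash --batchsize=1024 /mnt/data
```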
Also, could you tell me which duperemove options you are using? Especially --dedupe-options and -b (blocksize).
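For reference, an invocation along these lines (illustrative values only; check the man page for the block-size range supported by your version) trades finer-grained matching for a smaller hashfile and lower memory use:

```sh
# A larger block size (-b) produces fewer block hashes; --dedupe-options=same
# additionally allows deduplication of identical extents within the same file.
duperemove -rd -b 1M --dedupe-options=same --hashfile=/var/tmp/dupes.hash /mnt/data
```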
Hello
Some numbers about the v0.13 release have been published.
Please reopen this if you still feel there is an issue.