markfasheh/duperemove

FAQ: memory requirements when using --hashfile


How much memory does duperemove require when using a hashfile, based on hashfile size and/or data?

I am currently trying to deduplicate some 3.5 TB of data with the --hashfile option and am now several days into the "Loading only duplicated hashes from hashfile" phase.

The hashfile is 21.5 GB in size. For the last couple of days, memory usage by duperemove has been oscillating between 12 and 14 GB; my system has 15 GB of RAM plus 8 GB of swap.

Behavior seems to indicate that duperemove tightens its belt as available memory decreases (memory usage is currently at 96%, swap at 74%), but that is hard to tell without examining the source code. I have no idea whether memory will be sufficient to proceed to the next phase.

I understand that all of this largely depends on how much of the data is duplicated (in this case, most of the data on the drive should be present in two physical copies). A progress indicator for this phase would possibly help me make a better estimate.

However, a FAQ entry would help shed some more light on this:

  • How much memory is required per hash and per file when using --hashfile? Does the number of duplicates matter (e.g. 2 vs. 10 identical files)? Can I make any guesstimates based on the hashfile size if I have a rough idea of how much duplicate data is on my drive? (See the sketch after this list.)
  • Is duperemove capable of handling low-memory situations? (E.g. when memory runs out, deduplicate the dupes identified so far to free up some memory, then continue loading hashes.) Or will it just run out of memory and crash?
  • Does duperemove have other belt-tightening potential, such as cached data that speeds things up but can be discarded to free memory when needed, at the expense of degraded performance?
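
For anyone else wondering about guesstimates: since the hashfile is a plain SQLite database, its row counts can be inspected directly. The sketch below deliberately assumes nothing about the schema (table names and columns differ between duperemove versions); it simply prints the row count of every table, which at least shows how many hash/file records the loading phase has to deal with.

```python
#!/usr/bin/env python3
# Rough inspection of a duperemove hashfile (a SQLite database).
# No schema is assumed: table names vary between duperemove versions,
# so we just list every table and print its row count.
import sqlite3
import sys

def main(path):
    con = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
    cur = con.cursor()
    tables = [row[0] for row in cur.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        # Table names come straight from sqlite_master, so quoting them
        # is good enough for a throwaway diagnostic script.
        (count,) = cur.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
        print(f"{table}: {count} rows")
    con.close()

if __name__ == "__main__":
    main(sys.argv[1])
```

Multiplying the number of hash rows by a guess at the per-record in-memory overhead gives an order-of-magnitude estimate at best; the actual cost per record depends on the duperemove version, which is exactly the kind of number a FAQ entry could pin down.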

It seems like the block size matters a lot with regard to memory consumption; see also #288.
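
To put rough numbers on that (back-of-the-envelope arithmetic only, not a measurement of duperemove itself): memory use scales with the number of block hash records, and that number grows inversely with the block size, so 3.5 TB of data at small block sizes means tens to hundreds of millions of records.

```python
# Back-of-the-envelope only: how many block-hash records a ~3.5 TiB
# dataset produces at different block sizes. Actual memory per record
# depends on the duperemove version, so treat this as a scaling
# illustration, not a measurement.
DATA_BYTES = 3.5 * 1024**4  # ~3.5 TiB

for blocksize_kib in (4, 16, 128, 1024):
    blocks = DATA_BYTES / (blocksize_kib * 1024)
    print(f"{blocksize_kib:>5} KiB blocks -> {blocks / 1e6:,.1f} million records")
```

At a 128 KiB block size that is roughly 30 million records; at 4 KiB it is closer to a billion, which is why the choice of -b shows up so clearly in memory consumption.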

Hello @mvglasow

Using --hashfile is always recommended; otherwise, the SQLite database is kept entirely in memory.

Could you try the code from master with the --batchsize option?
I understand that you would like to run duperemove against a large dataset, and this option was added to improve exactly that situation.

Also, could you tell me which duperemove options you are using, especially --dedupe-options and -b (blocksize)?

Hello

Some numbers about the v0.13 release have been published.

Please reopen this if you still feel there is an issue