restic/fakedatafs

consider merging with others

anarcat opened this issue · 3 comments

borg has a similar project, called backupdata, with similar goals. backupdata is a simple command-line tool that creates real data with small variations between files, so that the resulting files can trigger different scenarios in deduplication algorithms.

obnam has its own tool to generate backup data called... genbackupdata. the design is similar to borg's, but like fakedatafs, it is also deterministic.

i was wondering if any thought has been given to reviewing or even merging with those projects instead of maintaining (now at least) 3 different projects in parallel?

if we want a standardized test suite for backup software, it seems to me we should aim at standardizing the corpus generation as well! :)

a similar issue was created in borg at borgbackup/backupdata#2

fd0 commented

Thanks for leaving an issue about this. I was aware of genbackupdata, not so much of backupdata. With fakedatafs, I aim to implement a superset of features, including a FUSE mount.

Implementing as a FUSE fs is a nice idea.

Some questions:

  • is it fast enough? like SSD performance?
  • how does it generate "realistic" data? pure random and pure zeros are easy, but not very realistic.
  • is there something I can read about this code / the current feature set? the README does not tell much.

the code i wrote for borg (backupdata):

  • starts from real input data (problem: distributing that, so everybody has the same data)
  • so the backup tool sees realistic file sizes, file names/types, file content.
  • when "multiplying" that file set, one has to "spoil" the files to counteract deduplication. i just inserted counters every now and then into the files, so the chunker won't cut the same chunks.
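
The "spoiling" step in the last bullet can be sketched roughly like this. This is only an illustration of the idea, not the actual backupdata code: the function name `spoil`, the marker format and the 4096-byte interval are all assumptions made up for the example.

```python
def spoil(data: bytes, copy_index: int, interval: int = 4096) -> bytes:
    """Return a variant of `data` with a small counter marker inserted
    after every `interval` bytes (hypothetical helper, not from backupdata).

    Each copy gets a different `copy_index`, so the markers differ between
    copies and a chunker is unlikely to produce identical chunks for them.
    """
    out = bytearray()
    for n, offset in enumerate(range(0, len(data), interval)):
        out += data[offset:offset + interval]
        # Marker encodes which copy this is and which block we are in.
        out += b"[%d:%d]" % (copy_index, n)
    return bytes(out)


if __name__ == "__main__":
    original = b"abcdefgh"
    print(spoil(original, 1, interval=4))
    print(spoil(original, 2, interval=4))
```

How often a counter has to be inserted depends on the chunk size of the backup tool under test; the interval above is only a placeholder.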

fd0 commented

It is fast: tar c on the mount runs at 350 MiB/s on my machine. The generated data is pseudo-random, not "realistic". I have plans for generating different types of files, e.g. ASCII text, bz2, random data etc. That's not implemented right now. The code is also not documented (aside from the code itself); it is a tool I sometimes use to test restic with large data sets.
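
To illustrate what "deterministic" and "pseudo-random" mean together here, a minimal sketch follows. It is not how fakedatafs generates its data; the function `fake_file_bytes`, the use of SHA-256 in a counter construction, and the seed/path parameters are assumptions chosen only to show that the same inputs always yield the same bytes.

```python
import hashlib


def fake_file_bytes(seed: int, path: str, size: int) -> bytes:
    """Return `size` pseudo-random bytes derived only from `seed` and `path`.

    Hypothetical helper: hashes "seed:path:counter" block by block, so the
    content is reproducible on any machine without storing the data.
    """
    out = bytearray()
    counter = 0
    while len(out) < size:
        block = hashlib.sha256(f"{seed}:{path}:{counter}".encode()).digest()
        out += block
        counter += 1
    # Trim the last block so the result has exactly the requested size.
    return bytes(out[:size])


if __name__ == "__main__":
    a = fake_file_bytes(23, "dir-0/file-1", 100)
    b = fake_file_bytes(23, "dir-0/file-1", 100)
    print(a == b, len(a))
```

With something like this, every tester who uses the same seed gets the same corpus, which sidesteps the problem of distributing real input data, at the cost of the content not being "realistic".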