Python and FUSE based deduplication filesystem
A simple deduplicating FUSE-based filesystem with very limited memory requirements.
This version is for beta production.
Requirements:
- fuse
- hashlib - builtin python
- cPickle - builtin python
- gdbm - used to store block digests
Concept:
PyDedupFS deduplicates blocks of data of a fixed length (default 128k). These blocks are hashed with hashlib (default hashlib.sha1). For every file there is also a whole-file digest computed with the same hash function, so you can verify a stored file against the digest of the original.
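The block hashing described above can be sketched as follows. This is a minimal illustration, not the actual PyDedupFS code: the function name `digest_blocks` and the constant `BLOCKSIZE` are hypothetical, and only the 128k default and hashlib.sha1 are taken from the text.

```python
import hashlib
import io

BLOCKSIZE = 128 * 1024  # default block size named in the text (128k)


def digest_blocks(stream, blocksize=BLOCKSIZE, hashfunc=hashlib.sha1):
    """Return (whole_file_hexdigest, [block_hexdigest, ...]) for a binary stream.

    Hypothetical helper: hashes every fixed-length block and, in the same
    pass, feeds the data into a digest over the whole file.
    """
    file_hash = hashfunc()
    sequence = []
    while True:
        block = stream.read(blocksize)
        if not block:
            break
        file_hash.update(block)
        sequence.append(hashfunc(block).hexdigest())
    return file_hash.hexdigest(), sequence


# Two identical full blocks followed by a short tail block.
data = b"A" * BLOCKSIZE * 2 + b"tail"
file_digest, sequence = digest_blocks(io.BytesIO(data))
print(len(sequence), len(set(sequence)))  # 3 blocks, 2 distinct digests
```

The two identical blocks produce the same hexdigest, which is what allows them to share one stored block.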
PyDedupFS is designed to be simple and uses as many features of the underlying filesystem as possible. Blocks are stored as ordinary files with their hexdigest as name.
Consequently, a file stored in PyDedupFS is stored as multiple files:
- one file for original file information and meta data
- up to (size of file / blocksize, rounded up) files for stored blocks
Files are stored in the real filesystem, but the content of each file is cPickled information on how to assemble the original data: a digest over the whole file, a stat structure, and the block sequence (a Python tuple (digest, st, sequence)). You can get this information simply by reading the file with cPickle.load.
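A minimal sketch of writing and reading such a metadata file is shown here. It uses Python 3's `pickle` module in place of `cPickle`. The helper names `write_meta` and `read_meta` are hypothetical, and the stat structure is represented as a plain dict, which is an assumption; the source only says that a stat structure is stored.

```python
import hashlib
import os
import pickle  # Python 3 counterpart of cPickle
import tempfile


def write_meta(path, digest, st, sequence):
    """Store the (digest, st, sequence) tuple in the metadata file (hypothetical helper)."""
    with open(path, "wb") as fh:
        pickle.dump((digest, st, sequence), fh)


def read_meta(path):
    """Load the (digest, st, sequence) tuple back from the metadata file."""
    with open(path, "rb") as fh:
        digest, st, sequence = pickle.load(fh)
    return digest, st, sequence


content = b"test"
digest = hashlib.sha1(content).hexdigest()
st = {"st_mode": 0o100644, "st_size": len(content), "st_mtime": 0.0}  # assumed layout
sequence = [digest]  # a single block, identical to the whole file here

with tempfile.TemporaryDirectory() as tmp:
    meta_path = os.path.join(tmp, "XYZ.txt")
    write_meta(meta_path, digest, st, sequence)
    loaded = read_meta(meta_path)
```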
The only database is gdbm, which maps each block hash to a reference counter. The reference counter is needed to find existing blocks and, on delete operations, to delete only unused blocks.
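The reference counting can be sketched as below. This is an illustration under assumptions: it uses the standard library's `dbm.dumb` as a portable stand-in for gdbm, and the helper names `incref` and `decref` as well as the way counters are encoded are hypothetical.

```python
import dbm.dumb  # portable stand-in for gdbm in this sketch
import os
import tempfile


def incref(db, hexdigest):
    """Increment the counter for a block; return True if the block is new."""
    key = hexdigest.encode("ascii")
    count = int(db[key]) if key in db else 0
    db[key] = str(count + 1).encode("ascii")
    return count == 0


def decref(db, hexdigest):
    """Decrement the counter; return True if the block is now unused."""
    key = hexdigest.encode("ascii")
    count = int(db[key]) - 1
    if count <= 0:
        del db[key]
        return True
    db[key] = str(count).encode("ascii")
    return False


with tempfile.TemporaryDirectory() as tmp:
    db = dbm.dumb.open(os.path.join(tmp, "blockdigest"), "c")
    first = incref(db, "abc")        # new block: its data would be written
    second = incref(db, "abc")       # known block: only the counter grows
    freed_early = decref(db, "abc")  # still referenced by one file
    freed_late = decref(db, "abc")   # last reference gone: block may be deleted
    db.close()
```

Keeping only the digest-to-counter mapping in the database is what keeps memory use low: the block data itself never passes through the database.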
Differences to other deduplicating filesystems:
- does not use huge amounts of memory
- does not use a database for file structure. The files reside in the real filesystem, but they are only filled with information on how to assemble the real data. Filesystems have done that for a long time, so why implement a filesystem structure in an RDBMS? It will be slower in most cases (tested in experimental branch).
- does not use a database for block storage in blobs. Filesystems can store data better than databases, and the database overhead is significant (tested in experimental branch).
- disk-based block digest dictionary, based on gdbm. It is robust and standard, and uses minimal memory.
Architecture:
PyDedupFS creates three directories under BASE (option --base):
- BASE/meta: the gdbm database for block digests and reference counters is stored in this directory
- BASE/files: original files, filled with assembling information
- BASE/blocks: blocks of data with hexdigest as name
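The layout above could be created along these lines. This is only a sketch: the function names `init_base` and `block_path` are hypothetical, and only the three directory names come from the text.

```python
import os
import tempfile

SUBDIRS = ("meta", "files", "blocks")  # directory names taken from the text


def init_base(base):
    """Create BASE/meta, BASE/files and BASE/blocks; return their paths."""
    paths = {}
    for name in SUBDIRS:
        paths[name] = os.path.join(base, name)
        os.makedirs(paths[name], exist_ok=True)
    return paths


def block_path(paths, hexdigest):
    """Path of the ordinary file that holds the block with this hexdigest."""
    return os.path.join(paths["blocks"], hexdigest)


with tempfile.TemporaryDirectory() as base:
    paths = init_base(base)
    created = sorted(os.listdir(base))
    example = block_path(paths, "da39a3ee")  # shortened digest, for illustration only
```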
How it works:
A short explanation of how PyDedupFS works:
Read from file XYZ.txt
- get real filename for XYZ.txt under directory
- get information out of /XYZ.txt with cPickle.load()
- get sequence of blocks to assemble data
- add 0x00 EOF
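The read steps above can be sketched roughly as follows. This assumes the metadata file holds a pickled (digest, st, sequence) tuple and that each block is a file named by its hexdigest. The function `read_file` and the parameters `files_dir` and `blocks_dir` are hypothetical, Python 3's `pickle` stands in for `cPickle`, and the final "add 0x00 EOF" step is left out because the text does not specify it further. The digest check at the end is an addition for illustration.

```python
import hashlib
import os
import pickle
import tempfile


def read_file(files_dir, blocks_dir, name, hashfunc=hashlib.sha1):
    """Reassemble a stored file from its metadata and blocks (hypothetical helper)."""
    with open(os.path.join(files_dir, name), "rb") as fh:
        digest, st, sequence = pickle.load(fh)
    chunks = []
    for hexdigest in sequence:
        with open(os.path.join(blocks_dir, hexdigest), "rb") as fh:
            chunks.append(fh.read())
    data = b"".join(chunks)
    # Verify the result against the stored whole-file digest.
    if hashfunc(data).hexdigest() != digest:
        raise IOError("whole-file digest mismatch for %s" % name)
    return data


with tempfile.TemporaryDirectory() as base:
    files_dir = os.path.join(base, "files")
    blocks_dir = os.path.join(base, "blocks")
    os.makedirs(files_dir)
    os.makedirs(blocks_dir)

    # Build a tiny store by hand: two blocks and one metadata file.
    blocks = [b"hello ", b"world"]
    sequence = []
    for block in blocks:
        hexdigest = hashlib.sha1(block).hexdigest()
        with open(os.path.join(blocks_dir, hexdigest), "wb") as fh:
            fh.write(block)
        sequence.append(hexdigest)
    original = b"".join(blocks)
    meta = (hashlib.sha1(original).hexdigest(), {"st_size": len(original)}, sequence)
    with open(os.path.join(files_dir, "XYZ.txt"), "wb") as fh:
        pickle.dump(meta, fh)

    restored = read_file(files_dir, blocks_dir, "XYZ.txt")
```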
Write file XYZ.txt
- split file inline into 128k blocks
- store each block if it does not exist yet, and record its digest in the block digest dictionary
- store file information (digest of whole file, stat struct, and block sequence) in /XYZ.txt
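The write steps above can be sketched as below. This is a simplified illustration that works on a complete byte string instead of FUSE write calls. The function `write_file`, its parameters, and the dict-based stat structure are hypothetical, and a plain dict stands in for the gdbm reference counter.

```python
import hashlib
import os
import pickle
import tempfile


def write_file(files_dir, blocks_dir, refcount, name, data,
               blocksize=128 * 1024, hashfunc=hashlib.sha1):
    """Split data into blocks, store unseen blocks, then write the metadata file."""
    file_hash = hashfunc()
    sequence = []
    for offset in range(0, len(data), blocksize):
        block = data[offset:offset + blocksize]
        file_hash.update(block)
        hexdigest = hashfunc(block).hexdigest()
        path = os.path.join(blocks_dir, hexdigest)
        if not os.path.exists(path):  # store the block only once
            with open(path, "wb") as fh:
                fh.write(block)
        refcount[hexdigest] = refcount.get(hexdigest, 0) + 1
        sequence.append(hexdigest)
    st = {"st_size": len(data)}  # assumed, reduced stat structure
    with open(os.path.join(files_dir, name), "wb") as fh:
        pickle.dump((file_hash.hexdigest(), st, sequence), fh)
    return sequence


with tempfile.TemporaryDirectory() as base:
    files_dir = os.path.join(base, "files")
    blocks_dir = os.path.join(base, "blocks")
    os.makedirs(files_dir)
    os.makedirs(blocks_dir)
    refcount = {}  # stand-in for the gdbm block digest dictionary
    # A tiny block size keeps the example readable.
    sequence = write_file(files_dir, blocks_dir, refcount, "XYZ.txt",
                          b"aaaabbbbaaaa", blocksize=4)
    stored_blocks = os.listdir(blocks_dir)
```

The first and third blocks are identical, so only two block files end up on disk while the sequence still lists three entries.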
Usage:
Download via GitHub
PyDeduFs.py --base=
PyDedupFS runs in the background and by default logs to /tmp/pydedupfs.log
Non Fuse options:
- --base = base directory to store real data
- --hashfunc = hashing function of hashlib to use (sha1, md5, sha256 ...), default "sha1". Do not change this after first use of the filesystem!
- --blocksize = blocksize to split data in, default 128k. Do not change this after first use of the filesystem!
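For illustration, the three options could be parsed as in the following sketch. It uses `argparse`, which is an assumption; the actual program may rely on the option parser of its FUSE binding. The program name, the function `parse_args`, and the choice to accept the block size as a plain number of bytes are hypothetical.

```python
import argparse
import hashlib


def parse_args(argv):
    """Parse the non-FUSE options named in the text (hypothetical sketch)."""
    parser = argparse.ArgumentParser(prog="pydedupfs-sketch")
    parser.add_argument("--base", required=True,
                        help="base directory to store real data")
    parser.add_argument("--hashfunc", default="sha1",
                        choices=sorted(hashlib.algorithms_guaranteed),
                        help="hashlib function used for block and file digests")
    parser.add_argument("--blocksize", type=int, default=128 * 1024,
                        help="block size in bytes (assumed unit)")
    return parser.parse_args(argv)


opts = parse_args(["--base=/tmp/dedup"])
hashfunc = getattr(hashlib, opts.hashfunc)  # e.g. hashlib.sha1
```

Block names are digests of fixed-length blocks, so a store written with one hash function or block size cannot be read back with another. That is why both settings must stay unchanged after first use.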
Fuse options set in program:
- multipathing off
- direct_io off
Logging:
Logging can be adjusted in logging.conf
Filesystem features:
Implements the standard FUSE methods, except:
- symlink - very hard to implement with the concept of PyDedupFS
- readlink - no symlink, no readlink
- truncate
- ftruncate