sumfolder1 is a utility for use within the archival and digital preservation community to generate checksums for file system directories, and to generate an overall "collection" checksum for a given set of files.
Conventionally, checksums exist for files, not for directories: directories have no payload of their own that can be hashed to calculate a digest/checksum.
If it were possible to create checksums for folders or a global checksum for a collection of objects, it would become possible to ask the following:
- What is the collection checksum for a given set of files and folders?
- What is the checksum for a given folder?
- Given a collection of objects online, am I looking at an authentic listing?
- Have I downloaded a collection in its entirety?
- Is file/folder hash(x) included in the collection set?
- Is file/folder hash(y) (non-existent) part of the entire set?
- Is file hash(x) part of folder(y) where the collection has arbitrary depth?
- Where are duplicate checksums located within a collection?
- Has a digital object been removed from the collection?
- Did the collection contain at least one empty directory?
Given a set of file paths and existing checksums it is possible to compute a checksum for a folder by creating a checksum of the given checksums.
Given checksum 1) 7c1f9f9a4d0ce9a72ee63f37a1b7f694 and checksum 2) aececec0bc3f515039aec9e60c413cd3, an MD5 can be computed as: 82f9e9a4305714fffdd7932783980cbc.
We can see this illustrated for a small collection as follows:
📁 folder_1 82f9e9a4305714fffdd7932783980cbc
📄 checksum_1 7c1f9f9a4d0ce9a72ee63f37a1b7f694
📄 checksum_2 aececec0bc3f515039aec9e60c413cd3
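As a minimal sketch of the idea in Python, assuming the hex digests are simply concatenated in sorted order before hashing (sumfolder1's exact scheme may differ in detail):

```python
import hashlib

# The two file checksums from the example above.
checksums = [
    "7c1f9f9a4d0ce9a72ee63f37a1b7f694",
    "aececec0bc3f515039aec9e60c413cd3",
]

# A folder checksum is a checksum over its children's checksums;
# here the hex digests are hashed together in sorted order.
folder_hash = hashlib.md5("".join(sorted(checksums)).encode("utf-8"))
print(folder_hash.hexdigest())
```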
If we follow this approach through an entire directory structure we can create checksums for all sub-directories and for the collection as a whole.
A reference set is provided with this repository.
We can iterate through the directory tree to create sets of directory checksums and a collection checksum: 52b94608dc70813aa88dae01176dc73b.
The reference set then looks as follows:
📁 collection 93778c524035d5d3e429a2fe43b7700a
📄 file_0001 14118ff9ad0344decb37960809b2f17a
📄 file_0000 8cfda2609b880a553759cd6200823f3b
📄 file_0002 a4501ee1a5c711ea9db78a800a24e830
📁 sub_dir_1 82301616d7e24f474dbe21de93af0a34
📄 file_empty d41d8cd98f00b204e9800998ecf8427e
📄 file_0003 dc7f828c5fe622925181d06edada350f
📄 file_0004 e3d90a4bf14a9b355f0e69ba08df522d
📁 sub_1_dir_1 1c7ba27edf1356d097a3f568032430c2
📄 file_0005 637a3fb7da1ab61d10e96336d9758416
📁 sub_dir_2 1ccb49edc4e873f1a8affd4bad5e9b90
📁 sub_dir_3 2a60541cede91a36e9dc5bab7a97dd6e
📁 sub_3_empty_1 db9d848b4f83ff3cb3faa4df0a59e3e1
📁 sub_3_empty_2 1ccb49edc4e873f1a8affd4bad5e9b90
📁 sub_dir_4 272d45767d534335163f220c1d40e559
📄 file_0006 2b43227486ec8744cd5d4c955d269743
📄 file_0007 c5a1973a70e08bf1eee13b8090f790ad
📄 file_0008 fdffe4dd2d39c7d9986dbf5c6ec5ad39
📁 sub_dir_5 d818d29b75f89a9b5d8d1c5a4c70dbbb
📁 sub_5_dir_1 82f9e9a4305714fffdd7932783980cbc
📄 file_0009 7c1f9f9a4d0ce9a72ee63f37a1b7f694
📄 file_0010 aececec0bc3f515039aec9e60c413cd3
📁 sub_dir_6 74be16979710d4c4e7c6647856088456
📄 file_empty d41d8cd98f00b204e9800998ecf8427e
The reference implementation for sumfolder1 does the following, starting from the lowest sub-directory in the tree:
- Check for sub-directories and add the checksums for these to a hash digest in alphabetical order by checksum.
- Add the checksums for files in the directory to the hash digest, again in alphabetical order by checksum.
- Create a digest for the list of checksums.
This is repeated, processing each folder backwards up to the top level.
NB. If a folder is completely empty it is assigned a constant value chosen in the code, 2600_EMPTY_DIRECTORY, which evaluates to an MD5 value of 1ccb49edc4e873f1a8affd4bad5e9b90.
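A minimal sketch of this approach in Python might look as follows. This is illustrative only, not the reference implementation: the empty-directory constant mirrors the one described above, but details such as concatenating the hex digests with no separator are assumptions.

```python
import hashlib
import os

# Constant assigned to completely empty folders (as described above).
EMPTY_DIRECTORY = "2600_EMPTY_DIRECTORY"


def file_md5(path: str) -> str:
    """Checksum a file's payload in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def folder_md5(path: str) -> str:
    """Recursively compute a folder checksum from its children's checksums."""
    with os.scandir(path) as listing:
        entries = list(listing)
    dir_hashes = [folder_md5(entry.path) for entry in entries if entry.is_dir()]
    file_hashes = [file_md5(entry.path) for entry in entries if entry.is_file()]
    if not dir_hashes and not file_hashes:
        # A completely empty folder receives the MD5 of the constant.
        return hashlib.md5(EMPTY_DIRECTORY.encode("utf-8")).hexdigest()
    # Sub-directory checksums first, then file checksums, each ordered
    # alphabetically by checksum.
    ordered = sorted(dir_hashes) + sorted(file_hashes)
    return hashlib.md5("".join(ordered).encode("utf-8")).hexdigest()


print(folder_md5("."))
```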
The concept I have used here is based on Merkle trees and a loose understanding of techniques used in blockchains and in the source control system Git.
A good video summary of Merkle trees can be found on YouTube. A Python tutorial on the subject was also useful in starting this work.
The technique required for a directory tree is a little more convoluted than that of a Merkle tree, which uses binary nodes and evaluates checksums from left to right. I believe the implementation used for sumfolder1 is more closely aligned with a "Radix Tree" or "Patricia Tree"; however, this remains to be explored.
NB. A Merkle tree is typically used in those contexts for performance; sumfolder1 does not yet have a performance use-case.
The reference implementation introduces some terminology that helps with understanding the approach:
- Active-tree: the side of a directory tree that we're querying about a given hash.
- Non-active-tree: the trees at root node (Rn+1) that do not contain the digital object that we're querying.
- Root-node (Rn): the name of the top-level node, i.e. collection folder. This is either artificially created for a set of directories all at the same level, or exists as a function of the given collection set.
Folder objects need to be given additional attributes to enable the algorithm to work.
- Folder-depth, so directories can be grouped and distinguished from one-another by level in the hierarchy.
- Hash, since the goal of this tool is to enable a hash to be calculated for every folder and for the entire collection.
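For illustration, such a folder object might look like the following sketch (the names here are hypothetical, not those of the reference implementation):

```python
from dataclasses import dataclass, field


@dataclass
class FolderEntry:
    """Hypothetical folder object carrying the extra attributes above."""

    path: str
    depth: int  # Folder-depth: level in the hierarchy.
    md5: str = ""  # Hash: computed once the children are known.
    children: list = field(default_factory=list)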
I am trying to make this code as portable as possible, i.e. while it works with DROID-style reports today, it might also work with other checksum-based outputs tomorrow. Additionally, to be able to compare folder structures, this utility may also work with DROID-style reports from later in a transfer workflow, at which point folders and files may have been renamed but their content remains consistent.
To calculate a single folder checksum we currently do the following:
- If there are folders in the directory, order their hashes alphabetically and add them to a list.
- File checksums are then ordered alphabetically and added to the end of the list.
- The checksums in the list are then hashed together to create a new folder-level checksum.
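Condensed into a single (illustrative) Python function, again assuming direct concatenation of the hex digests:

```python
import hashlib


def folder_checksum(subfolder_hashes, file_hashes):
    """Combine child checksums into a single folder-level checksum."""
    ordered = sorted(subfolder_hashes) + sorted(file_hashes)
    return hashlib.md5("".join(ordered).encode("utf-8")).hexdigest()
```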
sumfolder1 uses the DROID format identification report to generate folder level checksums.
DROID can be found at The National Archives UK website:
sumfolder1 can also be used with DROID-compatible reports created by Siegfried, using a command such as the following:
sf --hash=md5 --droid <collection_folder>
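For example, the Siegfried output can be redirected to a file and then passed to sumfolder1 (the report file name here is arbitrary):
sf --hash=md5 --droid <collection_folder> > droid_report.csv
python sumfolder1 --csv droid_report.csv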
File format reports provide a means of statically analyzing collections of digital objects. A DROID report satisfies the pre-conditions required to create reliable folder- and collection-level checksums for digital collections:
- A collection is static, i.e. unlikely to change.
- Digital objects within the collection have checksums.
NB: A collection need not be static to be analyzed, but that is not the primary use-case of this utility.
More information about the different uses for a file-format identification report can be found in my paper in the Code4Lib journal.
I wrote a blog describing the utility on the OPF website.
sumfolder1 is available on pypi and can be installed as follows:
pip install -U sumfolder1
sumfolder1 has the following usage instructions:
usage: sumfolder1.py [-h] [--csv CSV] [--demo] [--ref] [-v]
Calculate checksums for folders in a collection of objects using a DROID format
identification report
options:
-h, --help show this help message and exit
--csv CSV Single DROID CSV to read.
--demo Run demo queries and output a tree to demo.txt
--ref, --reference Write reference set to stdout.
-v, --version Return version information.
sumfolder1's demo output can be invoked as follows:
python sumfolder1 --demo
JSON describing a handful of queries generated using the reference collection will be output to stdout. A visualization of the collection tree will be output (for demo purposes) to stderr.
Given a DROID CSV the tool can be invoked as follows:
python sumfolder1 --csv <droid_csv_file>
A reference CSV can be output to stdout. Ideally it is redirected to a file using a command such as:
python sumfolder1 --ref > <output_file>
Previous work in this area:
- Check out direct-dedupe-1 from Stefana Breitwieser, which I was recently made aware of via the BitCurator Forum 2024. It provides a shell script to calculate checksums for sub-directories, a very pragmatic way to help users dedupe at the folder level.
This work is licensed under the GNU General Public License Version 3.