sumfolder1: A Python repository from ross-spencer

sumfolder1 is a utility for use within the archival and digital preservation community to generate checksums for file system directories, and to generate an overall "collection" checksum for a given set of files.

Why?
How?
DROID
Installation
Usage
Previous work
License

Why?

Conventionally, checksums exist for files, they do not exist for directories. They have no payload that can be summed together to calculate a digest/checksum.

If it were possible to create checksums for folders or a global checksum for a collection of objects, it would become possible to ask the following:

Archival questions

What is the collection checksum for a given set of files and folders?
What is the checksum for a given folder?
Given a collection of objects online, am I looking at an authentic listing?
Have I downloaded a collection in its entirety?

Structural questions

Is file/folder hash(x) included in the collection set?
Is file/folder hash(y) (non-existent) part of the entire set?
Is file hash(x) part of folder(y) where the collection has arbitrary depth?
Where are duplicate checksums located within a collection?

Forensics questions

Has a digital object been removed from the collection?
Did the collection contain at least one empty directory?

How?

Given a set of file paths and existing checksums it is possible to compute a checksum for a folder by creating a checksum of the given checksums.

Given checksum 1) 7c1f9f9a4d0ce9a72ee63f37a1b7f694 and checksum 2) aececec0bc3f515039aec9e60c413cd3 an MD5 can be computed as: 82f9e9a4305714fffdd7932783980cbc.

We can see this illustrated for a small collection as follows:

📁 folder_1 82f9e9a4305714fffdd7932783980cbc
    📄 checksum_1 7c1f9f9a4d0ce9a72ee63f37a1b7f694
    📄 checksum_2 aececec0bc3f515039aec9e60c413cd3

If we follow this approach through an entire directory structure we can create checksums for all sub-directories and for the collection as a whole.

Reference set

A reference set is provided with this repository: reference set.

We can iterate through the directory tree to create sets of directory checksums and a collection checksum: 52b94608dc70813aa88dae01176dc73b.

The reference set then looks as follows:

📁 collection 93778c524035d5d3e429a2fe43b7700a
   📄 file_0001 14118ff9ad0344decb37960809b2f17a
   📄 file_0000 8cfda2609b880a553759cd6200823f3b
   📄 file_0002 a4501ee1a5c711ea9db78a800a24e830
   📁 sub_dir_1 82301616d7e24f474dbe21de93af0a34
      📄 file_empty d41d8cd98f00b204e9800998ecf8427e
      📄 file_0003 dc7f828c5fe622925181d06edada350f
      📄 file_0004 e3d90a4bf14a9b355f0e69ba08df522d
      📁 sub_1_dir_1 1c7ba27edf1356d097a3f568032430c2
         📄 file_0005 637a3fb7da1ab61d10e96336d9758416
   📁 sub_dir_2 1ccb49edc4e873f1a8affd4bad5e9b90
   📁 sub_dir_3 2a60541cede91a36e9dc5bab7a97dd6e
      📁 sub_3_empty_1 db9d848b4f83ff3cb3faa4df0a59e3e1
         📁 sub_3_empty_2 1ccb49edc4e873f1a8affd4bad5e9b90
   📁 sub_dir_4 272d45767d534335163f220c1d40e559
      📄 file_0006 2b43227486ec8744cd5d4c955d269743
      📄 file_0007 c5a1973a70e08bf1eee13b8090f790ad
      📄 file_0008 fdffe4dd2d39c7d9986dbf5c6ec5ad39
   📁 sub_dir_5 d818d29b75f89a9b5d8d1c5a4c70dbbb
      📁 sub_5_dir_1 82f9e9a4305714fffdd7932783980cbc
         📄 file_0009 7c1f9f9a4d0ce9a72ee63f37a1b7f694
         📄 file_0010 aececec0bc3f515039aec9e60c413cd3
   📁 sub_dir_6 74be16979710d4c4e7c6647856088456
      📄 file_empty d41d8cd98f00b204e9800998ecf8427e

Reference implementation

The reference implementation for sumfolder1 does the following:

From the lowest sub-directory in the tree:

Check for sub-directories and add the checksums for these to a hash digest in alphabetical order by checksum.
For files in the directory add these to the hash digest in alphabetical order by checksum.
Create a digest for the list of checksums.

Repeat, processing each folder backwards up to the top level.

NB. If a folder is completely empty it is assigned a constant value chosen in the code: 2600_EMPTY_DIRECTORY. This evaluates to an MD5 value of 1ccb49edc4e873f1a8affd4bad5e9b90.

Merkle trees

The concept I have used here is based on Merkle trees and a loose understanding of techniques used in the block-chain and in the source control system GitHub.

A good video summary of Merkle trees can be found on YouTube:

Gaurav Sen on Merkle Trees

And a Python tutorial I found useful in starting this work:

Dan Nolan on Merkle Trees

The technique required for a directory tree is a little more convoluted than that of a Merkle tree which uses binary nodes and evaluates checksums from left to right. I believe the implementation used for sumfolder1 is more closely aligned to that of a "Radix Tree" or "Patricia Tree", however, this is to be explored more.

NB. A merkle tree can be used in its context for performance; sumfolder1 does not yet have a performance use-case.

Terminology

The reference implementation introduces some terminology that helps with understanding the approach:

Active-tree: the side of a directory tree that we're querying about a given hash.
Non-active-tree: the tree at root node (Rn+1) that do not contain the digital object that we're querying.
Root-node (Rn): the name of the top-level node, i.e. collection folder. This is either artificially created for a set of directories all at the same level, or exists as a function of the given collection set.

New folder attributes

Folder objects need to be given additional attributes to enable the algorithm to work.

Folder-depth, so directories can be grouped and distinguished from one-another by level in the hierarchy.
Hash, the goal of this tool is to enable a hash to be calculated for an entire collection.

Sensitivity

I am trying to make this code as portable as possible, i.e. while it works with DROID-style reports today, it might also work with other checksum-based outputs tomorrow. Additionally, to be able to compare folder structures, this utility may also work with DROID-style reports later on in a transfer workflow; at which point, folders and files may have been renamed, but their content remains consistent.

To calculate a single folder checksum we currently do the following:

If there are folders in the directory, order their hashes alphabetically and add to a list.
File checksums are then ordered alphabetically and added to the end of the list.
The checksums are then summed together to create a new folder-level checksum.

DROID

sumfolder1 uses the DROID format identification report to generate folder level checksums.

DROID can be found at The National Archives UK website:

DROID @ The National Archives

DROID in Siegfried

sumfolder1 can also be used with DROID compatible reports created by Siegfried using a command such as follows:

sf --hash=md5 --droid <collection_folder>

DROID as an inspiration

File format reports provide a means of statically analyzing collections of digital objects. A DROID report satisfies the pre-conditions required to create reliable folder- and collection-level checksums for digital collections:

A collection is static, i.e. unlikely to change.
Digital objects within the collection have checksums.

NB: A collection need not be static to be analyzed but it is not the primary use-case of this utility.

More information about the different uses for a file-format identification report can be found in my paper in the Code4Lib journal.

Fractal in detail: What information is in a file format identification report?

Writing about sumfolder1

I wrote a blog describing the utility on the OPF website.

What is the checksum of a directory?

Installation

sumfolder1 is available on pypi and can be installed as follows:

pip install -U sumfolder1

Usage

sumfolder1 has the following usage instructions:

usage: sumfolder1.py [-h] [--csv CSV] [--demo] [--ref] [-v]

Calculate checksums for folders in a collection of objects using a DROID format
identification report

options:
  -h, --help          show this help message and exit
  --csv CSV           Single DROID CSV to read.
  --demo              Run demo queries and output a tree to demo.txt
  --ref, --reference  Write reference set to stdout.
  -v, --version       Return version information.

Demo output

sumfolder1's demo output can be invoked as follows:

python sumfolder1 --demo

JSON will be output to stdout describing a handful of queries generated using the reference collection.

An visualization of the collection tree will be output (for demo purposes) to stderr.

Use with a DROID csv

Given a DROID csv the tool can be invoked as follows:

python sumfolder1 --csv <droid_csv_file>

Outputting the reference CSV

A reference CSV can be output to stdout. Ideally it is piped to some other file using a command such as follows:

python sumfolder1 --ref > <output_file>

Previous work

Previous work in this area.

Check out direct-dedupe-1 from Stefana Breitwieser which I was recently made aware of via the BitCurator Forum 2024 and provides a shell script to calculate checksums for sub-directories providing a very pragmatic way to help users dedupe at the folder level.

License

This work is license using: GNU GENERAL PUBLIC LICENSE Version 3.

ross-spencer/sumfolder1