csdb

A data hoarder's checksum database tool

Primary language: Lua

What is this?

This resurrects an ancient project: a "mostly read-only file management tool". It is intended for keeping a large list of checksums in a database so that duplication, movement, and corruption of files can be detected. In addition to maintaining a single database, it also offers cross-database functionality.

We speak of "observations" to mean an association of a file path and its contents (or, at least, their cryptographic checksum). Most operations on the checksum database pertain to one or more observations.

Theory of Operation

This program is just a shim around a database; it does not interact with the filesystem much itself. Instead, it should be used in composition with things like find and the GNU coreutils digest programs (e.g. sha512sum), delegating details of filesystem traversal and choice of hash and so on to the user.
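Concretely, the streams exchanged between those programs and cdb are in the GNU coreutils digest format: one record per file, consisting of a hex digest, two spaces, and a path. A minimal sketch of producing such a stream (the file names here are illustrative):

```shell
# Build a tiny tree and emit a digest stream in the format cdb consumes:
# <128 hex chars><two spaces><path>, one record per line.
mkdir -p demo/sub
printf 'hello\n' > demo/a.txt
printf 'world\n' > demo/sub/b.txt
find demo -type f -exec sha512sum {} + | sort -k2
```

A stream like this is what subsequent examples pipe into cdb subcommands such as addh.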

Dependencies

This program requires...

  • either the Lua 5.3 interpreter or luajit,
  • the Lua argparse and penlight libraries, and
  • lua-dbi and its lua-dbi-sqlite3 driver.

Supported Operations

To reduce clutter, many of the examples here rely on cdb's ability to pull the default database from the $CDB environment variable. If that's not what you want, add --db ${DB} to the invocation of cdb.

Initialize A Database

cdb init

Observe A Path

Add the checksum of a single path to the database. This will create a new checksum and/or a new path identifier as needed and will bind them together.

sha512sum $FILE | cdb addh

Or, for all files under a path:

find $DIR -type f -exec sha512sum {} \+ | cdb addh

If we already have a pile of digest files, each of which contains digests of paths relative to its location, we can generate a database, ${DB2}, from them with the assistance of the cdb-util digest-relativize tool (see digest-relativize below):

find ${DIR} -type f -name SHA512SUMS -print0 | cdb-util drel -1 | cdb addh

Revalidate A Path Observation

Measure the checksum of a path and confirm that the database already held that observation. Reports unexpected files as well as mis-checksummed contents.

sha512sum $FILE | cdb verh

Or for all files under a path:

find $DIR -type f -exec sha512sum {} \; | cdb verh

Processing digest streams this way is preferable to verifying a digest stream generated by the database, e.g.:

cdb look \* | sha512sum -c

because the former can be more informative in the case of mismatching digests (specifically, the database can look for other paths that have the reported digest). If it's easier to have the database generate the set of files, that can be done:

cdb look \* --format '$u$z' --nul | xargs -0 sha512sum | cdb verh
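The --nul / xargs -0 handoff above is what keeps paths containing whitespace (or even newlines) intact across the pipe. The same pattern with plain coreutils, using an illustrative file name:

```shell
# NUL-separated paths survive xargs intact, even with spaces in names.
mkdir -p odd
printf 'x\n' > 'odd/name with spaces.txt'
find odd -type f -print0 | xargs -0 sha512sum
```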

Add Missing Checksums

We can augment a database of files by filtering a list of files we have to exclude the list of files we know about. If, however, there is a possibility that some of these files are duplicates of ones already in the database, you may be better off using ingest reflexively.

Using filterpath

We can generate the list of files we don't know about using find and cdb filterpath:

find ${DIR} -type f -print0 | \
  cdb filterpath -1 -P -p out -0 -f '$u$z' > ${DB}.new-files0

We can then script computing those files' checksums and adding the new reports to the database:

xargs -0 sha512sum > ${DB}.new < ${DB}.new-files0
cdb addh < ${DB}.new

Using diff

For a different approach, we can quickly construct a "just paths" database, which associates all paths with a single digest, from the current state of the file system as follows:

cdb --db ${JPDB} init
find ${DIR} -type f -printf "0  %p\\0" | cdb --db ${JPDB} addh --inul
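To see what that find invocation emits, swap the NUL terminators for newlines: every path is paired with the constant digest 0 (the directory name here is illustrative; -printf is the GNU find extension already used above):

```shell
# Inspect the fake "just paths" digest stream: every file gets digest 0.
mkdir -p jp/sub
touch jp/f1 jp/sub/f2
find jp -type f -printf "0  %p\0" | tr '\0' '\n' | sort
```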

This database may not seem very useful, but combined with cdb diff we can quickly find all paths whose checksums are unknown to the database:

cdb diff ${JPDB} --no-headers --flavor=path --which=super --format '$u$z' -0 > ${DB}.new-files0

And then proceed as above.

From Another Database

If we have another database that knows digests for our files, rather than computing digests again, we can extract checksums from ${DB2} and install them into ${DB}:

cdb --db ${DB2} look --inul < ${DB}.new-files0 | cdb --db ${DB} addh

Responding to File Moves

Armed with a "just paths" database as per the above, we can then direct the database to prune tracked paths not in the "just paths" database if the hashes are observed elsewhere:

cdb diff ${JPDB} --flavor=path --which=sub --no-headers --format '$u$z' --nul > ${JPDB}.missing-files0
cdb domv --inul < ${JPDB}.missing-files0
cdb gc > ${DB}.gc
sqlite3 ${DB} < ${DB}.gc

Find Duplicates

Given a path prefix (possibly empty), report all logged observations below that path of contents that exist in multiple locations (i.e., distinct paths whose contents share a checksum).

Remove Path

Cease to consider a particular path part of the database and remove all observations made of it. Since this application is primarily for data hoarders, who tend not to delete things, one should prefer Responding to File Moves (above) rather than risk removing the last observation of a given hash.

Add Superseder

By Existing Paths

Indicate that some file contents are to be considered a lesser version of some other contents:

cdb addsuper /old/path /new/path

After this command is run, domv will be willing to remove the /old/path entry from the database.

By Hashes

Superseder records can also be added from stdin using addsuperhash (or addsh). This command reads in lines of the form

old-digest new-digest notes

The notes field extends to the end of the record; if newlines are desired in the recorded notes, use --inul (-1) and separate records by NUL bytes.
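For example, a record could be assembled from freshly computed digests before being fed to addsh (the sample contents and note text here are illustrative):

```shell
# Build a superseder record: old digest, new digest, free-form note.
old=$(printf 'draft cut' | sha512sum | awk '{print $1}')
new=$(printf 'final cut' | sha512sum | awk '{print $1}')
printf '%s %s %s\n' "$old" "$new" "replaced by final cut" > records.txt
# Then: cdb addsh < records.txt
```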

Ingest

Given a digest stream, partition it into hashes already in the database and hashes novel to the database. For the former, optionally generate rm commands, and for the latter, optionally generate mv or cp commands to import into the library. Novel hashes, and their new paths, may optionally be recorded as well, to be subsequently added to the database:

find /source/path -type f -exec sha512sum {} \+ | \
  cdb ingest --target /new/path --prune

This will produce a stream of shell commands to copy files given by find into the /new/path directory (using their basename therein). Passing --move generates move rather than copy commands. Passing --prune additionally issues rm commands for source files whose hashes collide with something already in the database.

The --digest-log FILE option will cause ingest to write to FILE every new digest encountered in the stream, associated with its new name in /new/path. This can then be fed back through addhash without needing to recompute digests.

ingest knows how to quote paths for safe handling by POSIX shells (though its mechanism is somewhat crude and not always great for human consumption). However, POSIX shells are willing to forgive control characters in quoted strings while humans and terminals are more likely to make a mess of things. The --escape {posix,extended,human} option will change how ingest quotes such characters.

Reflexive Use of Ingest

The ingest command can also be used "reflexively" on the managed collection of files, either to add files that are not yet tracked or to prune files whose contents are present elsewhere in the database. We can enumerate untracked files using filterpath and compute their checksums as we did in Add Missing Checksums above:

find ${DIR} -type f -print0 | \
cdb filterpath --in-path --predicate=out -0 -1 --format '$u$z' | \
xargs -0 sha512sum > ${DB}.new

We can then prepare to prune duplicates and add unique files:

cdb ingest --prune --inplace --digest-log ${DB}.new2 < ${DB}.new > ${DB}.prune

Add new files to the database with:

cdb addh < ${DB}.new2

Inspect the pruning commands to be run, and then execute them with:

sh < ${DB}.prune

(If you have, or might have, unusual path names, you may be better served with --prune-log rather than --prune. The resulting NUL-terminated list of files can be inspected with cdb-util escape human -0 and run with xargs -0 -- rm --.)

Other Included Utilities

The cdb-util program contains utilities for manipulating digest streams and may grow to include other tools not directly relevant to manipulating cdb databases.

Digest Stream Utilities

digest-prefix

AKA dpre, this command transforms a digest stream by adding a prefix to the relative paths within. For example, while

sha512sum *

generates a stream that uses relative names, both of these forms should produce absolute names:

sha512sum $PWD/*
sha512sum * | cdb-util dpre --prefix $PWD
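For simple names (no escapes or embedded newlines), the effect of dpre on each record can be approximated with sed, splicing the prefix in after the two-space separator. This is a rough sketch of the transformation, not a substitute for the real tool:

```shell
# Approximate dpre on well-behaved names: insert a prefix after the
# "digest  " separator of each record.
mkdir -p pre
printf 'x\n' > pre/f
( cd pre && sha512sum f ) | sed "s|  |  $PWD/pre/|"
```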

digest-filter-exists

AKA dfex, this command filters a digest stream, limiting it to files that exist. This may be useful if one is ingesting files in stages.

digest-relativize

AKA drel, this command is a "recursive digest-prefix": given a stream of names of digest stream files on stdin, this utility opens each and prefixes the paths therein by the path naming the stream. The various streams involved can be made NUL-terminated rather than newline terminated (with escapes) with:

  • The usual --nul (-0) continues to affect the output stream (stdout),
  • The usual --inul (-1) continues to affect the input stream of digest file names (stdin),
  • The new --fnul (-2) indicates that the digest files read internally are NUL-terminated records.

escape

AKA esc, this command maps input records in various ways to make them safe for consumption by shells or similar. This tool is largely a test hook, but is exposed in case it is useful. The --how parameter dictates the transform in question:

  • posix escapes strings such that they will be correctly interpreted by POSIX shells, using single quotes whenever possible (except when escaping single quotes, which get escaped with double quotes). This transform will leave non-printing characters in place, including newlines!
  • extended escapes strings using $'\xHH' notation, understood by many *NIX shells. Non-printing and non-ASCII bytes are escaped, which can make this somewhat more invasive than might be desired.
  • human tries to escape strings using a somewhat messy, but Unicode-aware policy, preserving non-ASCII graphemes where possible, especially when names don't include shell metacharacters.
  • digest performs a GNU coreutils digest stream escaping.
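The posix policy described above can be sketched in plain shell: wrap the string in single quotes, and carry any embedded single quote through a short double-quoted stretch. This is a rough illustration of the described policy, not cdb-util's actual code:

```shell
# Quote a name for POSIX shells: single quotes throughout, with embedded
# single quotes carried in double quotes ( ' becomes '"'"' ).
name="it's here.txt"
quoted="'$(printf '%s' "$name" | sed "s/'/'\"'\"'/g")'"
printf '%s\n' "$quoted"
eval "set -- $quoted"
printf '%s\n' "$1"   # round-trips to the original name
```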