What is this?
This is the resurrection of an ancient project: a "mostly read-only file management tool". It is intended for keeping a large list of checksums in a database so that duplication, movement, and corruption of files can be detected. In addition to maintaining a single database, it also offers cross-database functionality.
We speak of "observations" to mean an association of a file path and its contents (or, at least, their cryptographic checksum). Most operations on the checksum database pertain to one or more observations.
Theory of Operation
This program is just a shim around a database; it does not interact with the
filesystem much itself. Instead, it should be used in composition with things
like find and the GNU coreutils digest programs (e.g. sha512sum), delegating
details of filesystem traversal, choice of hash, and so on to the user.
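To make this division of labour concrete, the following sketch produces a digest stream using only find and sha512sum. The scratch directory, file names, and variable names are invented for illustration, and the final cdb call is left commented out because it depends on a local installation.

```shell
# Scratch directory with two small files (names are illustrative only).
demo_dir=$(mktemp -d)
printf 'hello\n' > "$demo_dir/a.txt"
printf 'world\n' > "$demo_dir/b.txt"

# find does the traversal and sha512sum does the hashing. Each output line is
# "<128 hex digits>  <path>"; this is the digest stream that cdb reads.
digest_stream=$(find "$demo_dir" -type f -exec sha512sum {} +)
printf '%s\n' "$digest_stream"

# Handing the stream to the database would then look like this:
# printf '%s\n' "$digest_stream" | cdb addh

rm -r "$demo_dir"
```

Swapping in a different traversal (say, find with -newer) or a different digest program changes only the left-hand side of the pipe.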
Dependencies
This program requires:
- either the Lua 5.3 interpreter or luajit,
- the Lua argparse and penlight libraries, and
- lua-dbi and its lua-dbi-sqlite3 driver.
Supported Operations
To reduce clutter, many of the examples here rely on cdb's ability to pull
the default database from the $CDB environment variable. If that's not what
you want, add --db ${DB} to the invocation of cdb.
Initialize A Database
cdb init
Observe A Path
Add the checksum of a single path to the database. This will create a new checksum and/or a new path identifier as needed and will bind them together.
sha512sum $FILE | cdb addh
Or, for all files under a path:
find $DIR -type f -exec sha512sum {} \+ | cdb addh
If we have a pile of digest files already, each of which contains digests of
paths relative to its location, we can generate a database, ${DB2}, from
them with the assistance of the cdb-util digest-relativize tool (see
:ref:`below <cdb-util_drel>`):
find ${DIR} -type f -name SHA512SUMS -print0 | cdb-util drel -1 | cdb --db ${DB2} addh
Revalidate A Path Observation
Measure the checksum of a path and confirm that the database already held that observation. Reports unexpected files as well as mis-checksummed contents.
sha512sum $FILE | cdb verh
Or for all files under a path:
find $DIR -type f -exec sha512sum {} \; | cdb verh
This processing of digest streams is to be preferred to verifying a digest stream as generated by the database, e.g.:
cdb look \* | sha512sum -c
because the former can be more informative in the case of mismatching digests (specifically, the database can look for other paths that have the reported digest). If it's easier to have the database generate the set of files, that can be done:
cdb look \* --format '$u$z' --nul | xargs -0 sha512sum | cdb verh
Add Missing Checksums
We can augment a database of files by filtering a list of files we have to
exclude the list of files we know about. If, however, there is a possibility
that some of these files are duplicates of ones already in the database, you may
be better off using ingest reflexively.
Using filterpath
We can generate the list of files we don't know about using find and
cdb filterpath:
find ${DIR} -type f -print0 | \
  cdb filterpath -1 -P -p out -0 -f '$u$z' > ${DB}.new-files0
We can then script computing those files' checksums and adding the new reports to the database:
xargs -0 sha512sum > ${DB}.new < ${DB}.new-files0
cdb addh < ${DB}.new
Using diff
For a different approach, we can quickly construct a "just paths" database, which associates all paths with a single digest, from the current state of the file system as follows:
cdb --db ${JPDB} init
find ${DIR} -type f -printf "0 %p\\0" | cdb --db ${JPDB} addh --inul
This database may not seem very useful, but when combined with cdb diff we
can quickly find all paths whose checksums are unknown to the database:
cdb diff ${JPDB} --no-headers --flavor=path --which=super --format '$u$z' -0 > ${DB}.new-files0
And then proceed as above.
From Another Database
If we have another database that knows digests for our files, rather than
computing digests again, we can extract checksums from ${DB2} and install
them into ${DB}:
cdb --db ${DB2} look --inul < ${DB}.new-files0 | cdb --db ${DB} addh
Responding to File Moves
Armed with a "just paths" database as per the above, we can then direct the database to prune tracked paths not in the "just paths" database if the hashes are observed elsewhere:
cdb diff ${JPDB} --flavor=path --which=sub --no-headers --format '$u$z' --nul > ${JPDB}.missing-files0
cdb domv --inul < ${JPDB}.missing-files0
cdb gc > ${DB}.gc
sqlite3 ${DB} < ${DB}.gc
Find Duplicates
Given a path prefix (possibly empty), report all logged observations below that path of contents that exist in multiple locations (i.e., files with checksum collisions).
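As an illustration of what counts as a duplicate here, and independent of cdb itself, the sketch below finds files with identical digests directly in a raw digest stream. It is not how cdb implements this operation; it assumes GNU uniq (for the -w and -D options), and the scratch directory and file names are made up for the example.

```shell
# Scratch directory: two files share contents, one does not.
dup_dir=$(mktemp -d)
printf 'same\n' > "$dup_dir/one"
printf 'same\n' > "$dup_dir/two"
printf 'different\n' > "$dup_dir/three"

# A SHA-512 digest is 128 hex characters, so after sorting, comparing only the
# first 128 characters of each line groups entries by digest. -D prints every
# line that belongs to a group of two or more.
duplicates=$(find "$dup_dir" -type f -exec sha512sum {} + | sort | uniq -w 128 -D)
printf '%s\n' "$duplicates"

rm -r "$dup_dir"
```

Here the output lists the paths ending in one and two, and omits three.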
Remove Path
Cease to consider a particular path part of the database and remove all observations made of it. Since this application is primarily for data hoarders who tend not to delete things, one should prefer to :ref:`Respond to File Moves <Responding to File Moves>` rather than risk removing the last observation of a given hash.
Add Superseder
By Existing Paths
Indicate that some file contents are to be considered a lesser version of some other contents:
cdb addsuper /old/path /new/path
After this command is run, domv will be willing to remove the /old/path
entry from the database.
.. TODO
By Hashes
Superseder records can also be added from stdin using addsuperhash (or
addsh). This command reads in lines of the form
old-digest new-digest notes
The notes field extends to the end of the record; if newlines are desired in
the recorded notes, use --inul (-1) and separate records by NUL bytes.
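A record of this shape can be assembled from two files with standard tools. The sketch below assumes the fields are separated by single spaces, which the format line above suggests but does not state outright; the file names and note text are invented, and the cdb call is left commented out.

```shell
# Scratch files standing in for an old and a new version of some content.
work=$(mktemp -d)
printf 'draft\n' > "$work/old.txt"
printf 'final\n' > "$work/new.txt"

# Keep only the digest field of sha512sum's "<digest>  <name>" output.
old_digest=$(sha512sum < "$work/old.txt" | cut -d ' ' -f 1)
new_digest=$(sha512sum < "$work/new.txt" | cut -d ' ' -f 1)

# One record: old digest, new digest, then free-form notes to end of line.
record="$old_digest $new_digest replaced by the final revision"
printf '%s\n' "$record"

# Feeding it to the database would then look like this:
# printf '%s\n' "$record" | cdb addsh

rm -r "$work"
```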
Ingest
Given a digest stream, partition it into hashes already in the database and
hashes novel to the database. For the former, optionally generate rm
commands, and for the latter, optionally generate mv or cp commands
to import into the library. Novel hashes, and their new paths, may optionally
be recorded as well, to be subsequently added to the database:
find /source/path -type f -exec sha512sum {} \+ | \
  cdb ingest --target /new/path --prune
This will produce a stream of shell commands to copy files given by find
into the /new/path directory (using their basename therein). Passing
--move generates move rather than copy commands. Passing --prune
additionally issues rm commands for source files whose hashes collide with
something already in the database.
The --digest-log FILE option will cause ingest to write to FILE every
new digest encountered in the stream, associated with its new name in
/new/path. This can then be fed back through addhash without needing to
recompute digests.
ingest knows how to quote paths for safe handling by POSIX shells (though
its mechanism is somewhat crude and not always great for human consumption).
However, POSIX shells are willing to forgive control characters in quoted
strings while humans and terminals are more likely to make a mess of things.
The --escape {posix,extended,human} option will change how ingest quotes
such characters.
Reflexive Use of Ingest
The ingest command can also be used "reflexively" on the managed collection
of files to either add files that are not tracked or prune files that have
presence elsewhere in the database. We can enumerate files not tracked using
filterpath and compute their checksums as we did in Add Missing Checksums
above:
find ${DIR} -type f -print0 | \
  cdb filterpath --in-path --predicate=out -0 -1 --format '$u$z' | \
  xargs -0 sha512sum > ${DB}.new
We can then prepare to prune duplicates and add unique files:
cdb ingest --prune --inplace --digest-log ${DB}.new2 < ${DB}.new > ${DB}.prune
Add new files to the database with:
cdb addh < ${DB}.new2
Inspect the pruning commands to be run, and then execute them with:
sh < ${DB}.prune
(If you have, or might have, unusual path names, you may be better served with
--prune-log rather than --prune. The resulting NUL-terminated list
of files can be inspected with cdb-util escape human -0 and run with
xargs -0 -- rm --.)
Other Included Utilities
The cdb-util program contains utilities for manipulating digest streams and
may grow to include other tools not directly relevant to manipulating cdb
databases.
Digest Stream Utilities
digest-prefix
AKA dpre, this command filters a digest stream by adding a prefix to
relative paths within. For example, while
sha512sum *
generates a stream that uses relative names, both of these forms should produce absolute names:
sha512sum $PWD/*
sha512sum * | cdb-util dpre --prefix $PWD
digest-filter-exists
AKA dfex, this command filters a digest stream, limiting it to files that
exist. This may be useful if one is ingesting files in stages.
digest-relativize
AKA drel, this command is a "recursive digest-prefix": given a stream
of names of digest stream files on stdin, this utility opens each and
prefixes the paths therein by the path naming the stream. The various streams
involved can be made NUL-terminated rather than newline-terminated (with
escapes) with:
- The usual --nul (-0) continues to affect the output stream (stdout),
- The usual --inul (-1) continues to affect the input stream of digest file names (stdin),
- The new --fnul (-2) indicates that the digest files read internally are NUL-terminated records.
escape
AKA esc, this command maps input records in various ways to make them safe
for consumption by shells or similar. This tool is largely a test hook, but is
exposed in case it is useful. The --how parameter dictates the transform
in question:
- posix escapes strings such that they will be correctly interpreted by POSIX shells, using single quotes whenever possible (except when escaping single quotes, which get escaped with double quotes). This transform will leave non-printing characters in place, including newlines!
- extended escapes strings using $'\xHH' notation, understood by many *NIX shells. Non-printing and non-ASCII bytes are escaped, which can make this somewhat more invasive than might be desired.
- human tries to escape strings using a somewhat messy, but Unicode-aware policy, preserving non-ASCII graphemes where possible, especially when names don't include shell metacharacters.
- digest performs a GNU coreutils digest stream escaping.
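The posix policy described above can be approximated in a few lines of shell. This is a sketch of the quoting idea only, not cdb-util's implementation, and its output may differ in detail from what cdb-util emits; the function name posix_quote is hypothetical. Because it relies on command substitution, it drops trailing newlines from the input.

```shell
# Approximation of the "posix" quoting policy; not cdb-util's actual code.
posix_quote() {
    # Wrap the string in single quotes. Each embedded ' becomes '"'"':
    # close the single quotes, emit a double-quoted ', reopen single quotes.
    printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\"'\"'/g")"
}

name="it's a file"
quoted=$(posix_quote "$name")
printf '%s\n' "$quoted"

# Round trip: a POSIX shell should read the quoted form back as the original.
eval "roundtrip=$quoted"
printf '%s\n' "$roundtrip"
```

The round trip through eval is a quick way to check that a quoting scheme is safe for a given name before trusting generated rm or mv commands.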