Command-line client for the preservation asset tracking system (PATSy)
This project uses the "GitHub Flow" branching model (see https://confluence.umd.edu/display/LIB/GitHub+Flow+Model+for+Kubernetes+configuration+%28k8s-%29+repositories)
The "main" branch represents version 2 (v2) of the "patsy-db" codebase, which is incompatible with the "legacy" version 1 (v1) codebase.
Any work on the legacy patsy-db v1 application should be performed on the "main-v1" branch.
PATSy can be used with either a SQLite or Postgres database. The database to use is specified by the "--database" command-line argument, or via a "PATSY_DATABASE" environment variable.
The "--help" flag provides information about available commands and arguments:
$ patsy --help
The "--help" flag, coupled with a specific command shows additional information about that command:
$ patsy load --help
The "--database" argument is used to specify the database the command should run against.
For SQLite databases, the argument is just the filename of the SQLite database file. For example, to run against a SQLite database in a "patsy-db.sqlite" file:
$ patsy --database patsy-db.sqlite <COMMAND>
where <COMMAND> is the PATSy command to run.
For Postgres databases, the argument is a database connection URL with the following format:
postgresql+psycopg2://<USER>:<PASSWORD>@<ADDRESS>:<PORT>/<DATABASE>
where:
- <USER> - the name of the database user
- <PASSWORD> - the database password for the user
- <ADDRESS> - the hostname/IP address of the database
- <PORT> - the port number (typically "5432" for Postgres)
- <DATABASE> - the name of the database to connect to, typically "patsy"
Note: The database specified in <DATABASE> must exist.
For example, to connect to a Postgres database named "patsy" on "localhost" with a username of "postgres", and a password of "password":
$ patsy --database postgresql+psycopg2://postgres:password@localhost:5432/patsy <COMMAND>
where <COMMAND> is the PATSy command to run.
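These connection URLs are standard SQLAlchemy URLs, so the same string can also be assembled programmatically with SQLAlchemy's "URL" helper. This is just a convenience sketch for double-checking the pieces, not something PATSy requires:

from sqlalchemy.engine import URL

# Assemble the Postgres connection URL from its parts. The rendered
# string is what gets passed to the "--database" argument.
url = URL.create(
    drivername="postgresql+psycopg2",
    username="postgres",
    password="password",
    host="localhost",
    port=5432,
    database="patsy",
)
print(url.render_as_string(hide_password=False))
# postgresql+psycopg2://postgres:password@localhost:5432/patsy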
The "--database" argument can be omitted if a "PATSY_DATABASE" environment variable has been defined.
For SQLite, simply specify the filename of the SQLite file. For example, for "patsy-db.sqlite":
$ export PATSY_DATABASE=patsy-db.sqlite
For Postgres, use the database connection URL. For example, using the Postgres database connection URL for a local Docker container:
$ export PATSY_DATABASE=postgresql+psycopg2://postgres:password@localhost:5432/patsy
The "--database" argument can still be passed in to temporarily override the environment variable.
The "load" command loads an "inventory" CSV file into the database.
$ patsy --database <DATABASE> load <INVENTORY_CSV_FILE>
The "inventory" CSV file is typically generated by the "preserve" (https://github.com/umd-lib/preserve) or "aws-archiver" (https://github.com/umd-lib/aws-archiver) tools.
Each row of the CSV file contains information about the batch, accession, and (optionally) location information about a preservation asset.
An accession is uniquely identified by its batch and relative path (relpath) fields. If an accession already exists, the "load" command will NOT update it (even if the accession information in the CSV file is different). Location information for an existing accession will be added, unless the location already exists for that accession.
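In outline, the rule applied to each CSV row is as follows (a runnable sketch of the dedup rule only; the column names "batch", "relpath", and "location" are illustrative, and PATSy stores these records in SQL, not in dictionaries):

def load_rows(rows):
    accessions = {}  # (batch, relpath) -> set of locations
    for row in rows:
        key = (row["batch"], row["relpath"])
        # setdefault: an accession that already exists is left untouched
        locations = accessions.setdefault(key, set())
        if row.get("location"):
            # adding a location that is already present is a no-op (set semantics)
            locations.add(row["location"])
    return accessions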
The "checksum" command retrieves checksums (MD5 by default, or SHA1 or SHA256) for one or more accessions, looked up by storage location.
$ patsy --database <DATABASE> checksum [--md5|--sha1|--sha256] [LOCATION [LOCATIONS...]]
Creates output like this:
088be3fe9a8fd2a7e70e66a602828766 libdc-archivebucket-17lowbw7m2av1/Archive000Florence/Florence.mpg
fe84e91e0a06906773a5c19a2e9620d9 libdc-archivebucket-17lowbw7m2av1/Archive000Football1/19461130-FB-002-2Qtr.mpg
9876f8c92e16b73c662a39b23409d0a0 libdc-archivebucket-17lowbw7m2av1/Archive000Football1/19461130-FB-003-2Half.mpg
Instead of listing locations on the command line, the checksum command also accepts a CSV file with columns "location" and "destination". If the "destination" is present, it is used for the second column. Assuming that the "destination" refers to an actual path on a local file system, this output can then be fed to "md5sum -c" (or another algorithm-appropriate checksum verification tool).
$ patsy --database <DATABASE> checksum [--md5|--sha1|--sha256] --file <CSV_FILE>
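Such a manifest can be produced with any CSV tool; a minimal Python sketch (the "destination" path below is hypothetical, and the location is taken from the sample output above):

import csv

# Write a manifest for "patsy checksum --file": "location" is the stored
# location in PATSy, "destination" the local path to use in the output.
rows = [
    {"location": "libdc-archivebucket-17lowbw7m2av1/Archive000Florence/Florence.mpg",
     "destination": "/mnt/restores/Florence.mpg"},  # hypothetical local path
]
with open("manifest.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["location", "destination"])
    writer.writeheader()
    writer.writerows(rows)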
The "sync" command stores the locations of accessions held in ApTrust in PATSy.
$ patsy --database <DATABASE> sync --name <API-NAME> --key <API-KEY> [--timebefore|--timeafter] <YEAR-MONTH-DAY>
The <API-NAME> and <API-KEY> are the X-Pharos-API-User and X-Pharos-API-Key, respectively, and are required by the command. They can be provided on the command line, or set via environment variables:
$ export X_PHAROS_NAME=your-email
$ export X_PHAROS_KEY=the-secret-key
Parameters provided on the command line take precedence over the environment variables.
The "timebefore" and "timeafter" parameters are dates provided to specify what bags to access from ApTrust. The dates should be formatted in "year-month-day" format (####-##-##).
The "update" command updates a field in accession records, based on values in a CSV file.
$ patsy --database <DATABASE> update --batch <BATCH> \
--db-compare-column <DB_COMPARE_COLUMN> --db-target-column <DB_TARGET_COLUMN> \
--csv-compare-value <CSV_COMPARE_VALUE> --csv-update-value <CSV_UPDATE_VALUE> \
<CSV_FILE>
where (in addition to the "<DATABASE>" field described above):
- <BATCH> - the name of the batch to update
- <DB_COMPARE_COLUMN> - the "accessions" database table column to compare values in the CSV file against
- <DB_TARGET_COLUMN> - the "accessions" database table column to update
- <CSV_COMPARE_VALUE> - the column in the CSV file containing the comparison values
- <CSV_UPDATE_VALUE> - the column in the CSV file containing the update values
- <CSV_FILE> - a CSV file containing columns labelled <CSV_COMPARE_VALUE> and <CSV_UPDATE_VALUE> containing the comparison and updated values. An example is the output of the "partition" command from the "preserve" tool.
All fields are required.
There are three additional optional flags:
--dry-run - runs the command, but does not update the database
--skip-existing - only updates the database "target" field if the current value is NULL or the empty string
--verbose - provides additional information about individual updates
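Putting the pieces together, the update rule amounts to the following sketch (the "find_accession" lookup helper and ORM session handling are hypothetical, not PATSy's actual code):

import csv

def update_batch(session, batch, db_compare_column, db_target_column,
                 csv_compare_value, csv_update_value, csv_file,
                 dry_run=False, skip_existing=False, verbose=False):
    with open(csv_file, newline="") as fh:
        for row in csv.DictReader(fh):
            # Hypothetical helper: find the accession in <batch> whose
            # <db_compare_column> matches the CSV's <csv_compare_value> column.
            accession = find_accession(session, batch, db_compare_column,
                                       row[csv_compare_value])
            if accession is None:
                continue
            current = getattr(accession, db_target_column)
            if skip_existing and current not in (None, ""):
                continue  # --skip-existing: leave non-empty values alone
            if verbose:
                print(f"{row[csv_compare_value]}: {current!r} -> {row[csv_update_value]!r}")
            if not dry_run:  # --dry-run: report without writing
                setattr(accession, db_target_column, row[csv_update_value])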
One use case for "update" is updating the "relpath" of accessions, based on the CSV file generated by the "partition" command in the "preserve" toolkit (https://github.com/umd-lib/preserve).
The "partition" command produces a CSV file similar to the following:
relpath_old,relpath_new
sample/foo.pdf,partitioned/foo.pdf
...
Assuming that the CSV file generated by the "partition" command is named "partition-updates.csv", to update entries in the "EXAMPLE_BATCH" batch, the "update" command would be:
$ patsy update --batch EXAMPLE_BATCH \
--db-compare-column relpath --db-target-column relpath \
--csv-compare-value relpath_old --csv-update-value relpath_new \
partition-updates.csv
Note that in this case, the "--db-compare-column" and "--db-target-column" parameters are the same ("relpath", as that is the name of the column in the "accessions" database table). The "--csv-compare-value" and "--csv-update-value" parameters use the header field names from the CSV file.
Another use case for "update" is to update accessions that only have an "md5" hash value with additional hash values such as "sha256".
In this case, a CSV file such as the following might be created:
md5sum,sha256sum
<MD5SUM_VALUE>,<SHA256SUM_VALUE>
...
Assuming that the CSV file is named "update_sha256_based_on_md5.csv", to update entries in the "EXAMPLE_BATCH" batch, the "update" command would be:
$ patsy update --batch EXAMPLE_BATCH \
--db-compare-column md5 --db-target-column sha256 \
--csv-compare-value md5sum --csv-update-value sha256sum \
--skip-existing \
update_sha256_based_on_md5.csv
Note that in this case, the "--db-compare-column" and "--db-target-column" parameters are different (the "md5" column in the "accessions" database table is matched against the "md5sum" value in the CSV file, and the "sha256" column in "accessions" is updated with the "sha256sum" value from the CSV file). The "--skip-existing" flag ensures that existing "sha256" values in the database will not be updated.
PATSy uses the "Alembic" (https://alembic.sqlalchemy.org/en/latest/) migration tool to handle updates to the database schema.
Specifying the database must be done either by:
- Setting the "PATSY_DATABASE" environment variable, or
- Running the "alembic" command with the "-x database=<DATABASE>" flag, with <DATABASE> being the same value passed to the "--database" argument for PATSy commands. For example:
$ alembic -x database=postgresql+psycopg2://postgres:password@localhost:5432/patsy upgrade head
The alembic command will fail with a "patsy.database.DatabaseNotSetError" error if the database has not been set.
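A sketch of how such a hook typically looks in "alembic/env.py" (the actual PATSy code may differ):

import os
from alembic import context

# Read "-x database=..." from the alembic command line, falling back to
# the PATSY_DATABASE environment variable.
x_args = context.get_x_argument(as_dictionary=True)
database = x_args.get("database") or os.environ.get("PATSY_DATABASE")
if database is None:
    raise RuntimeError("database not set")  # PATSy raises DatabaseNotSetError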
To run the database migrations, or to set up the database for the first time:
$ alembic upgrade head
To create a new (empty) migration:
$ alembic revision -m "<MIGRATION_DESCRIPTION>"
where <MIGRATION_DESCRIPTION> is a short description of the migration, such as "create local types table". This description will be included as part of the migration filename.
Alembic can attempt to "auto-generate" migrations, although this has significant limitations (see https://alembic.sqlalchemy.org/en/latest/autogenerate.html). To do this, first modify the "declarative base" in "patsy/model.py", and then run the following command:
$ alembic revision --autogenerate -m "<MIGRATION_DESCRIPTION>"
where <MIGRATION_DESCRIPTION> is a short description of the migration, such as "create local types table". This description will be included as part of the migration filename.
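Autogenerate works by diffing the declarative models against the live database schema, so the starting point is a model change in "patsy/model.py". A self-contained illustration (the "sha512" column is hypothetical, added only to show the kind of change autogenerate detects):

from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Accession(Base):
    __tablename__ = "accessions"
    id = Column(Integer, primary_key=True)
    relpath = Column(String)
    sha512 = Column(String)  # hypothetical new column for autogenerate to detect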
As discussed in https://alembic.sqlalchemy.org/en/latest/batch.html#running-batch-migrations-for-sqlite-and-other-databases, SQLite does not support the "ALTER" statement for changing the database schema.
In PATSy, the Alembic "autogenerate" functionality has been configured to create "batch" migrations, which use a "move and copy" workflow for SQLite, where:
the existing table structure is reflected from the database, a new version of this table is created with the given changes, data is copied from the old table to the new table using “INSERT from SELECT”, and finally the old table is dropped and the new one renamed to the original name.
This generally works, except when a table is used by a SQL View. For example, the "locations" table is used by the "patsy_records" view. In these cases, the Postgres migration will succeed, but a SQLite migration (including those used by the tests) will fail with an error such as:
sqlite3.OperationalError: error in view patsy_records: no such table: main.locations
To enable these migrations to work with SQLite, the affected View must be dropped prior to the migration, and restored after the migration:
- In the "upgrade" method:
a) At the start of the method, drop the view, i.e.:
op.execute("DROP VIEW IF EXISTS patsy_records")
b) Perform the migration (typically the "with op.batch_alter_table(...) as ...:" block that was autogenerated)
c) At the end of the method, restore the view. This will typically require retrieving the "ReplaceableObject" from the revision where the view was last created/updated. For example, for the "patsy_records_view" ReplaceableObject that was last modified in the "f693a44bd7fe" revision, the code would be:
# Restore the "patsy_records" view
prior_view_module = op.get_context().script.get_revision('f693a44bd7fe').module
obj = getattr(prior_view_module, 'patsy_records_view')
batch_op.create_view(obj)
- In the "downgrade" method, the same code changes as above are needed, i.e., drop the view, perform the migration, then restore the view.
An example of these changes is in alembic/versions/1e2e3bd85d6d_libitd_2254_add_locations_storage_.py
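For background, "ReplaceableObject" and the "create_view" operation follow the "replaceable objects" recipe from the Alembic cookbook (https://alembic.sqlalchemy.org/en/latest/cookbook.html#replaceable-objects). PATSy's actual implementation may differ, but a simplified outline of the recipe is:

from alembic.operations import MigrateOperation, Operations

class ReplaceableObject:
    # A named piece of SQL, e.g. the SELECT body of the "patsy_records" view.
    def __init__(self, name, sqltext):
        self.name = name
        self.sqltext = sqltext

@Operations.register_operation("create_view")
class CreateViewOp(MigrateOperation):
    def __init__(self, target):
        self.target = target

    @classmethod
    def create_view(cls, operations, target, **kw):
        # Front end that makes op.create_view(obj) available in migrations.
        return operations.invoke(cls(target))

@Operations.implementation_for(CreateViewOp)
def create_view(operations, operation):
    operations.execute(
        "CREATE VIEW %s AS %s" % (operation.target.name, operation.target.sqltext)
    )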
See the LICENSE file for license rights and limitations.