
patsy

Command-line client for the preservation asset tracking system (PATSy)

Project Branches

This project uses the "GitHub Flow" branching model (see https://confluence.umd.edu/display/LIB/GitHub+Flow+Model+for+Kubernetes+configuration+%28k8s-%29+repositories)

The "main" branch represents version 2 (v2) of the "patsy-db" codebase, which is incompatible with the "legacy" version 1 (v1) codebase.

Any work on the legacy patsy-db v1 application should be performed on the "main-v1" branch.

Development Setup

See docs/DevelopmentSetup.md.

Database Setup

PATSy can be used with either a SQLite or Postgres database. The database to use is specified by the "--database" command-line argument, or via a "PATSY_DATABASE" environment variable.

PATSy Commands

The "--help" flag provides information about available commands and arguments:

$ patsy --help

The "--help" flag, coupled with a specific command shows additional information about that command:

$ patsy load --help

Common Arguments

"database" argument

The "--database" argument is used to specify the database the command should run against.

SQLite

For SQLite databases, the argument is just the filename of the SQLite database file. For example, to run against a SQLite database in a "patsy-db.sqlite" file:

$ patsy --database patsy-db.sqlite <COMMAND>

where <COMMAND> is the PATSy command to run.

Postgres

For Postgres databases, the argument is a database connection URL with the following format:

postgresql+psycopg2://<USER>:<PASSWORD>@<ADDRESS>:<PORT>/<DATABASE>

where:

  • <USER> - the name of the database user
  • <PASSWORD> - the database password for the user
  • <ADDRESS> - the hostname/IP address of the database
  • <PORT> - the port number (typically "5432" for Postgres)
  • <DATABASE> - the name of the database to connect to, typically "patsy"

Note: The database specified in <DATABASE> must exist.
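As an aside, the connection URL follows the standard URL structure, so its components can be assembled and inspected with Python's standard library (a generic sketch; the values below are placeholders, not real credentials):

```python
from urllib.parse import urlsplit

# Assemble a connection URL from its parts (placeholder values only).
user, password, address, port, database = "postgres", "password", "localhost", 5432, "patsy"
url = f"postgresql+psycopg2://{user}:{password}@{address}:{port}/{database}"

# The components can be recovered with a standard URL parser.
parts = urlsplit(url)
print(parts.username, parts.hostname, parts.port, parts.path.lstrip("/"))
# prints postgres localhost 5432 patsy
```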

For example, to connect to a Postgres database named "patsy" on "localhost" with a username of "postgres", and a password of "password":

$ patsy --database postgresql+psycopg2://postgres:password@localhost:5432/patsy <COMMAND>

where <COMMAND> is the PATSy command to run.

PATSY_DATABASE environment variable

The "--database" argument can be omitted if a "PATSY_DATABASE" environment variable has been defined.

For SQLite, simply specify the filename of the SQLite file. For example, for "patsy-db.sqlite":

$ export PATSY_DATABASE=patsy-db.sqlite

For Postgres, use the database connection URL. For example, using the Postgres database connection URL for a local Docker container:

$ export PATSY_DATABASE=postgresql+psycopg2://postgres:password@localhost:5432/patsy

The "--database" argument can still be passed in to temporarily override the environment variable.

"load" command

Loads an "inventory" CSV file into the database.

$ patsy --database <DATABASE> load <INVENTORY_CSV_FILE>

The "inventory" CSV file is typically generated by the "preserve" (https://github.com/umd-lib/preserve) or "aws-archiver" (https://github.com/umd-lib/aws-archiver) tools.

Each row of the CSV file contains information about the batch, accession, and (optionally) location information about a preservation asset.

An accession is uniquely identified by its batch, and relative path (relpath) fields. If an accession already exists, the "load" command will NOT update the accession (even if the accession information in the CSV file is different). Location information for an existing accession will be added, unless the location already exists for that accession.
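These rules can be sketched in a few lines of Python (an illustrative model only, not PATSy's actual implementation; the data structures and values are made up):

```python
# Illustrative sketch of the "load" rules described above -- NOT PATSy's
# actual implementation, just the behavior it documents.
accessions = {}  # (batch, relpath) -> accession record
locations = {}   # (batch, relpath) -> set of storage locations

def load_row(batch, relpath, record, location=None):
    key = (batch, relpath)
    # An accession is uniquely identified by (batch, relpath);
    # an existing accession is never updated.
    if key not in accessions:
        accessions[key] = record
    # Location info is added unless it already exists for the accession.
    if location is not None:
        locations.setdefault(key, set()).add(location)

load_row("Batch1", "foo.pdf", {"md5": "abc"}, "bucket-a/foo.pdf")
load_row("Batch1", "foo.pdf", {"md5": "changed"}, "bucket-b/foo.pdf")
print(accessions[("Batch1", "foo.pdf")]["md5"])  # prints abc (unchanged)
```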

"checksum" command

Retrieves MD5 (the default), SHA1, or SHA256 checksums for one or more accessions, looked up by storage location.

$ patsy --database <DATABASE> checksum [--md5|--sha1|--sha256] [LOCATION [LOCATIONS...]]

Creates output like this:

088be3fe9a8fd2a7e70e66a602828766  libdc-archivebucket-17lowbw7m2av1/Archive000Florence/Florence.mpg
fe84e91e0a06906773a5c19a2e9620d9  libdc-archivebucket-17lowbw7m2av1/Archive000Football1/19461130-FB-002-2Qtr.mpg
9876f8c92e16b73c662a39b23409d0a0  libdc-archivebucket-17lowbw7m2av1/Archive000Football1/19461130-FB-003-2Half.mpg

Instead of listing locations on the command line, the checksum command also accepts a CSV file with "location" and "destination" columns. If a "destination" value is present, it is used for the second column of the output. Assuming the "destination" refers to an actual path on a local file system, the output can then be fed to "md5sum -c" (or another algorithm-appropriate checksum verification tool).

$ patsy --database <DATABASE> checksum [--md5|--sha1|--sha256] --file <CSV_FILE>
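The behavior with a CSV file can be sketched as follows (an illustrative Python model, not PATSy code; the checksum_lines helper, the file content, and the hash lookup are all made up for the example):

```python
import csv, io

def checksum_lines(csv_text, lookup):
    """Produce md5sum-style "<hash>  <path>" lines from a CSV with
    "location" and "destination" columns, as described above.
    `lookup` maps a location to its stored hash (a stand-in for the DB)."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Use "destination" for the second column when present,
        # otherwise fall back to the location itself.
        path = row.get("destination") or row["location"]
        lines.append(f'{lookup[row["location"]]}  {path}')
    return lines

csv_text = "location,destination\nbucket/Florence.mpg,/mnt/restore/Florence.mpg\n"
hashes = {"bucket/Florence.mpg": "088be3fe9a8fd2a7e70e66a602828766"}
print("\n".join(checksum_lines(csv_text, hashes)))
# prints 088be3fe9a8fd2a7e70e66a602828766  /mnt/restore/Florence.mpg
```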

"sync" command

Stores the ApTrust locations of accessions in PATSy.

$ patsy --database <DATABASE> sync --name <API-NAME> --key <API-KEY> [--timebefore|--timeafter] <YEAR-MONTH-DAY>

The <API-NAME> and <API-KEY> are the "X-Pharos-API-User" and "X-Pharos-API-Key" values, respectively, and are required by the command. They can be provided on the command line, or via environment variables:

$ export X_PHAROS_NAME=your-email
$ export X_PHAROS_KEY=the-secret-key

Values provided via these environment variables take precedence over values passed as command-line arguments.

The "timebefore" and "timeafter" parameters are dates provided to specify what bags to access from ApTrust. The dates should be formatted in "year-month-day" format (####-##-##).

"update" command

Updates fields in accession records, based on values in a CSV file.

$ patsy --database <DATABASE> update --batch <BATCH> \
        --db-compare-column <DB_COMPARE_COLUMN> --db-target-column <DB_TARGET_COLUMN> \
        --csv-compare-value <CSV_COMPARE_VALUE> --csv-update-value <CSV_UPDATE_VALUE> \
        <CSV_FILE>

where (in addition to the "<DATABASE>" field described above):

  • <BATCH> - the name of the batch to update
  • <DB_COMPARE_COLUMN> - the "accessions" database table column to compare values in the CSV file against
  • <DB_TARGET_COLUMN> - the "accessions" database table column to update
  • <CSV_COMPARE_VALUE> - the column in the CSV file containing the comparison values
  • <CSV_UPDATE_VALUE> - the column in the CSV file containing the update values
  • <CSV_FILE> - a CSV file with columns labelled <CSV_COMPARE_VALUE> and <CSV_UPDATE_VALUE>, holding the comparison and update values. An example is the output of the "partition" command from the "preserve" tool.

All fields are required.

There are three additional optional flags:

  • --dry-run - runs the command, but does not update the database
  • --skip-existing - only updates the database "target" field if the current value is NULL or the empty string.
  • --verbose - Provides additional information about individual updates
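The matching rules can be sketched in Python (an illustrative model only, not PATSy's implementation; batch filtering is omitted for brevity and all names are made up):

```python
def apply_updates(accessions, rows, db_compare, db_target,
                  csv_compare, csv_update, skip_existing=False):
    """Illustrative sketch of the "update" rules: for each CSV row,
    accessions whose db_compare column equals the row's csv_compare
    value have their db_target column set to the csv_update value.
    (Batch filtering is omitted here for brevity.)"""
    updated = 0
    for row in rows:
        for acc in accessions:
            if acc.get(db_compare) != row[csv_compare]:
                continue
            if skip_existing and acc.get(db_target) not in (None, ""):
                continue  # --skip-existing: only fill NULL/empty targets
            acc[db_target] = row[csv_update]
            updated += 1
    return updated

accessions = [{"relpath": "sample/foo.pdf"}]
rows = [{"relpath_old": "sample/foo.pdf", "relpath_new": "partitioned/foo.pdf"}]
apply_updates(accessions, rows, "relpath", "relpath", "relpath_old", "relpath_new")
print(accessions[0]["relpath"])  # prints partitioned/foo.pdf
```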

"relpath" Updates

One use case for "update" is updating the "relpath" of accessions, based on the CSV file generated by the "partition" command in the "preserve" toolkit (https://github.com/umd-lib/preserve).

The "partition" command produces a CSV file similar to the following:

relpath_old,relpath_new
sample/foo.pdf,partitioned/foo.pdf
...

Assuming that the CSV file generated by the "partition" command is named "partition-updates.csv", to update entries in the "EXAMPLE_BATCH" batch, the "update" command would be:

$ patsy update --batch EXAMPLE_BATCH \
        --db-compare-column relpath --db-target-column relpath \
        --csv-compare-value relpath_old --csv-update-value relpath_new \
        partition-updates.csv

Note that in this case, the "--db-compare-column" and "--db-target-column" parameters are the same ("relpath", as that is the name of the column in the "accessions" database table). The "--csv-compare-value" and "--csv-update-value" parameters use the header field names from the CSV file.

checksum Updates

Another use case for "update" is to update accessions that only have an "md5" hash value with additional hash values such as "sha256".

In this case, a CSV file such as the following might be created:

md5sum,sha256sum
<MD5SUM_VALUE>,<SHA256SUM_VALUE>
...

Assuming that the CSV file is named "update_sha256_based_on_md5.csv", to update entries in the "EXAMPLE_BATCH" batch, the "update" command would be:

$ patsy update --batch EXAMPLE_BATCH \
        --db-compare-column md5 --db-target-column sha256 \
        --csv-compare-value md5sum --csv-update-value sha256sum \
        --skip-existing \
        update_sha256_based_on_md5.csv

Note that in this case, the "--db-compare-column" and "--db-target-column" parameters are different: the "md5" column in the "accessions" database table is matched against the "md5sum" value in the CSV file, and the "sha256" column in "accessions" is updated with the "sha256sum" value from the CSV file. The "--skip-existing" flag ensures that existing "sha256" values in the database will not be overwritten.

Database Migrations

PATSy uses the "Alembic" (https://alembic.sqlalchemy.org/en/latest/) migration tool to handle updates to the database schema.

The database must be specified either by:

  • Setting up the "PATSY_DATABASE" environment variable, or

  • Running the "alembic" command with the "-x database=<DATABASE>" flag, with <DATABASE> being the same value passed to the "--database" argument for PATSy commands. For example:

    $ alembic -x database=postgresql+psycopg2://postgres:password@localhost:5432/patsy upgrade head

The alembic command will fail with a "patsy.database.DatabaseNotSetError" error if the database has not been set.

Run the database migrations

To run the database migrations, or to set up the database for the first time:

$ alembic upgrade head

Adding a new migration

$ alembic revision -m "<MIGRATION_DESCRIPTION>"

where <MIGRATION_DESCRIPTION> is a short description of the migration, such as "create local types table". This description will be included as part of the migration filename.

Auto-generating migrations

Note: Auto-generating migrations has significant limitations. See https://alembic.sqlalchemy.org/en/latest/autogenerate.html.

Alembic can attempt to "auto-generate" migrations. To do this, first modify the "declarative base" in "patsy/model.py", and then run the following command:

$ alembic revision --autogenerate -m "<MIGRATION_DESCRIPTION>"

where <MIGRATION_DESCRIPTION> is a short description of the migration, such as "create local types table". This description will be included as part of the migration filename.

Migrations and Views

As discussed in https://alembic.sqlalchemy.org/en/latest/batch.html#running-batch-migrations-for-sqlite-and-other-databases, SQLite does not support the "ALTER" statement for changing the database schema.

In PATSy, the Alembic "autogenerate" functionality has been configured to create "batch" migrations, which use a "move and copy" workflow for SQLite, where:

the existing table structure is reflected from the database, a new version of this table is created with the given changes, data is copied from the old table to the new table using “INSERT from SELECT”, and finally the old table is dropped and the new one renamed to the original name.
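The workflow can be demonstrated directly with Python's standard sqlite3 module (a minimal illustration of the steps above, not Alembic's actual code; the table is modeled loosely on "locations" and the added column is made up):

```python
import sqlite3

# Minimal demonstration of the "move and copy" workflow described above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE locations (id INTEGER PRIMARY KEY, path TEXT)")
con.execute("INSERT INTO locations (path) VALUES ('bucket/foo.pdf')")

# 1. Create a new version of the table with the desired change
#    (here, a hypothetical added "storage_provider" column).
con.execute("""CREATE TABLE _locations_new (
    id INTEGER PRIMARY KEY, path TEXT, storage_provider TEXT)""")
# 2. Copy the data across using INSERT from SELECT.
con.execute("INSERT INTO _locations_new (id, path) SELECT id, path FROM locations")
# 3. Drop the old table and rename the new one to the original name.
con.execute("DROP TABLE locations")
con.execute("ALTER TABLE _locations_new RENAME TO locations")

print(con.execute("SELECT path, storage_provider FROM locations").fetchall())
# prints [('bucket/foo.pdf', None)]
```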

This generally works, except when a table is used by a SQL View. For example, the "locations" table is used by the "patsy_records" view. In these cases, the Postgres migration will succeed, but a SQLite migration (including those used by the tests) will fail with an error such as:

sqlite3.OperationalError: error in view patsy_records: no such table: main.locations

To enable these migrations to work with SQLite, the affected view must be dropped prior to the migration, and restored after the migration:

  1. In the "upgrade" method:

    a) At the start of the method, drop the view, i.e.:

    op.execute("DROP VIEW IF EXISTS patsy_records")
    

    b) Perform the migration (typically the "with op.batch_alter_table(...) as batch_op:" block that was autogenerated)

    c) At the end of the method, restore the view. This will typically require retrieving the "ReplaceableObject" from the revision where the view was last created/updated. For example, for the "patsy_records_view" ReplaceableObject that was last modified in the "f693a44bd7fe" revision the code would be:

    # Restore the "patsy_records" view
    prior_view_module = op.get_context().script.get_revision('f693a44bd7fe').module
    obj = getattr(prior_view_module, 'patsy_records_view')
    batch_op.create_view(obj)
    
  2. In the "downgrade method, the same code changes as above are needed, i.e., drop the view, perform the migration, then restore the view.

An example of these changes is in alembic/versions/1e2e3bd85d6d_libitd_2254_add_locations_storage_.py

License

See the LICENSE file for license rights and limitations.