Cornell University Library Archival Repository Storage Manifest Specification

Purpose

The primary purpose is to support workflows that move and copy packages (and the files that comprise them), replicate them, and verify their fixity and completeness.

The secondary purpose is to enable some very basic management tasks, for which we need to know ownership/stewardship information (see collection_id, depository, steward), links to system identifiers and collection documentation (see bibid, local_id, documentation), and basic file information (see size, ingest_date, filetype). More sophisticated management tasks will rely on other services and on descriptive and technical metadata not included in the manifest.

Descriptive and technical metadata

Manifests will not include descriptive and certain technical metadata, this will be either:

in the package (in some way that we can find by inspecting the package --> require local standard)
in a linked reference system (connected via bibid, local_id and/or package_id)

Other services (e.g. disovery, dissemination) will need to extract/access this information to understand more extensive item and file metadata.

Access and usage rights information

Manifests will not include access and usage rights information, they will link to such information via the collection level documentation property. At this point, we are not providing machine-actionable, item-level or package-level rights information for collection assets.

Manifest format

A manifest is a JSON document which includes specific details at the collection, package, and item level for digital asssets deposited into CULAR. At the top-level it is an array of collection objects, each of which has one or more package objects, each of which has one or more file objects.

The manifest is created in two stages. The first stage, the "ingest manifest" lists all of the files being furnished; optionally, fixity information for those files; how files are arranged into packages; and basic collection identification information. At this stage, the CULAR application ensure that all the files in the source directory are referenced in the "ingest manifest", only the files referenced in the "ingest manifest" exist in the source directory, and updates the source_path field so that the absolute path for each file can be determined for transfer. The requirements for this stage of the manifest are listed in the table below, under the column labeled "Ingest Requirements".

The CULAR application generates a "storage manifest" from the "ingest manifest" after the ingest (transfer, fixity check) is complete. For each file referenced in the "ingest manifest", the "storage manifest" populates the ingest_date, tool_version, and media_type. For new collections, the location field is added; for existing collections, the location field is appended to when appropriate. Any field listed as optional or not-allowed under "Ingest Requirements" and required under "Storage Requirements" will be filled in during this stage. The requirements for the "storage manifest" are listed in the table below, under the column labeled "Storage Requirements".

For examples, see example manifest JSON for ingest and example manifest JSON for storage.

Collection properties

Property	Ingest Requirements	Storage Requirements	Description
`collection_id`	required	required	The intellectual aggregation as assembled by the steward acting as depositor. In the case of RMC entities, use Archival Collection IDs. If collection is not archival, but cataloged, use BibID. Must be provided if available. Examples: `RMM06885` (Bolivian Pamphlets), `RMA03590` (Cornell Hockey Films), `5780-156` (Kheel). Primarily letters and numbers, case sensitive, may contain a space, dash or underscore, must not contain a `/`.
`depositor`	required	required	The subject area designation driven off the area list and Archival units (`RMC/RMM`, `RMC/RMA`, `Kheel`, `ILR`, `Music`, etc).
`steward`	required	required	The netID of the Digital Collection steward. String must match netID pattern.
`documentation`	required	required	A pointer to where to find collection-level documentation (i.e., CULAR PID).
`locations`	not-allowed	required	An array of base URI locations where every package described in this manifest in this collection is stored or to be stored.
`packages`	required	required	Array of package objects
`number_packages`	optional	required	The number of entries in the `packages` array, allows self-checking for consistency if present. An integer.

Package properties

Each object in the packages array may have the following properties:

Property	Required/Optional for Ingest	Required/Optional for Storage	Description
`package_id`	required	required	URI identifier for the package. MUST be unique within Cornell collections so that it can be used as the primary key for access to packages. Use UUID in URI form, e.g. `urn:uuid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6` (following RFC4122 and IANA) for all packages.
`source_path`	required	not-allowed	Must be left blank in ingest manifest and is used by ingest code. Value not retained in storage manifest.
`bibid`	optional	optional	Bibliographic record id this package is associated with, SHOULD be provided if available. (Note that this value is intended for identifying the bibliographic record of the assets specific to this package, rather than for the collection as a whole.)
`local_id`	optional	optional	Physical item identifier, SHOULD be provided by depositor, if available.
`files`	required	required	An array of objects describing each file/object in the manifest. We use `files` even though they are `objects/resources` in some storage technologies like AWS S3.
`number_files`	optional	required	The number of entries in the `files` array, allows self-checking for consistency if present. An integer.

File properties

Each object in the files array may have the following properties:

Property	Required/Optional for Ingest	Required/Optional for Storage	Description
`filepath`	required	required	Path and filename of the file within the package. The character `/` MUST be used as a path separator (not `\` as is used on Windows systems). Following Bagit, if a `filepath` includes a Line Feed (LF), a Carriage Return (CR), a Carriage-Return Line Feed (CRLF), or a percent sign (%), those characters (and only those) MUST be percent-encoded following [RFC3986]
`sha1`	optional	required	SHA-1 hash of data (hex encoded using lowercase alphas, same as output from `sha1sum`, e.g. `021ea82f0468043e81a734b1342b1e64904672b0`). If present for ingest, it will be verified; otherwise it will be calculated by ingest code.
`md5`	optional	optional	MD5 hash of data (hex encoded using lowercase alphas, same as output from `md5sum`, e.g. `d41d8cd98f00b204e9800998ecf8427e`). May or may not be present on ingest, will be verified and retained if present
`size`	optional	required	Size of the file in bytes, an integer value. If not present for ingest, will be calculated by ingest code.
`ingest_date`	not-allowed	required	Date of ingest of the file.
`tool_version`	required	required	Must be left blank in ingest manifest. String representing the tool and version of the file identification utility run. (e.g., `tika-2.1.0`)
`media_type`	required	required	Must be left blank in ingest manifest. The media type of the file referenced by `filepath` using the tool referenced in `tool_version`.

cul-it/cular-metadata