The primary purpose is to support workflows that move and copy packages (and the files that comprise them), replicate them, and verify their fixity and completeness.
The secondary purpose is to enable some very basic management tasks, for which we need to know ownership/stewardship information (see `collection_id`, `depositor`, `steward`), links to system identifiers and collection documentation (see `bibid`, `local_id`, `documentation`), and basic file information (see `size`, `ingest_date`, `filetype`). More sophisticated management tasks will rely on other services and on descriptive and technical metadata not included in the manifest.
Manifests will not include descriptive metadata or certain technical metadata. This information will be held either:
- in the package (in some way that we can find by inspecting the package, which requires a local standard)
- in a linked reference system (connected via `bibid`, `local_id`, and/or `package_id`)
Other services (e.g. discovery, dissemination) will need to extract/access this information to understand more extensive item and file metadata.
Manifests will not include access and usage rights information; they will link to such information via the collection-level `documentation` property. At this point, we are not providing machine-actionable, item-level or package-level rights information for collection assets.
A manifest is a JSON document which includes specific details at the collection, package, and item level for digital assets deposited into CULAR. At the top level it is an array of collection objects, each of which has one or more package objects, each of which has one or more file objects.
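
The nesting described above can be sketched in Python as follows. This is a minimal illustration, not the CULAR implementation: the values are placeholders, most required properties are omitted for brevity, and the helper `iter_files` is a hypothetical name introduced here.

```python
import json

# Placeholder manifest showing only the nesting: collections -> packages -> files.
# Most required properties are omitted; all values are illustrative.
manifest = [
    {
        "collection_id": "RMA03590",
        "packages": [
            {
                "package_id": "urn:uuid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6",
                "files": [
                    {"filepath": "reel01/film.mov"},
                ],
            }
        ],
    }
]


def iter_files(manifest):
    """Yield (collection_id, package_id, filepath) for every file object."""
    for collection in manifest:
        for package in collection["packages"]:
            for file_obj in package["files"]:
                yield (
                    collection["collection_id"],
                    package["package_id"],
                    file_obj["filepath"],
                )


# Serialize the structure the same way it would be stored on disk.
manifest_json = json.dumps(manifest, indent=2)
```

Walking the structure with a generator like this keeps the collection and package context available for each file, which is what the move, copy, and fixity workflows need.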
The manifest is created in two stages. The first stage, the "ingest manifest", lists all of the files being furnished; optionally, fixity information for those files; how files are arranged into packages; and basic collection identification information. At this stage, the CULAR application ensures that all the files in the source directory are referenced in the "ingest manifest" and that only the files referenced in the "ingest manifest" exist in the source directory, and it updates the `source_path` field so that the absolute path for each file can be determined for transfer. The requirements for this stage of the manifest are listed in the table below, under the column labeled "Ingest Requirements".
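
The two-way check between the source directory and the "ingest manifest" could look roughly like the sketch below. It is not the CULAR application's code: the function name is hypothetical, and it assumes that each `filepath` is relative to the package's source directory, contains no percent-encoded characters, and that the manifest file itself is not stored in that directory.

```python
from pathlib import Path


def compare_manifest_to_directory(package, source_dir):
    """Return (missing_on_disk, unlisted_on_disk) for one package object.

    missing_on_disk: filepaths listed in the manifest but absent from source_dir.
    unlisted_on_disk: files present in source_dir but not listed in the manifest.
    """
    listed = {file_obj["filepath"] for file_obj in package["files"]}
    root = Path(source_dir)
    # as_posix() yields '/'-separated relative paths, matching the filepath rule.
    on_disk = {
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file()
    }
    return sorted(listed - on_disk), sorted(on_disk - listed)
```

Both lists must be empty for the ingest to proceed under the rule stated above.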
The CULAR application generates a "storage manifest" from the "ingest manifest" after the ingest (transfer, fixity check) is complete. For each file referenced in the "ingest manifest", the "storage manifest" populates the `ingest_date`, `tool_version`, and `media_type`. For new collections, the `locations` field is added; for existing collections, the `locations` field is appended to when appropriate. Any field listed as optional or not-allowed under "Ingest Requirements" and required under "Storage Requirements" will be filled in during this stage. The requirements for the "storage manifest" are listed in the table below, under the column labeled "Storage Requirements".
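
As a rough sketch of this second stage, the function below derives a storage-style collection object from an ingest-style one. It is an assumption-laden illustration rather than the CULAR application's logic: the function and parameter names are hypothetical, the date is passed in as a ready-made string because no date format is specified here, and the fixity, `tool_version`, and `media_type` fields are left out.

```python
import copy


def to_storage_collection(ingest_collection, location, ingest_date,
                          existing_locations=None):
    """Return a new collection object with storage-stage fields filled in."""
    collection = copy.deepcopy(ingest_collection)  # leave the input untouched

    # New collections get a fresh list; existing ones are appended to.
    locations = list(existing_locations or [])
    if location not in locations:
        locations.append(location)
    collection["locations"] = locations

    collection["number_packages"] = len(collection["packages"])
    for package in collection["packages"]:
        package.pop("source_path", None)  # not retained in the storage manifest
        package["number_files"] = len(package["files"])
        for file_obj in package["files"]:
            file_obj["ingest_date"] = ingest_date
    return collection
```

Working on a deep copy keeps the "ingest manifest" available unchanged for auditing after the "storage manifest" has been written.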
For examples, see example manifest JSON for ingest and example manifest JSON for storage.
Property | Ingest Requirements | Storage Requirements | Description |
---|---|---|---|
`collection_id` | required | required | The intellectual aggregation as assembled by the steward acting as depositor. In the case of RMC entities, use Archival Collection IDs. If collection is not archival, but cataloged, use BibID. Must be provided if available. Examples: `RMM06885` (Bolivian Pamphlets), `RMA03590` (Cornell Hockey Films), `5780-156` (Kheel). Primarily letters and numbers, case sensitive, may contain a space, dash or underscore, must not contain a `/`. |
`depositor` | required | required | The subject area designation driven off the area list and Archival units (`RMC/RMM`, `RMC/RMA`, `Kheel`, `ILR`, `Music`, etc). |
`steward` | required | required | The netID of the Digital Collection steward. String must match netID pattern. |
`documentation` | required | required | A pointer to where to find collection-level documentation (i.e., CULAR PID). |
`locations` | not-allowed | required | An array of base URI locations where every package described in this manifest in this collection is stored or to be stored. |
`packages` | required | required | Array of package objects. |
`number_packages` | optional | required | The number of entries in the `packages` array, allows self-checking for consistency if present. An integer. |
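
A minimal sketch of how the collection-level "Ingest Requirements" column might be checked is shown below. The function name and messages are hypothetical. Only the rules stated in the table are enforced; the netID pattern for `steward` is not specified here, so it is not checked.

```python
INGEST_REQUIRED = ("collection_id", "depositor", "steward", "documentation", "packages")


def check_ingest_collection(collection):
    """Return a list of problems found in one collection object (empty if none)."""
    problems = [
        f"missing required property: {name}"
        for name in INGEST_REQUIRED
        if name not in collection
    ]
    if "locations" in collection:
        problems.append("locations is not allowed in an ingest manifest")
    if "/" in str(collection.get("collection_id", "")):
        problems.append("collection_id must not contain '/'")
    packages = collection.get("packages", [])
    # number_packages is optional at ingest, but must be consistent if present.
    if "number_packages" in collection and collection["number_packages"] != len(packages):
        problems.append("number_packages does not match the length of packages")
    return problems
```

Returning a list of problems instead of raising on the first one lets a depositor fix every issue in a single pass.
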
Each object in the `packages` array may have the following properties:
Property | Required/Optional for Ingest | Required/Optional for Storage | Description |
---|---|---|---|
`package_id` | required | required | URI identifier for the package. MUST be unique within Cornell collections so that it can be used as the primary key for access to packages. Use UUID in URI form, e.g. `urn:uuid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6` (following RFC4122 and IANA) for all packages. |
`source_path` | required | not-allowed | Must be left blank in ingest manifest and is used by ingest code. Value not retained in storage manifest. |
`bibid` | optional | optional | Bibliographic record id this package is associated with, SHOULD be provided if available. (Note that this value is intended for identifying the bibliographic record of the assets specific to this package, rather than for the collection as a whole.) |
`local_id` | optional | optional | Physical item identifier, SHOULD be provided by depositor, if available. |
`files` | required | required | An array of objects describing each file/object in the manifest. We use `files` even though they are objects/resources in some storage technologies like AWS S3. |
`number_files` | optional | required | The number of entries in the `files` array, allows self-checking for consistency if present. An integer. |
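
One way to mint and check `package_id` values in the `urn:uuid:` form uses Python's standard `uuid` module, as sketched below. The helper names are hypothetical. A random (version 4) UUID is assumed for new identifiers; the table does not say which UUID version to use.

```python
import uuid


def new_package_id():
    """Return a fresh identifier such as 'urn:uuid:xxxxxxxx-xxxx-...'."""
    return uuid.uuid4().urn


def is_uuid_urn(value):
    """Return True if value is 'urn:uuid:' followed by a hyphenated UUID."""
    prefix = "urn:uuid:"
    if not value.startswith(prefix):
        return False
    rest = value[len(prefix):]
    try:
        parsed = uuid.UUID(rest)
    except ValueError:
        return False
    # uuid.UUID accepts several spellings; require the canonical hyphenated one.
    return str(parsed) == rest.lower()
```

Checking the format does not establish uniqueness across Cornell collections; that still has to be verified against the existing set of package identifiers.
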
Each object in the `files` array may have the following properties:
Property | Required/Optional for Ingest | Required/Optional for Storage | Description |
---|---|---|---|
`filepath` | required | required | Path and filename of the file within the package. The character `/` MUST be used as a path separator (not `\` as is used on Windows systems). Following BagIt, if a filepath includes a Line Feed (LF), a Carriage Return (CR), a Carriage-Return Line Feed (CRLF), or a percent sign (%), those characters (and only those) MUST be percent-encoded following [RFC3986]. |
`sha1` | optional | required | SHA-1 hash of data (hex encoded using lowercase alphas, same as output from `sha1sum`, e.g. `021ea82f0468043e81a734b1342b1e64904672b0`). If present for ingest, it will be verified; otherwise it will be calculated by ingest code. |
`md5` | optional | optional | MD5 hash of data (hex encoded using lowercase alphas, same as output from `md5sum`, e.g. `d41d8cd98f00b204e9800998ecf8427e`). May or may not be present on ingest; will be verified and retained if present. |
`size` | optional | required | Size of the file in bytes, an integer value. If not present for ingest, will be calculated by ingest code. |
`ingest_date` | not-allowed | required | Date of ingest of the file. |
`tool_version` | required | required | Must be left blank in ingest manifest. String representing the tool and version of the file identification utility run (e.g., `tika-2.1.0`). |
`media_type` | required | required | Must be left blank in ingest manifest. The media type of the file referenced by `filepath` using the tool referenced in `tool_version`. |
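
The `filepath` encoding rule and the `sha1`, `md5`, and `size` fields can be illustrated with the standard library, as in the sketch below. The function names are hypothetical and this is not the ingest code itself. Files are read in chunks so that large files do not have to fit in memory.

```python
import hashlib
import os


def encode_filepath(path):
    """Percent-encode only '%', CR, and LF, leaving all other characters as-is."""
    # Encode '%' first so the escapes added afterwards are not encoded again.
    return path.replace("%", "%25").replace("\r", "%0D").replace("\n", "%0A")


def fixity(file_path, chunk_size=1024 * 1024):
    """Return lowercase hex sha1 and md5 digests plus the size in bytes."""
    sha1 = hashlib.sha1()
    md5 = hashlib.md5()
    with open(file_path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha1.update(chunk)
            md5.update(chunk)
    return {
        "sha1": sha1.hexdigest(),
        "md5": md5.hexdigest(),
        "size": os.path.getsize(file_path),
    }
```

When the "ingest manifest" already supplies `sha1` or `md5`, comparing the supplied value with the computed one gives the verification step described in the table.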