/bagit-ro

Research Object BagIt archive

Primary language: Shell. License: BSD 2-Clause "Simplified" License (BSD-2-Clause).

BagIt is an Internet Draft that specifies a file system structure for transferring and archiving a collection of files, including their checksums and brief metadata.

Research Object Bundle is a specification for a structured ZIP file, based on the ePub and Adobe UCF specifications. The bundle serializes a Research Object, embedding some or all of its resources within the ZIP file, and lists the RO content in a manifest, in addition to embedding and referencing annotations and provenance.

A BagIt bag can be considered a mechanism for serialization and transport consistency, while a Research Object can be considered a way to capture identity, annotations and provenance of the resources. As such, the two formats complement each other. They are, however, not directly compatible.

This GitHub repository explains by example a profile for a BagIt bag to also be a Research Object. Feel free to provide comments and raise issues, or suggest changes as pull requests.

Run the build.sh script (requires zip, md5sum, sha1sum, find) to generate example1.bagit.zip and the corresponding example1.bundle.zip.

Example overview

Overview of this example:

BagIt overview

A bag in BagIt is a base folder (in this example example1/) that contains the bagit declaration in bagit.txt. A bag contains a payload, the data files that are being transferred, in addition to tag files, metadata for the bag and its content.

A BagIt serialization is typically a tar- or zip-file which contains the base folder. BagIt archives include at the root a subdirectory for the base folder of the bag, e.g. the ZIP file would contain example1/bagit.txt.
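The single-base-folder layout described above can be checked from a plain listing of archive member paths. The following is a minimal sketch, not part of build.sh: `bag_base_of` is a hypothetical helper name, and the usage line assumes Info-ZIP's `unzip -Z1` is available to list member names.

```sh
#!/bin/sh
# Hypothetical helper: reads archive member paths on stdin and prints the
# single top-level folder they share. Fails if there is none or more than one.
bag_base_of() {
    # Keep only the first path component of every member, de-duplicated.
    bases=$(cut -d/ -f1 | sort -u)
    [ -n "$bases" ] || return 1
    # More than one distinct top-level entry means this is not a bag archive.
    [ "$(printf '%s\n' "$bases" | wc -l | tr -d ' ')" = 1 ] || return 1
    printf '%s\n' "$bases"
}

# Usage (assumption: Info-ZIP unzip is installed):
#   unzip -Z1 example1.bagit.zip | bag_base_of
```

The check only looks at path names; it does not confirm that bagit.txt is present inside the base folder.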

The payload of a bag is the files within a directory that is always called data. The data folder may contain arbitrary files and subdirectories. In this example we include a simple CSV data file, an analytical script, and the results of running that script. In addition, a textual README.md is included to describe this execution.

The payload files are listed in one or more manifest files that provide hashes of the file content. The BagIt specification names the two most common hashing mechanisms, md5 and sha1, which are represented by manifest-md5.txt and manifest-sha1.txt. Other hash mechanisms can also be added (e.g. sha512), but the content of any manifest-* file needs to follow the $hash $filename pattern.
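As an illustration of the $hash $filename pattern, a payload manifest could be produced and checked along these lines. This is a sketch under assumptions rather than the content of build.sh: `write_payload_manifest` and `verify_payload_manifest` are hypothetical helper names, and md5sum/sha1sum are assumed to be on the PATH (as build.sh also requires).

```sh
#!/bin/sh
# Hypothetical helper: write manifest-<alg>.txt for the payload of a bag.
#   $1 = bag base directory (e.g. example1)
#   $2 = md5 or sha1
write_payload_manifest() {
    bag=$1
    alg=$2
    (
        cd "$bag" || exit 1
        # Paths are recorded relative to the bag base, e.g. data/numbers.csv.
        # Sorting keeps the manifest stable between runs.
        find data -type f | LC_ALL=C sort | while IFS= read -r f; do
            "${alg}sum" "$f"
        done > "manifest-$alg.txt"
    )
}

# Hypothetical helper: re-hash every file listed in the manifest.
# Returns non-zero on any mismatch. It does not detect extra files in data/
# that are missing from the manifest.
verify_payload_manifest() {
    ( cd "$1" && "${2}sum" -c "manifest-$2.txt" >/dev/null )
}

# Usage:
#   write_payload_manifest example1 md5
#   verify_payload_manifest example1 md5
```

Each line written by md5sum or sha1sum is the hash followed by whitespace and the path, which matches the pattern described above for filenames without unusual characters.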

Files that are too big to practically include in a BagIt archive can be referenced externally in fetch.txt, which lists the URLs to download, the expected file sizes and the destination filenames within the bag base directory. The BagIt specification does not define which Accept* headers should be used in such a retrieval, or whether any authentication might be required. This example does not need to make any assumption here, as the referenced external.txt is only available in a single representation. The specification also does not define whether the resources in fetch.txt should be considered when creating manifest-* and Payload-Oxum; this example assumes they should not be included. Finally, it is undefined how to interpret a fetch.txt entry whose file already exists in the bag's data directory.
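To make the fetch.txt layout concrete, the sketch below reads each entry (URL, expected size, destination filename) and reports whether the destination already exists in the bag, which is the case the specification leaves undefined. `list_fetch_entries` is a hypothetical helper, not part of this repository. It deliberately performs no download, so no assumption about headers or authentication is needed.

```sh
#!/bin/sh
# Hypothetical helper: list the entries of <bag>/fetch.txt as tab-separated
#   status  length  filename  url
# where status is "present" if the destination already exists in the bag,
# otherwise "missing".
list_fetch_entries() {
    bag=$1
    # A bag without fetch.txt simply has no external files.
    [ -f "$bag/fetch.txt" ] || return 0
    # The third field takes the rest of the line, so filenames may hold spaces.
    # The "|| [ -n ... ]" part handles a final line without a trailing newline.
    while read -r url length filename || [ -n "$url" ]; do
        [ -n "$url" ] || continue
        if [ -e "$bag/$filename" ]; then
            status=present
        else
            status=missing
        fi
        printf '%s\t%s\t%s\t%s\n' "$status" "$length" "$filename" "$url"
    done < "$bag/fetch.txt"
}

# Usage:
#   list_fetch_entries example1
```

A caller can then decide its own policy for "present" entries, for example skipping the download or comparing the file size first.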

A bag can also contain other tag files, which would be listed in a separate tag manifest, e.g. tagmanifest-md5.txt and tagmanifest-sha1.txt. In this example, the tag manifest lists the content of the metadata directory. The BagIt specification does not define whether the remaining tag files (e.g. bag-info.txt or fetch.txt) should be included in the tag manifest; this example assumes they should not be included.

Research Object overview

A Research Object (RO) is conceptually an aggregation of related resources, an assignment of their identities, and any relevant annotations and provenance statements. The Research Object model specifies how to declare these relations, combining existing Linked Data standards such as OAI-ORE, the W3C Annotation Data Model and W3C PROV.

Serialized as a Research Object Bundle, some or all of those resources are included in the encapsulating ZIP archive together with a JSON-LD manifest, metadata/manifest.json.

A Research Object BagIt archive follows the same structure as a Research Object Bundle, except that the base directory is the bag base (e.g. example1/), rather than the root folder of the ZIP archive (/). The RO Bundle's .ro/ folder is instead called metadata/ in a Research Object BagIt.

The aggregates section of the manifest lists the payload files, both embedded resources (e.g. ../data/numbers.csv) and external resources (e.g. http://example.com/doc1). Note that local paths are under ../data/, relative to the metadata/ folder.

This aggregates listing provides hooks for additional metadata and provenance, e.g. mediatype, authoredBy and retrievedFrom. A file can claim to conform to a standard, minimum information checklist, requirements or similar using conformsTo.

If more detailed provenance is available, then history can link to a separate provenance trace, e.g. a PROV-O RDF file, although any kind of embedded or external provenance resource could be appropriate (e.g. log file, word document, git repository). Provenance can also be included for the research object itself.

Annotations about any of the resources in the bag (or the RO itself) can be linked to from the annotations section. Here about specifies one or more resources that are annotated, while content links to the annotation content, which could be any aggregated or external resource (e.g. ../data/README.md that describes analyse.py, numbers.csv and results.txt), or a metadata file under metadata/annotations/, typically in a Linked Data format. In this example, annotations/numbers.jsonld provides semantic annotations of ../data/numbers.csv in JSON-LD format.

It is customary in Research Object Bundles for non-payload (metadata) files to not be listed under aggregates and to be stored under .ro/. Research Object BagIt archives follow this convention (using metadata/), and in addition the payload files must exclusively be within the data/ folder (or be external URLs). The metadata/ content is listed in the tag manifest, while the data/ payload is listed in the payload manifest with external URLs in the fetch file.
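The rule that aggregated resources are either under data/ or external URLs can be expressed as a small classification step on each uri taken from the manifest. This is a sketch only: `classify_aggregate` is a hypothetical helper, the uri is assumed to have been extracted from metadata/manifest.json already, and path tricks such as ../data/../ are not handled.

```sh
#!/bin/sh
# Hypothetical helper: classify one "uri" value from the aggregates section.
#   $1 = bag base directory (e.g. example1)
#   $2 = uri as written in metadata/manifest.json
# Prints "local <path under bag>" or "external <url>".
# Returns non-zero for anything that is neither.
classify_aggregate() {
    bag=$1
    uri=$2
    case $uri in
        *://*)
            # Absolute URL: an external resource (possibly listed in fetch.txt).
            printf 'external %s\n' "$uri"
            ;;
        ../data/*)
            # Relative to metadata/, so dropping the leading ../ gives the
            # path relative to the bag base.
            printf 'local %s/%s\n' "$bag" "${uri#../}"
            ;;
        *)
            # E.g. a file under metadata/, which should not be aggregated.
            printf 'unexpected %s\n' "$uri"
            return 1
            ;;
    esac
}

# Usage:
#   classify_aggregate example1 ../data/numbers.csv
```

Stripping the leading ../ works because manifest.json sits directly in metadata/, one level below the bag base.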

Research Object BagIt archives SHOULD specify the BagIt profile for bagit-ro within bag-info.txt as:

BagIt-Profile-Identifier: https://w3id.org/ro/bagit/profile
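A tool that receives a bag could look for this line to decide whether to treat it as a Research Object BagIt archive. The following is a minimal sketch: `has_ro_profile` is a hypothetical helper, and it assumes the label and value sit on a single line with exactly this capitalization.

```sh
#!/bin/sh
# Hypothetical helper: succeed if <bag>/bag-info.txt declares the bagit-ro
# profile. Simplification: folded (continued) lines are not handled.
has_ro_profile() {
    grep -qx 'BagIt-Profile-Identifier:[[:space:]]*https://w3id\.org/ro/bagit/profile[[:space:]]*' \
        "$1/bag-info.txt"
}

# Usage:
#   if has_ro_profile example1; then echo "Research Object BagIt"; fi
```

Because the profile is only a SHOULD, a missing line does not prove that the bag lacks metadata/manifest.json.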

Considerations

The combination of BagIt and Research Object adds:

  • RO consistency with checksums for payload and metadata
  • Structured metadata, provenance and annotations for the bag and its content
    • With extensions in JSON-LD using any Linked Data vocabulary
  • Graceful degradation/conversion to plain BagIt or RO Bundle

An RO Bundle is fundamentally not very different from an archived BagIt bag, except that in the RO Bundle, the .ro/ folder is in the root directory together with a marker mimetype file to help mime magic-like tools identify the file type.

BagIt serialization mandates that a BagIt archive contains only a single directory when unpacked, which is the base directory of the bag. While in theory a hybrid RO Bundle and BagIt ZIP archive could exist, it would have to use the bag name .ro and could not include the mimetype file (without a binary zip file hack). In addition the payload would then be contained in .ro/data/, which is not what you would expect from the RO Bundle specification and which would hide all content from Unix/Linux users.

The approach shown here is therefore a variation of RO Bundle which contains the Research Object within a bag of an arbitrary name. Thus the RO manifest in a Research Object BagIt archive is, in this example, at example1/metadata/manifest.json rather than .ro/manifest.json.

The interpretation of manifest.json according to the RO Bundle specification assumes that /, the root of the ZIP file, is also the root of the RO.

A BagIt bag is not necessarily rooted within an archive, and could be living standalone within a file system directory, or be exposed on the Web at an arbitrary URL base. The name of the containing bag is not declared outside its directory name. The RO manifest and annotations in this approach therefore use only relative URI paths, e.g. ../data/analyse.py, while the RO Bundle manifest would have used /data/analyse.py.

Developers can struggle to generate correct relative paths. An alternative approach of moving /metadata/manifest.json to /manifest.json could improve on this, but the manifest would then no longer be easily usable as an RO Bundle manifest as well, because its relative paths would differ.

The build.sh script shows how this structure means that a Research Object BagIt archive can be converted to a Research Object Bundle by adding the mimetype file and simply archiving from within the bag directory.

A similar conversion from RO Bundle to Research Object BagIt would require moving its embedded resources to data/ and rewriting the local paths in its manifest and annotations. See bundle-to-bagit.sh for an example.

Having two kinds of manifests (manifest-sha1.txt and metadata/manifest.json) can be confusing, and can lead to inconsistency if a tool that supports only one of these kinds modifies an RO BagIt.

The bag-info.txt format supports some basic bag-level metadata, e.g. Bagging-Date, Contact-Phone and Organization-Address. While some of these might seem archaic, the specification notes that "other arbitrary metadata elements may also be present", which allows extensions.

The BagIt specification has no requirements for such alternative elements (e.g. they are not RFC 2822 headers), and it is unclear whether whitespace (e.g. newlines and indentation) forms part of the BagIt values.

It is recommended that only the basic metadata is provided in bag-info.txt, while more structured metadata and provenance should be provided in the Research Object manifest or annotations.