Simplification and enhancement of metadata json in export packages
karenhanson opened this issue · 2 comments
Feature Description
This is feedback related to the Mellon-funded Embedding Preservability project and from Portico's analysis of how we can use the export package metadata for preservation.
The current arrangement of the metadata files in the export package makes it difficult to automatically extract consistent descriptive metadata at each level of the project. Some key information is also missing from the metadata, which would make the package harder to comprehend in the absence of the Manifold platform to render it. We suggest the following refinements to the metadata to facilitate a richer, more elegant preservation approach:
- The project-level and text-level metadata does not always include key fields, such as title. It would be helpful if a metadata.json file containing core descriptive metadata were found in these locations in the package:
  - /data/metadata.json (representing project metadata found on the hero page)
  - /data/texts/{book-x}/metadata.json (representing bibliographic metadata for the text)
  - /data/resources/{resource-y}/metadata.json (representing descriptive metadata for the resource)

  Ideally these three metadata.json files would share field names and a similar data model where they overlap (title, description, creators, etc.) to make them easier to process and validate at scale.
- For Resources specifically, a single metadata.json in the root of the resource folder could replace the handful of small metadata files that are found in each resource folder (currently metadata.json, kind, external_url, txt/caption, txt/description, txt/title, etc.). The current arrangement makes them especially difficult to process and validate.
- When there are multiple texts in a package, it’s not clear which is the core text. It would be helpful if the project-level metadata.json indicated which text is the core text so that we can highlight that as separate from the other supporting texts – otherwise we would need to list all texts as equally important.
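To illustrate the resource consolidation suggested above, here is a minimal Python sketch that merges the small per-resource files into a single record. It is not Manifold code: the function name `consolidate_resource`, the `SMALL_FILES` mapping, and the output field names are hypothetical, and it assumes the small files are plain text and that the existing metadata.json holds a JSON object.

```python
import json
from pathlib import Path

# Hypothetical mapping from the small per-resource files named in this issue
# to field names in one consolidated record. The field names are assumptions.
SMALL_FILES = {
    "kind": "kind",
    "external_url": "external_url",
    "txt/title": "title",
    "txt/caption": "caption",
    "txt/description": "description",
}


def consolidate_resource(resource_dir: Path) -> dict:
    """Merge a resource folder's scattered metadata files into one dict."""
    record = {}

    # Start from the existing metadata.json, if present (assumed to be a JSON object).
    existing = resource_dir / "metadata.json"
    if existing.is_file():
        record.update(json.loads(existing.read_text(encoding="utf-8")))

    # Layer the single-value files on top, skipping any that are absent.
    for relative_path, field in SMALL_FILES.items():
        path = resource_dir / relative_path
        if path.is_file():
            record[field] = path.read_text(encoding="utf-8").strip()

    return record
```

Having to write something like this for every resource folder is the kind of processing overhead a single metadata.json per resource would remove.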
Why is this feature important? Who does it help?
Anyone who needs to transform the export package metadata, for example during a platform migration, would benefit from these improvements. I'm specifically logging this to facilitate a richer preservation approach that allows us to extract metadata at each level of the project and better manage the archived copy in the future.
User Stories
As a preservation archive, I'd like the metadata in the export packages to be simpler and more consistent so that it is easier to understand and extract critical descriptive information to (1) facilitate discovery at the resource, text, and project level and (2) capture information about the arrangement of the work.
Design Notes
Background that might support design considerations: Portico typically extracts and normalizes key descriptive metadata for e-books at the book level. This metadata is used to generate a rich index of the archive's content to ensure items can be discovered and retrieved. As part of the descriptive metadata normalization process, we also perform some validation actions to ensure we have the minimum metadata required. Since Manifold supports projects with multiple texts and resources, we would like to be able to extract the descriptive metadata for each component and record the relationships between the project and its component parts. This would allow us to manage each part separately, supporting a more granular approach to preserving these projects. The current arrangement of the metadata files in the export package would make this very difficult and costly to execute or validate well. Our only practical option at the moment would be to ingest the entire package and extract project-level metadata only. Future users would then need to work out how the pieces fit together and how the project should render. We would prefer to extract richer information now if it can be done elegantly so that we have a pathway to facilitate future rendering of these projects at scale.
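As a rough sketch of the extraction and validation described above, the Python below walks the proposed layout and reports which components lack required fields. It assumes the three metadata.json locations suggested in this issue exist under a data directory. `REQUIRED_FIELDS`, `read_json`, `missing_fields`, and `index_package` are hypothetical names, and treating only `title` as required is an assumption for illustration, not Portico's actual rule set.

```python
import json
from pathlib import Path

# Assumed minimum fields; this issue only names "title" explicitly.
REQUIRED_FIELDS = ("title",)


def read_json(path: Path) -> dict:
    """Return the parsed JSON object, or an empty dict if the file is absent."""
    if path.is_file():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def missing_fields(metadata: dict) -> list:
    """List required fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not metadata.get(field)]


def index_package(data_dir: Path) -> dict:
    """Collect metadata for the project and each text and resource.

    Keys such as "texts/book-1" record where each component sits
    relative to the project, which captures the arrangement of the work.
    """
    components = {"project": data_dir / "metadata.json"}
    for kind in ("texts", "resources"):
        parent = data_dir / kind
        if parent.is_dir():
            for child in sorted(parent.iterdir()):
                if child.is_dir():
                    components[f"{kind}/{child.name}"] = child / "metadata.json"

    index = {}
    for name, path in components.items():
        metadata = read_json(path)
        index[name] = {"metadata": metadata, "missing": missing_fields(metadata)}
    return index
```

With a shared data model, the same `missing_fields` check applies unchanged at all three levels, which is what makes validation at scale inexpensive.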
Development Notes
N/A
Linked to PROJ-2842 internally.
Included in v8.