Problem: mets-reader-writer places its own restrictions on METS profiles it can read and then validate (metsrw)
ross-spencer opened this issue · 1 comment
Expected behaviour
It may be desirable (convenient?) to load any METS document into mets-reader-writer to perform validation against the METS schema.
Current behaviour
mets-reader-writer limits what it can import by checking for the existence of various properties within the METS document as it is loaded. The reader-writer could potentially be more general purpose.
As an example:
When mets-reader-writer loads XML from a file, it calls fromfile, then fromtree, then _parse_tree (see the stack trace under "Steps to reproduce").
In _parse_tree we seek the existence of a physical structMap and raise an error if one isn't found: raise exceptions.ParseError("No physical structMap found.")
A structMap, however, isn't a mandatory element of a METS file. And here (1.12), looking specifically for a physical structMap is also an additional stipulation affecting our ability to load any particular METS.
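The check described above, and one way it might be relaxed, can be sketched as follows. This is a minimal illustration only, not metsrw's actual code: it uses the standard library's ElementTree, and the function find_structmap and its strict flag are hypothetical names introduced for this example.

```python
import xml.etree.ElementTree as ET

METS_NS = {"mets": "http://www.loc.gov/METS/"}


def find_structmap(root, strict=True):
    """Pick the structMap to parse (hypothetical helper, not part of metsrw).

    strict=True mimics the behaviour reported in this issue: a physical
    structMap must exist. strict=False falls back to the first structMap
    of any TYPE, or None when the document has no structMap at all.
    """
    structmaps = root.findall("mets:structMap", METS_NS)
    for structmap in structmaps:
        if structmap.get("TYPE") == "physical":
            return structmap
    if strict:
        raise ValueError("No physical structMap found.")
    return structmaps[0] if structmaps else None


# A cut-down document with only a logical structMap.
SAMPLE = """<?xml version="1.0"?>
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="logical">
    <mets:div TYPE="book" LABEL="Example"/>
  </mets:structMap>
</mets:mets>"""

root = ET.fromstring(SAMPLE)
lenient = find_structmap(root, strict=False)
print(lenient.get("TYPE"))
```

With strict=True the same document raises an error, which matches the failure reported here; the lenient mode is just one possible design, and a real change would also need the rest of the parser to cope with a non-physical structMap.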
Steps to reproduce
A sample METS document containing only a logical structMap, which will fail to load, is as follows:
<?xml version="1.0" encoding="utf-8"?>
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
<mets:structMap TYPE="logical">
<mets:div TYPE="book" LABEL="How to create a hierarchical book">
<mets:div TYPE="page" LABEL="Cover">
<mets:fptr FILEID="cover.jpg"/>
</mets:div>
<mets:div TYPE="page" LABEL="Inside cover">
<mets:fptr FILEID="inside_cover.jpg"/>
</mets:div>
<mets:div TYPE="chapter" LABEL="Chapter 1">
<mets:div TYPE="page" LABEL="Page 1">
<mets:fptr FILEID="page_01.jpg"/>
</mets:div>
<mets:div TYPE="subchapter" LABEL="Subchapter 1.1">
<mets:div TYPE="page" LABEL="Page 2">
<mets:fptr FILEID="page_02.jpg"/>
</mets:div>
<mets:div TYPE="page" LABEL="Page 3">
<mets:fptr FILEID="page_03.jpg"/>
</mets:div>
<mets:div TYPE="page" LABEL="Page 4">
<mets:fptr FILEID="page_04.jpg"/>
</mets:div>
<mets:div TYPE="subchapter" LABEL="Subchapter 1.2">
<mets:div TYPE="page" LABEL="Page 5">
<mets:fptr FILEID="page_05.jpg"/>
</mets:div>
<mets:div TYPE="page" LABEL="Page 6">
<mets:fptr FILEID="page_06.jpg"/>
</mets:div>
<mets:div TYPE="page" LABEL="Page 7">
<mets:fptr FILEID="page_07.jpg"/>
</mets:div>
</mets:div>
<!-- Subchapter 1.2 -->
</mets:div>
<!-- Subchapter 1.1 -->
</mets:div>
<!-- Chapter 1 -->
<!-- Chapters 2 and 3, each with their own subchapters as in Chapter 1, omitted from this example. -->
<mets:div TYPE="afterword" LABEL="Afterword">
<mets:div TYPE="page" LABEL="Page 20">
<mets:fptr FILEID="page_20.jpg"/>
</mets:div>
</mets:div>
<!-- afterword -->
<mets:div TYPE="index" LABEL="Index">
<mets:div TYPE="page" LABEL="Index, page 1">
<mets:fptr FILEID="index_01.jpg"/>
</mets:div>
<mets:div TYPE="page" LABEL="Index, page 2">
<mets:fptr FILEID="index_02.jpg"/>
</mets:div>
</mets:div>
<!-- index -->
<mets:div TYPE="page" LABEL="Back cover">
<mets:fptr FILEID="back_cover.jpg"/>
</mets:div>
<!-- back cover -->
</mets:div>
<!-- book -->
</mets:structMap>
</mets:mets>
An attempt to load this will result in the following stack trace:
Traceback (most recent call last):
File "validate-mets.py", line 79, in <module>
main()
File "validate-mets.py", line 75, in main
use_mets(args.mets)
File "validate-mets.py", line 47, in use_mets
mets = load_mets(filename[0])
File "validate-mets.py", line 35, in load_mets
mets = metsrw.METSDocument.fromfile(filename) # Reads a file
File "/usr/local/lib/python2.7/dist-packages/metsrw/mets.py", line 563, in fromfile
return cls.fromtree(etree.parse(path, parser=parser))
File "/usr/local/lib/python2.7/dist-packages/metsrw/mets.py", line 587, in fromtree
mets._parse_tree(tree)
File "/usr/local/lib/python2.7/dist-packages/metsrw/mets.py", line 506, in _parse_tree
raise exceptions.ParseError("No physical structMap found.")
metsrw.exceptions.ParseError: No physical structMap found.
Your environment (version of Archivematica, OS version, etc)
metsrw-0.3.7
Additional context
Validation could be done via mets-rw for custom structmaps rather than via xmllint in archivematicaVerifyMETS.sh.
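As a rough sketch of what schema validation from Python could look like, the following uses lxml's XMLSchema class directly rather than any metsrw API. The function name validate_against_schema is an assumption made for illustration. Note that the METS XSD imports other schemas, so resolving them may need network access or local copies.

```python
from lxml import etree


def validate_against_schema(document, schema_source):
    """Validate an XML document against an XSD (hypothetical helper).

    Both arguments may be file paths or file-like objects, since
    etree.parse accepts either. Returns (is_valid, error_messages).
    """
    schema = etree.XMLSchema(etree.parse(schema_source))
    doc = etree.parse(document)
    is_valid = schema.validate(doc)
    # error_log is empty when validation succeeds.
    messages = [entry.message for entry in schema.error_log]
    return is_valid, messages
```

For example, validate_against_schema("METS.xml", "mets.xsd") would return a boolean and a list of messages; both file names are placeholders, not paths taken from Archivematica.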
For Artefactual use:
Please make sure these steps are taken before moving this issue from Review to Verified in Waffle:
- All PRs related to this issue are properly linked
- All PRs related to this issue have been merged
- Test plan for this issue has been implemented and passed
- Documentation regarding this issue has been written and it has been added to the release notes, if needed
A similar issue from the mets-rw repo, with additional sample data to consider: artefactual-labs/mets-reader-writer#36