archivematica/Issues

Problem: mets-reader-writer places its own restrictions on METS profiles it can read and then validate (metsrw)

ross-spencer opened this issue · 1 comment

Expected behaviour

It may be desirable (convenient?) to load any METS document into mets-reader-writer to perform validation against the METS schema.

Current behaviour

mets-reader-writer limits what it can import: while loading, it checks for the existence of certain properties within the METS document and raises an error if they are missing. The reader-writer could potentially be more general purpose.

As an example:

When mets-reader-writer loads XML from a file, it calls the following functions in turn:

  • fromtree, which then calls
  • _parse_tree.

In _parse_tree we seek the existence of a physical structMap and raise an error if one isn't found: raise exceptions.ParseError("No physical structMap found.")

The METS schema (1.12), however, does not require a structMap to carry any particular TYPE, so looking specifically for a physical structMap is an additional stipulation that affects our ability to load any particular METS.
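
One way a more lenient reader could behave is to prefer a physical structMap but fall back to whatever structMap is present, leaving the decision to raise an error to the caller. The following is a minimal sketch using only the standard library. `find_structmap` and its `preferred_type` parameter are hypothetical names for illustration and are not part of metsrw.

```python
import xml.etree.ElementTree as ET

METS_NS = {"mets": "http://www.loc.gov/METS/"}


def find_structmap(root, preferred_type="physical"):
    """Return the preferred structMap if present, else the first one, else None.

    Hypothetical helper for illustration only; not part of metsrw.
    """
    structmaps = root.findall("mets:structMap", METS_NS)
    for structmap in structmaps:
        if structmap.get("TYPE") == preferred_type:
            return structmap
    # Fall back to whatever structMap the document provides instead of raising.
    return structmaps[0] if structmaps else None


doc = ET.fromstring(
    '<mets:mets xmlns:mets="http://www.loc.gov/METS/">'
    '<mets:structMap TYPE="logical"><mets:div TYPE="book"/></mets:structMap>'
    "</mets:mets>"
)
chosen = find_structmap(doc)
print(chosen.get("TYPE"))  # prints "logical"
```

Returning None rather than raising lets the caller decide whether a missing structMap is fatal for its use case.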

Steps to reproduce

A sample METS document, containing only a logical structMap, that will fail to load is as follows:

<?xml version="1.0" encoding="utf-8"?>
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="logical">
    <mets:div TYPE="book" LABEL="How to create a hierarchical book">
      <mets:div TYPE="page" LABEL="Cover">
        <mets:fptr FILEID="cover.jpg"/>
      </mets:div>
      <mets:div TYPE="page" LABEL="Inside cover">
        <mets:fptr FILEID="inside_cover.jpg"/>
      </mets:div>
      <mets:div TYPE="chapter" LABEL="Chapter 1">
        <mets:div TYPE="page" LABEL="Page 1">
          <mets:fptr FILEID="page_01.jpg"/>
        </mets:div>
        <mets:div TYPE="subchapter" LABEL="Subchapter 1.1">
          <mets:div TYPE="page" LABEL="Page 2">
            <mets:fptr FILEID="page_02.jpg"/>
          </mets:div>
          <mets:div TYPE="page" LABEL="Page 3">
            <mets:fptr FILEID="page_03.jpg"/>
          </mets:div>
          <mets:div TYPE="page" LABEL="Page 4">
            <mets:fptr FILEID="page_04.jpg"/>
          </mets:div>
          <mets:div TYPE="subchapter" LABEL="Subchapter 1.2">
            <mets:div TYPE="page" LABEL="Page 5">
              <mets:fptr FILEID="page_05.jpg"/>
            </mets:div>
            <mets:div TYPE="page" LABEL="Page 6">
              <mets:fptr FILEID="page_06.jpg"/>
            </mets:div>
            <mets:div TYPE="page" LABEL="Page 7">
              <mets:fptr FILEID="page_07.jpg"/>
            </mets:div>
          </mets:div>
          <!-- Subchapter 1.2 -->
        </mets:div>
        <!-- Subchapter 1.1 -->
      </mets:div>
      <!-- Chapter 1 -->
      <!-- Chapters 2 and 3, each with their own subchapters as in Chapter 1, omitted from this example. -->
      <mets:div TYPE="afterword" LABEL="Afterword">
        <mets:div TYPE="page" LABEL="Page 20">
          <mets:fptr FILEID="page_20.jpg"/>
        </mets:div>
      </mets:div>
      <!-- afterword -->
      <mets:div TYPE="index" LABEL="Index">
        <mets:div TYPE="page" LABEL="Index, page 1">
          <mets:fptr FILEID="index_01.jpg"/>
        </mets:div>
        <mets:div TYPE="page" LABEL="Index, page 2">
          <mets:fptr FILEID="index_02.jpg"/>
        </mets:div>
      </mets:div>
      <!-- index -->
      <mets:div TYPE="page" LABEL="Back cover">
        <mets:fptr FILEID="back_cover.jpg"/>
      </mets:div>
      <!-- back cover -->
    </mets:div>
    <!-- book -->
  </mets:structMap>
</mets:mets>

An attempt to load this will result in the following stack trace:

Traceback (most recent call last):
  File "validate-mets.py", line 79, in <module>
    main()
  File "validate-mets.py", line 75, in main
    use_mets(args.mets)
  File "validate-mets.py", line 47, in use_mets
    mets = load_mets(filename[0])
  File "validate-mets.py", line 35, in load_mets
    mets = metsrw.METSDocument.fromfile(filename)  # Reads a file
  File "/usr/local/lib/python2.7/dist-packages/metsrw/mets.py", line 563, in fromfile
    return cls.fromtree(etree.parse(path, parser=parser))
  File "/usr/local/lib/python2.7/dist-packages/metsrw/mets.py", line 587, in fromtree
    mets._parse_tree(tree)
  File "/usr/local/lib/python2.7/dist-packages/metsrw/mets.py", line 506, in _parse_tree
    raise exceptions.ParseError("No physical structMap found.")
metsrw.exceptions.ParseError: No physical structMap found.
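
To tell in advance which documents will hit this error, one could list the TYPE of each structMap before handing the file to metsrw. This is a rough sketch that assumes only the standard library. `structmap_types` and `has_physical_structmap` are made-up helper names, and `SAMPLE` is a shortened version of the document above.

```python
import xml.etree.ElementTree as ET

METS_NS = {"mets": "http://www.loc.gov/METS/"}


def structmap_types(xml_text):
    """List the TYPE attribute of every top-level structMap (None if unset)."""
    root = ET.fromstring(xml_text)
    return [sm.get("TYPE") for sm in root.findall("mets:structMap", METS_NS)]


def has_physical_structmap(xml_text):
    """True if at least one structMap is declared with TYPE="physical"."""
    return "physical" in structmap_types(xml_text)


# Shortened version of the sample document from this issue.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="logical">
    <mets:div TYPE="book" LABEL="How to create a hierarchical book">
      <mets:div TYPE="page" LABEL="Cover">
        <mets:fptr FILEID="cover.jpg"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>
"""

print(structmap_types(SAMPLE))         # prints ['logical']
print(has_physical_structmap(SAMPLE))  # prints False
```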

Your environment (version of Archivematica, OS version, etc)

metsrw-0.3.7.

Additional context

Validation could be done via mets-rw for custom structmaps rather than via xmllint in archivematicaVerifyMETS.sh.


For Artefactual use:
Please make sure these steps are taken before moving this issue from Review to Verified in Waffle:

  • All PRs related to this issue are properly linked 👍
  • All PRs related to this issue have been merged 👍
  • Test plan for this issue has been implemented and passed 👍
  • Documentation regarding this issue has been written and it has been added to the release notes, if needed 👍

A similar issue from the mets-rw repo, with additional sample data to consider: artefactual-labs/mets-reader-writer#36