icatproject/python-icat

ingest.xslt produces invalid ICAT data files


The ICAT data produced by the reference ingest.xslt file added in #123 and used internally in icat.ingest fails to validate against the icatdata-*.xsd XML schema files.

A typical output after transformation (and reformatting for readability) may look like:

<?xml version="1.0"?>
<icatdata>
  <data>
    <dataset id="Dataset_1">
      <complete>false</complete>
      <description>Dy01Cp02 at 2.7 K</description>
      <endDate>2022-02-03T17:04:22+01:00</endDate>
      <name>testingest_inl_1</name>
      <startDate>2022-02-03T15:40:12+01:00</startDate>
      <investigation ref="_Investigation"/>
      <parameters>
        <stringValue>neutron</stringValue>
        <type name="Probe"/>
      </parameters>
      <parameters>
        <numericValue>5.3</numericValue>
        <type name="Reactor power" units="MW"/>
      </parameters>
      <parameters>
        <numericValue>2.74103</numericValue>
        <rangeBottom>2.7408</rangeBottom>
        <rangeTop>2.7414</rangeTop>
        <type name="Sample temperature" units="K"/>
      </parameters>
      <parameters>
        <numericValue>4.1357</numericValue>
        <rangeBottom>4.0573</rangeBottom>
        <rangeTop>4.1567</rangeTop>
        <type name="Magnetic field" units="T"/>
      </parameters>
      <parameters>
        <stringValue>Dy01Cp02</stringValue>
        <type name="Comment"/>
      </parameters>
      <type name="raw"/>
    </dataset>
  </data>
</icatdata>

Trying to validate that against icatdata-4.4.xsd yields the following error:

$ xmllint --noout --schema doc/icatdata-4.4.xsd -
-:35: element type: Schemas validity error : Element 'type': This element is not expected. Expected is ( parameters ).
- fails to validate

The error is caused by the order of the elements: the XSD imposes a particular order in which all many-to-one relations (e.g. type) must come before any one-to-many relations (e.g. parameters). In the output above, the dataset's type element comes after its parameters elements.
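The ordering rule can be illustrated with a small check, independent of xmllint. This is only a sketch: the function name misplaced_children and the set ONE_TO_MANY are made up for illustration, and which tags count as one-to-many relations is an assumption here; the authoritative order is whatever the icatdata-*.xsd files define.

```python
import xml.etree.ElementTree as ET

# Assumption for illustration only: the child tags of <dataset> that are
# one-to-many relations. The real list is defined by the XSD.
ONE_TO_MANY = {"parameters"}


def misplaced_children(entity, one_to_many=ONE_TO_MANY):
    """Return the tags of direct children that appear after a one-to-many
    relation although they are not one-to-many relations themselves."""
    seen_one_to_many = False
    misplaced = []
    for child in entity:
        if child.tag in one_to_many:
            seen_one_to_many = True
        elif seen_one_to_many:
            misplaced.append(child.tag)
    return misplaced


# Shortened version of the transformation output shown above.
SAMPLE = """<dataset id="Dataset_1">
  <name>testingest_inl_1</name>
  <investigation ref="_Investigation"/>
  <parameters>
    <stringValue>neutron</stringValue>
    <type name="Probe"/>
  </parameters>
  <type name="raw"/>
</dataset>"""

print(misplaced_children(ET.fromstring(SAMPLE)))  # ['type']
```

Only direct children are inspected, so the type element nested inside parameters is not reported; the trailing type of the dataset is.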

Admittedly, this issue is somewhat nitpicky: the class icat.dumpfile_xml.XMLDumpFileReader that consumes this input does not care about the element order, which is why the ingest succeeds nevertheless. Still, the XSLT provided with python-icat should generate data that is valid according to python-icat's own schema.

It turns out to be even worse than that: the orders imposed by icatdata-5.0.xsd and ingest-10.xsd are also inconsistent with each other. icatdata-5.0.xsd imposes the following order on the subelements of data: …, dataset, datasetTechnique, datasetInstrument, datasetParameter, …, while ingest-10.xsd imposes: dataset, datasetInstrument, datasetTechnique, datasetParameter. That is, the order of datasetTechnique and datasetInstrument is inverted. ingest.xslt preserves the input order during transformation, so the result is invalid here as well.
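To make the mismatch concrete, here is a rough Python sketch of what reordering the subelements of data into the icatdata order would amount to. It is not code from python-icat and not a proposed fix: reorder_data and ICATDATA_ORDER are hypothetical names, the order list contains only the four elements named above (the full sequence in icatdata-5.0.xsd is longer), and the sample input is a bare skeleton.

```python
import xml.etree.ElementTree as ET

# Partial order as quoted above for icatdata-5.0.xsd; the elided
# elements ("…") of the real sequence are intentionally left out.
ICATDATA_ORDER = [
    "dataset",
    "datasetTechnique",
    "datasetInstrument",
    "datasetParameter",
]


def reorder_data(data, order=ICATDATA_ORDER):
    """Sort the direct children of a <data> element by their rank in order.

    sorted() is stable, so elements with the same tag keep their relative
    order. Tags not listed in order are moved to the end.
    """
    rank = {tag: i for i, tag in enumerate(order)}
    data[:] = sorted(data, key=lambda elem: rank.get(elem.tag, len(order)))
    return data


# Skeleton input in the order imposed by ingest-10.xsd.
SAMPLE = """<data>
  <dataset id="Dataset_1"/>
  <datasetInstrument/>
  <datasetTechnique/>
  <datasetParameter/>
</data>"""

data = reorder_data(ET.fromstring(SAMPLE))
print([child.tag for child in data])
# ['dataset', 'datasetTechnique', 'datasetInstrument', 'datasetParameter']
```

Doing the equivalent inside an XSLT stylesheet means selecting the elements group by group instead of copying them in document order, which gives an idea of the extra complexity involved.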

We have basically two bad options to fix this:

  • fix it on the input, i.e. fix ingest-10.xsd. This is bad because it changes the input accepted by the icat.ingest module and retroactively changes a released file format version: input files that were valid ingest files version 1.0 according to python-icat 1.1.0 will be invalid in python-icat 1.2.0.
  • fix it in the transformation, i.e. change the order generated by ingest.xslt. This will make ingest.xslt needlessly complicated, only to keep compatibility with an inconsistent past.

Given that the whole icat.ingest feature was declared experimental in the python-icat 1.1.0 release, and since I believe it doesn't have many users yet, I tend to go for the first option, even though it is a breaking change.