Missing page (alto) from article not flagged
To create the _art0001.txt file, the text block IDs are defined in the METS file as follows:
<mets:smLinkGrp>
<mets:smLocatorLink xlink:href="#art0001" xlink:label="article" xlink:type="locator"/>
<mets:smLocatorLink xlink:href="#pa0002001" xlink:label="page2 area1" xlink:type="locator"/>
<mets:smArcLink xlink:type="arc" xlink:from="article" xlink:to="page2 area1" ARCTYPE="logicalphysical"/>
</mets:smLinkGrp>
Where #art0001 defines the created .txt file (PUBID_YYYYMMDD_art0001.txt) and #pa0002001 defines the paragraph ID for the textblock within the source .xml file (PUBID_YYYYMMDD_0002.xml). The 0002 in the paragraph ID refers to the xml file number.
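A minimal sketch of that ID-to-file mapping, assuming the ID layout is "pa" + 4-digit page number + 3-digit area number (inferred only from the labels "page2 area1" and "page1 area41" in the METS snippets, not confirmed anywhere). The function names are hypothetical and not taken from the project's code.

```python
import re

# Assumed layout: "pa" + 4-digit page number + 3-digit area number,
# with an optional leading "#" as written in xlink:href.
BLOCK_ID = re.compile(r"^#?pa(\d{4})(\d{3})$")


def split_block_id(href):
    """Split a text block ID into (page number as 4-digit string, area number)."""
    match = BLOCK_ID.match(href)
    if match is None:
        raise ValueError(f"unrecognised text block ID: {href!r}")
    return match.group(1), int(match.group(2))


def page_filename(prefix, href):
    """Build the source page file name, e.g. PUBID_YYYYMMDD_0002.xml."""
    page, _area = split_block_id(href)
    return f"{prefix}_{page}.xml"
```

For example, page_filename("PUBID_YYYYMMDD", "#pa0002001") gives the file name PUBID_YYYYMMDD_0002.xml.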
If ..._0002.xml can't be read or does not exist, then art0001.txt will be created empty with no obvious warning (it may be recorded in the log, though).
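One way the failure could be surfaced is to test each source page file before extracting text from it. The sketch below is only an illustration using Python's standard library; page_problem is a hypothetical helper, not something that exists in the current code.

```python
import xml.etree.ElementTree as ET


def page_problem(path):
    """Return a short reason why the page file is unusable, or None if it parses."""
    try:
        ET.parse(path)
    except FileNotFoundError:
        return "missing"
    except OSError as exc:
        # Permission errors, I/O errors, path is a directory, etc.
        return f"unreadable ({exc.strerror})"
    except ET.ParseError as exc:
        return f"malformed XML ({exc})"
    return None
```

A caller could then warn, or refuse to write the article .txt file, whenever page_problem returns anything other than None.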
Extended example with a hypothetical situation where an article crosses two pages:
<mets:smLinkGrp>
<mets:smLocatorLink xlink:href="#art0001" xlink:label="article" xlink:type="locator"/>
<mets:smLocatorLink xlink:href="#pa0001041" xlink:label="page1 area41" xlink:type="locator"/>
<mets:smLocatorLink xlink:href="#pa0002001" xlink:label="page2 area1" xlink:type="locator"/>
<mets:smArcLink xlink:type="arc" xlink:from="article" xlink:to="page1 area41" ARCTYPE="logicalphysical"/>
<mets:smArcLink xlink:type="arc" xlink:from="article" xlink:to="page2 area1" ARCTYPE="logicalphysical"/>
</mets:smLinkGrp>
In this made-up example, an article spans two physical pages and therefore two source xml files. (This scenario may not actually happen; I don't know if articles can be defined this way.) If ..._0002.xml does not exist or can't be read correctly, art0001.txt is still created, but with the text from that page missing and no clear indication that this has happened.
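A rough sketch of a check that would flag this case: collect the page numbers each article's link group refers to, then warn about any page file that is absent. It assumes the article locator sits inside the same mets:smLinkGrp as its text blocks (as in the extended example above), the "pa" + 4-digit page + 3-digit area ID layout, and the PUBID_YYYYMMDD_NNNN.xml naming. The function names are hypothetical, and this is not how the current code works.

```python
import logging
import os
import re
import xml.etree.ElementTree as ET

METS = "{http://www.loc.gov/METS/}"        # METS namespace
XLINK = "{http://www.w3.org/1999/xlink}"   # XLink namespace
PAGE_AREA = re.compile(r"^pa(\d{4})\d{3}$")  # assumed text block ID layout


def pages_per_article(mets_root):
    """Map each article ID to the sorted page numbers its link group references."""
    pages = {}
    for group in mets_root.iter(METS + "smLinkGrp"):
        article_ids, page_numbers = [], set()
        for locator in group.iter(METS + "smLocatorLink"):
            target = locator.get(XLINK + "href", "").lstrip("#")
            match = PAGE_AREA.match(target)
            if match:
                page_numbers.add(match.group(1))
            elif target.startswith("art"):
                article_ids.append(target)
        for article_id in article_ids:
            pages.setdefault(article_id, set()).update(page_numbers)
    return {article_id: sorted(nums) for article_id, nums in pages.items()}


def find_missing_pages(mets_root, directory, prefix):
    """Return {article ID: [missing page numbers]} and log a warning for each."""
    missing = {}
    for article_id, page_numbers in pages_per_article(mets_root).items():
        absent = [
            number for number in page_numbers
            if not os.path.isfile(os.path.join(directory, f"{prefix}_{number}.xml"))
        ]
        if absent:
            logging.warning(
                "%s_%s.txt will be incomplete: missing page file(s) %s",
                prefix, article_id, ", ".join(absent),
            )
            missing[article_id] = absent
    return missing
```

Running find_missing_pages on the parsed METS root before writing any .txt files would make an incomplete or empty article show up as an explicit warning rather than only as missing text.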