Living-with-machines/alto2txt

Missing page (alto) from article not flagged


To create the _art0001.txt, the text block IDs are defined in the METS file as follows:

<mets:smLinkGrp>
    <mets:smLocatorLink xlink:href="#art0001" xlink:label="article" xlink:type="locator"/>
    <mets:smLocatorLink xlink:href="#pa0002001" xlink:label="page2 area1" xlink:type="locator"/>
    <mets:smArcLink xlink:type="arc" xlink:from="article" xlink:to="page2 area1" ARCTYPE="logicalphysical"/>
</mets:smLinkGrp>

Here #art0001 identifies the created .txt file (PUBID_YYYYMMDD_art0001.txt).
#pa0002001 is the paragraph ID of the text block within the source .xml file (PUBID_YYYYMMDD_0002.xml). The 0002 in the paragraph ID refers to the xml file number.

If ..._0002.xml can't be read or does not exist, art0001.txt is created empty with no obvious warning (though one may be in the log).
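
To make the mapping from block ID to source file concrete, here is a minimal Python sketch. It is not taken from alto2txt. The ID layout (`pa`, then a 4-digit page number, then a 3-digit area number) is an assumption inferred from the examples in this issue, and the function names `split_block_id` and `alto_filename` are hypothetical.

```python
import re

# Assumed layout, inferred from the examples in this issue:
# "pa" + 4-digit page number + 3-digit area number, optionally prefixed by "#".
BLOCK_ID = re.compile(r"^#?pa(\d{4})(\d{3})$")


def split_block_id(block_id):
    """Split a block ID such as '#pa0002001' into (page, area)."""
    match = BLOCK_ID.match(block_id)
    if match is None:
        raise ValueError(f"unexpected block ID format: {block_id!r}")
    page, area = match.groups()
    # Keep the page zero-padded because it is reused in the file name.
    return page, int(area)


def alto_filename(prefix, block_id):
    """Build the name of the page file that a block ID points at."""
    page, _area = split_block_id(block_id)
    return f"{prefix}_{page}.xml"
```

With these assumptions, `alto_filename("PUBID_YYYYMMDD", "#pa0002001")` gives `PUBID_YYYYMMDD_0002.xml`, which is the file whose absence currently goes unflagged.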


Extended example, with a hypothetical situation where an article crosses two pages:

<mets:smLinkGrp>
    <mets:smLocatorLink xlink:href="#art0001" xlink:label="article" xlink:type="locator"/>
    <mets:smLocatorLink xlink:href="#pa0001041" xlink:label="page1 area41" xlink:type="locator"/>
    <mets:smLocatorLink xlink:href="#pa0002001" xlink:label="page2 area1" xlink:type="locator"/>
    <mets:smArcLink xlink:type="arc" xlink:from="article" xlink:to="page1 area41" ARCTYPE="logicalphysical"/>
    <mets:smArcLink xlink:type="arc" xlink:from="article" xlink:to="page2 area1" ARCTYPE="logicalphysical"/>
</mets:smLinkGrp>

In this made-up example, an article spans two physical pages and therefore two source xml files. (This scenario may not actually happen; I don't know if articles can be defined this way.) If ..._0002.xml does not exist or can't be read correctly, art0001.txt is created without the text expected from that page, with no clear indication that this has happened.
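
One possible way to flag this, sketched in Python under stated assumptions: read each `smLinkGrp` in the METS file, collect the pages its text blocks refer to, and report any page file that is not on disk. This is an illustration only, not how alto2txt works internally. The function name `missing_pages`, the `prefix` and `alto_dir` parameters, and the `pa` + 4-digit page + 3-digit area ID layout are all assumptions. The sketch only checks that a file exists; a file that exists but cannot be parsed would need a separate check.

```python
import os
import re
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
# Assumed ID layout: "#pa" + 4-digit page number + 3-digit area number.
BLOCK_ID = re.compile(r"^#pa(\d{4})\d{3}$")


def missing_pages(mets_xml, prefix, alto_dir):
    """Map each article ID to the page files it needs that are not on disk."""
    root = ET.fromstring(mets_xml)
    missing = {}
    for group in root.iter(f"{{{METS_NS}}}smLinkGrp"):
        article = None
        pages = set()
        for locator in group.findall(f"{{{METS_NS}}}smLocatorLink"):
            href = locator.get(XLINK_HREF, "")
            if href.startswith("#art"):
                article = href[1:]
            else:
                match = BLOCK_ID.match(href)
                if match:
                    pages.add(match.group(1))
        if article is None:
            continue
        # Hypothetical naming scheme: <prefix>_<page>.xml, as in this issue.
        absent = sorted(
            name
            for name in (f"{prefix}_{page}.xml" for page in pages)
            if not os.path.isfile(os.path.join(alto_dir, name))
        )
        if absent:
            missing[article] = absent
    return missing
```

A non-empty result could then be turned into a warning next to the affected .txt file, so that a partially filled or empty article is not mistaken for a complete one.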