slub/mets-mods2tei

Realize an empty publication date if METS header is absent instead of failing with a Python error

tboenig opened this issue · 6 comments

Hi @wrznr,

I am using your program with data from the SBB.
Here is an example:
mm2tei -o "https://oai.sbb.berlin/oai/?verb=GetRecord&metadataPrefix=mets&identifier=oai:digital.staatsbibliothek-berlin.de:PPN66438790X" >test.tei.xml

Another example, from the SUB Göttingen:
mm2tei -o "https://gdz.sub.uni-goettingen.de/mets/PPN228873541.mets.xml" >test.tei.xml
Here we find the same SSL problem.

Is the SSL problem a problem on the SBB side or a problem in your program?

wrznr commented

Hi @tboenig, could you pls. post some kind of error message to make it easier to get an idea of the error?

Here is the SBB error:

Traceback (most recent call last):
  File "/usr/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/usr/lib/python3.6/http/client.py", line 1254, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1300, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1249, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/usr/lib/python3.6/http/client.py", line 974, in send
    self.connect()
  File "/usr/lib/python3.6/http/client.py", line 1415, in connect
    server_hostname=server_hostname)
  File "/usr/lib/python3.6/ssl.py", line 407, in wrap_socket
    _context=self, _session=session)
  File "/usr/lib/python3.6/ssl.py", line 817, in __init__
    self.do_handshake()
  File "/usr/lib/python3.6/ssl.py", line 1077, in do_handshake
    self._sslobj.do_handshake()
  File "/usr/lib/python3.6/ssl.py", line 689, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "mets-mods2tei/env/lib/python3.6/site-packages/mets_mods2tei/scripts/mets_mods2tei.py", line 27, in cli
    f = urlopen(mets)
  File "/usr/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.6/urllib/request.py", line 526, in open
    response = self._open(req, data)
  File "/usr/lib/python3.6/urllib/request.py", line 544, in _open
    '_open', req)
  File "/usr/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.6/urllib/request.py", line 1361, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "/usr/lib/python3.6/urllib/request.py", line 1320, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "mets-mods2tei/env/bin/mm2tei", line 8, in <module>
    sys.exit(cli())
  File "mets-mods2tei/env/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "mets-mods2tei/env/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "mets-mods2tei/env/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "mets-mods2tei/env/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "mets-mods2tei/env/lib/python3.6/site-packages/mets_mods2tei/scripts/mets_mods2tei.py", line 29, in cli
    f = open(mets, "rb")
FileNotFoundError: [Errno 2] No such file or directory: 'https://oai.sbb.berlin/oai/?verb=GetRecord&metadataPrefix=mets&identifier=oai:digital.staatsbibliothek-berlin.de:PPN66438790X'

And here is the SUB Göttingen error.
Sorry, it is not the same SSL error after all:

Traceback (most recent call last):
  File "mets-mods2tei/env/bin/mm2tei", line 8, in <module>
    sys.exit(cli())
  File "mets-mods2tei/env/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "mets-mods2tei/env/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "mets-mods2tei/env/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "mets-mods2tei/env/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "mets-mods2tei/env/lib/python3.6/site-packages/mets_mods2tei/scripts/mets_mods2tei.py", line 35, in cli
    mets.fromfile(f)
  File "mets-mods2tei/env/lib/python3.6/site-packages/mets_mods2tei/api/mets.py", line 112, in fromfile
    self.__spur()
  File "mets-mods2tei/env/lib/python3.6/site-packages/mets_mods2tei/api/mets.py", line 233, in __spur
    self.encoding_date = header.get_CREATEDATE().isoformat()
AttributeError: 'NoneType' object has no attribute 'get_CREATEDATE'

wrznr commented

The former problem is most likely a problem at the host (SBB) or your own institution. Sorry.
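
A side note on the first traceback: the final FileNotFoundError only appears because the CLI falls back to open() after urlopen() has failed, which hides the underlying certificate error. Below is a minimal sketch of one way to avoid that masking. The helpers `is_remote` and `open_mets_source` are hypothetical names, not part of mets-mods2tei; the idea is to decide by URL scheme first, so that a network error such as URLError surfaces unchanged.

```python
from urllib.parse import urlparse
from urllib.request import urlopen


def is_remote(source):
    """Return True if source looks like an http(s) URL rather than a local path."""
    return urlparse(source).scheme in ("http", "https")


def open_mets_source(source):
    """Hypothetical helper: open a METS source as a binary file-like object.

    Remote sources go through urlopen, so an SSL failure is raised as
    urllib.error.URLError instead of being masked by a later
    FileNotFoundError from open().
    """
    if is_remote(source):
        return urlopen(source)
    return open(source, "rb")
```

With this split, a CERTIFICATE_VERIFY_FAILED error would be the last thing reported for a URL, which makes it easier to see that the cause lies with the certificate chain rather than with a missing file.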

The latter problem is caused by the missing metsHdr element in the METS file you want to process (cf. https://digital.slub-dresden.de/oai/?verb=GetRecord&metadataPrefix=mets&identifier=oai:de:slub-dresden:db:id-453779263). The METS file from Göttingen contains no information about when it was created. But such information is mandatory for valid DTABf. If you have ideas on how to fix this, I will gladly implement them.
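
For illustration, the AttributeError comes from calling get_CREATEDATE() on a header that is None. A minimal sketch of a guard follows, using a stand-in class: `FakeHeader` and `encoding_date_or_empty` are hypothetical names, and only get_CREATEDATE() is taken from the traceback. Whether an empty string is an acceptable value downstream is an assumption here, not something the code base confirms.

```python
from datetime import datetime


class FakeHeader:
    """Stand-in for the parsed metsHdr object (hypothetical, for illustration)."""

    def __init__(self, createdate=None):
        self._createdate = createdate

    def get_CREATEDATE(self):
        return self._createdate


def encoding_date_or_empty(header):
    """Return CREATEDATE in ISO format, or "" if the header or the date is missing."""
    if header is None:
        return ""
    created = header.get_CREATEDATE()
    return created.isoformat() if created is not None else ""
```

The same guard covers both cases: a METS file without metsHdr and a metsHdr without a CREATEDATE attribute.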

Hi @wrznr,

You wrote: "If you have ideas how to fix it, I will be happy to implement them."
My suggestion:

  • Ignore the empty or missing metsHdr and write an empty <date type="publication"/>, or print an error message on the CLI saying that the METS file is not valid. I think a combination of both would be ideal.
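
The combination could be sketched roughly as follows. This is only an illustration under assumptions: it uses the standard library's xml.etree.ElementTree as a stand-in for the XML library the program uses, the function name `add_publication_date` is hypothetical, and the parent element is whatever node the date belongs under in the TEI header.

```python
import sys
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def add_publication_date(parent, iso_date):
    """Append <date type="publication"> to parent; warn on stderr if no date is known."""
    date = ET.SubElement(parent, "{%s}date" % TEI_NS, {"type": "publication"})
    if iso_date:
        date.text = iso_date
    else:
        # The element stays empty, but the user is told why.
        print(
            "warning: METS file has no metsHdr/CREATEDATE; "
            "writing an empty publication date",
            file=sys.stderr,
        )
    return date
```

This way the conversion still finishes, and the message on the CLI tells the user that the input file is incomplete.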

@tboenig I have difficulty implementing these fallbacks/error signals for missing headers, because I cannot find precise documentation for DTABf or for TEI proper.

For example, one of the dependent elements of metsHdr is the mets:agent, which is used for encodingDesc:

self.encoding_desc = list(filter(lambda x: x.get_OTHERTYPE() == "SOFTWARE", header.get_agent()))[0].get_name()

(I don't know why we throw away all but the first agent and all but its name, but granted.)
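
As an aside, the `[0]` in that line would also raise an IndexError when a header exists but has no SOFTWARE agent. A minimal sketch of a lookup that tolerates both a missing header and a missing agent, with stand-in classes: `FakeAgent`, `FakeHeader` and `first_software_agent_name` are hypothetical names, and only get_agent(), get_OTHERTYPE() and get_name() come from the line above.

```python
class FakeAgent:
    """Stand-in for a parsed mets:agent (hypothetical, for illustration)."""

    def __init__(self, othertype, name):
        self._othertype = othertype
        self._name = name

    def get_OTHERTYPE(self):
        return self._othertype

    def get_name(self):
        return self._name


class FakeHeader:
    """Stand-in for a parsed metsHdr holding a list of agents (hypothetical)."""

    def __init__(self, agents):
        self._agents = agents

    def get_agent(self):
        return self._agents


def first_software_agent_name(header):
    """Return the name of the first SOFTWARE agent, or None if there is none."""
    if header is None:
        return None
    agents = header.get_agent() or []
    return next(
        (agent.get_name() for agent in agents if agent.get_OTHERTYPE() == "SOFTWARE"),
        None,
    )
```

Like the original line, this keeps only the first matching agent and only its name; it just returns None instead of failing.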

This information usually ends up in simple p elements:

encoding_desc = self.tree.xpath('//tei:encodingDesc', namespaces=ns)[0]
encoding_desc_details = etree.SubElement(encoding_desc, "%sp" % TEI)
encoding_desc_details.text = "Encoded with the help of %s." % creator
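
One conceivable fallback, sketched under assumptions: skip the paragraph entirely when no creator is known. The function name `describe_encoding` is hypothetical, xml.etree.ElementTree stands in for the XML library used above, and whether an encodingDesc without this paragraph is valid DTABf is exactly the open question, so this is not a recommendation.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"


def describe_encoding(encoding_desc, creator):
    """Append the 'Encoded with the help of ...' paragraph if a creator is known.

    Returns the new <p> element, or None when nothing was added.
    """
    if not creator:
        return None
    details = ET.SubElement(encoding_desc, "%sp" % TEI)
    details.text = "Encoded with the help of %s." % creator
    return details
```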

Now, according to DTABf there is supposed to be an intervening editorialDecl here. But the only reference I can find for that is in the (IIUC) Examples schema.

So what is the correct representation here, and what should I put in as a fallback in case the metsHdr is missing?