hnesk/browse-ocrd

other MIME types

Closed this issue · 2 comments

I have not dug into why exactly, but trying to open the PAGE-XML view on a workspace with ALTO files (text/xml) gives this:

  File "ocrd_browser/view/base.py", line 66, in <lambda>
    configurator.connect('changed', lambda _source, *value: self.config_changed(name, value))
  File "ocrd_browser/view/xml.py", line 50, in config_changed
    self.reload()
  File "ocrd_browser/view/base.py", line 86, in reload
    self.current = self.document.page_for_id(self.page_id, self.use_file_group)
  File "ocrd_browser/model/document.py", line 356, in page_for_id
    image, _, _ = self.workspace.image_from_page(pcgts.get_Page(), page_id)
  File "ocrd/workspace.py", line 384, in image_from_page
    page_image = self._resolve_image_as_pil(page.imageFilename)
  File "ocrd/workspace.py", line 295, in _resolve_image_as_pil
    pil_image = Image.open(image_filename)
  File "PIL/Image.py", line 2930, in open
    raise UnidentifiedImageError(
PIL.UnidentifiedImageError: cannot identify image file 'FULLTEXT/FILE_0001_FULLTEXT'

Looks like it tried to interpret this as an image (and make a PAGE-XML for it).

hnesk commented

Sorry, I can't reproduce that. Do you have an example workspace?

> Do you have an example workspace?

I do:

ocrd workspace clone -a "https://digital.slub-dresden.de/oai/?verb=GetRecord&metadataPrefix=mets&identifier=oai:de:slub-dresden:db:id-39946221X-18560530"
browse-ocrd mets.xml

(Here, FULLTEXT contains ALTO files, correctly specified as text/xml, which our new document.page_for_id tries to pick up as PAGE-XML. However, with the current version I no longer see the above crash; now the PageView and TextView simply forbid selecting any fileGrps.)
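Since a generic text/xml MIME type cannot tell ALTO from PAGE-XML, one possible guard is to peek at the root element's namespace before treating a file as PAGE-XML. The following is only a rough sketch of that idea, not what browse-ocrd actually does: the function name `sniff_xml_flavor` and the returned labels are made up for illustration, and it assumes the usual PAGE and ALTO namespace prefixes.

```python
import io
import xml.etree.ElementTree as ET

# Namespace prefixes; the trailing version part differs between schema releases.
PAGE_NS_PREFIX = "http://schema.primaresearch.org/PAGE/gts/pagecontent/"
ALTO_NS_PREFIX = "http://www.loc.gov/standards/alto/"


def sniff_xml_flavor(fileobj):
    """Hypothetical helper: return 'page', 'alto' or 'unknown' for a binary file object."""
    try:
        # Only the first "start" event (the root element) is needed,
        # so the rest of the document is never parsed.
        _event, root = next(ET.iterparse(fileobj, events=("start",)))
    except (ET.ParseError, StopIteration):
        # Not well-formed XML (e.g. an image file) or empty input.
        return "unknown"
    # ElementTree encodes namespaced tags as "{namespace}localname".
    namespace = root.tag[1:].partition("}")[0] if root.tag.startswith("{") else ""
    if namespace.startswith(PAGE_NS_PREFIX):
        return "page"
    if namespace.startswith(ALTO_NS_PREFIX):
        return "alto"
    return "unknown"


# Minimal in-memory examples standing in for files from a workspace.
alto_doc = b'<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#"><Layout/></alto>'
page_doc = (
    b'<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">'
    b'<Page imageFilename="a.tif"/></PcGts>'
)
print(sniff_xml_flavor(io.BytesIO(alto_doc)))
print(sniff_xml_flavor(io.BytesIO(page_doc)))
```

With a check like this, a text/xml file that turns out to be ALTO could be skipped (or routed to a different view) instead of being handed to the PAGE-XML code path.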