hnesk/browse-ocrd

Support remote images

Opened this issue · 3 comments

We frequently have the use case where some (or even all) of the file references have not been downloaded yet.

But these URL references for images make OcrdBrowser stumble:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/ocrd_browser/ui/window.py", line 92, in _open
    self.page_list.set_document(self.document)
  File "/usr/local/lib/python3.7/site-packages/ocrd_browser/ui/page_browser.py", line 39, in set_document
    self.model = PageListStore(self.document)
  File "/usr/local/lib/python3.7/site-packages/ocrd_browser/ui/page_store.py", line 57, in __init__
    file_lookup = document.get_image_paths(self.file_group)
  File "/usr/local/lib/python3.7/site-packages/ocrd_browser/model/document.py", line 275, in get_image_paths
    image_paths[page_id] = self.path(images[0])
  File "/usr/local/lib/python3.7/site-packages/ocrd_browser/model/document.py", line 169, in path
    return self.directory.joinpath(other.local_filename)
  File "/usr/local/lib/python3.7/pathlib.py", line 922, in joinpath
    return self._make_child(args)
  File "/usr/local/lib/python3.7/pathlib.py", line 704, in _make_child
    drv, root, parts = self._parse_args(args)
  File "/usr/local/lib/python3.7/pathlib.py", line 658, in _parse_args
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType

That's because in …

if isinstance(other, OcrdFile):
    return self.directory.joinpath(other.local_filename)

… we do not differentiate between an OcrdFile's .local_filename (which may be empty) and its .url. The latter could still be downloaded into the document.directory under some name and returned here.

Or perhaps one could somehow make this downloading a lazy operation only to be triggered when actually needed.

BTW, that's also how most OCR-D processors handle this. They rely on Workspace.download_file, which for non-local files will automatically download from the URL and store in the workspace (without actually changing the METS but with a reproducible local path, so subsequent attempts will use the local copy).
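
The lazy download with a reproducible local path described above could look roughly like the following. This is only a sketch under stated assumptions, not the browse-ocrd or OCR-D core implementation: `FileRef` is a hypothetical stand-in for `OcrdFile` carrying just `url` and `local_filename`, and `resolve_path`, the `downloaded` subdirectory and the hash-based naming scheme are made up for illustration.

```python
import hashlib
import urllib.request
from dataclasses import dataclass
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse


@dataclass
class FileRef:
    """Hypothetical stand-in for an OcrdFile (only the two attributes discussed here)."""
    url: Optional[str] = None
    local_filename: Optional[str] = None


def resolve_path(directory: Path, file_ref: FileRef) -> Path:
    """Return a local path for file_ref, downloading from its URL only when needed."""
    if file_ref.local_filename:
        # Already local: behave like the existing joinpath call.
        return directory / file_ref.local_filename
    if not file_ref.url:
        raise ValueError("file reference has neither local_filename nor url")
    # Derive a reproducible name from the URL so later calls reuse the same copy.
    suffix = Path(urlparse(file_ref.url).path).suffix
    digest = hashlib.sha1(file_ref.url.encode("utf-8")).hexdigest()[:16]
    target = directory / "downloaded" / (digest + suffix)
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(file_ref.url) as response:
            target.write_bytes(response.read())
    return target
```

Because the download only happens inside `resolve_path`, nothing is fetched until a page's image path is actually requested, and a second request for the same URL finds the cached copy.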

hnesk commented

See support_remote_images branch for progress

kba commented

One additional feature wish: a graceful way to handle failing downloads, e.g. showing just a placeholder image instead of crashing outright. This happens in our collection for files in the PRESENTATION fileGrp, which references files by file:// URLs that are not actually usable outside our network:

<mets:fileGrp USE="PRESENTATION">
  <mets:file ID="FILE_0001_PRESENTATION" MIMETYPE="image/tiff">
    <mets:FLocat xmlns:xlink="http://www.w3.org/1999/xlink" LOCTYPE="URL" xlink:href="file:///goobi/tiff001/sbb/PPN680203753/00000001.tif"/>
  </mets:file>

I know that we should fix that on our side but that is not trivial to do and we're probably not the only ones (mis)using mets:FLocat like this.
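
A graceful fallback of this kind might be sketched as follows. The function name `fetch_or_placeholder`, its parameters and the idea of returning a pre-existing placeholder path are assumptions for illustration only, not existing browse-ocrd code; it uses nothing beyond the Python standard library.

```python
import logging
import urllib.error
import urllib.request
from pathlib import Path
from typing import Tuple

log = logging.getLogger(__name__)


def fetch_or_placeholder(url: str, target: Path, placeholder: Path) -> Tuple[Path, bool]:
    """Try to download url to target; on failure return the placeholder path instead.

    Returns (path, ok) so the caller can tell a real image from the placeholder.
    """
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            data = response.read()
    except (urllib.error.URLError, OSError, ValueError) as exc:
        # URLError covers unreachable hosts and missing file:// targets,
        # ValueError covers malformed URLs.
        log.warning("could not fetch %s: %s", url, exc)
        return placeholder, False
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return target, True
```

With a reference like the `file:///goobi/...` one above, the call would hand back the placeholder and `False` on any machine where that path does not exist, so the page list could still be built.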