Support remote images
We frequently have the use case where some (or even all) of the file references have not been downloaded yet.
But these URL references for images make OcrdBrowser stumble:
today at 22:59:06Traceback (most recent call last):
today at 22:59:06 File "/usr/local/lib/python3.7/site-packages/ocrd_browser/ui/window.py", line 92, in _open
today at 22:59:06 self.page_list.set_document(self.document)
today at 22:59:06 File "/usr/local/lib/python3.7/site-packages/ocrd_browser/ui/page_browser.py", line 39, in set_document
today at 22:59:06 self.model = PageListStore(self.document)
today at 22:59:06 File "/usr/local/lib/python3.7/site-packages/ocrd_browser/ui/page_store.py", line 57, in __init__
today at 22:59:06 file_lookup = document.get_image_paths(self.file_group)
today at 22:59:06 File "/usr/local/lib/python3.7/site-packages/ocrd_browser/model/document.py", line 275, in get_image_paths
today at 22:59:06 image_paths[page_id] = self.path(images[0])
today at 22:59:06 File "/usr/local/lib/python3.7/site-packages/ocrd_browser/model/document.py", line 169, in path
today at 22:59:06 return self.directory.joinpath(other.local_filename)
today at 22:59:06 File "/usr/local/lib/python3.7/pathlib.py", line 922, in joinpath
today at 22:59:06 return self._make_child(args)
today at 22:59:06 File "/usr/local/lib/python3.7/pathlib.py", line 704, in _make_child
today at 22:59:06 drv, root, parts = self._parse_args(args)
today at 22:59:06 File "/usr/local/lib/python3.7/pathlib.py", line 658, in _parse_args
today at 22:59:06 a = os.fspath(a)
today at 22:59:06TypeError: expected str, bytes or os.PathLike object, not NoneType
That's because in …
browse-ocrd/ocrd_browser/model/document.py
Lines 175 to 176 in d6ff3f3
… we do not differentiate between an OcrdFile's .local_filename (which may be empty) and its .url. The latter could still be downloaded into the document.directory under some name and returned here.
Or perhaps one could somehow make this downloading a lazy operation only to be triggered when actually needed.
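To make the lazy variant concrete, here is a minimal sketch. It does not use the real OcrdFile or Document classes: FileRef, LazyPath and the fetch callable are hypothetical names made up for illustration, and only the two attributes local_filename and url are taken from the discussion above.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional


@dataclass
class FileRef:
    """Hypothetical stand-in for an OcrdFile: either field may be unset."""
    local_filename: Optional[str] = None
    url: Optional[str] = None


class LazyPath:
    """Defers a possibly expensive download until the path is actually needed."""

    def __init__(self, directory: Path, ref: FileRef,
                 fetch: Callable[[str, Path], Path]):
        self._directory = directory
        self._ref = ref
        self._fetch = fetch  # assumed signature: (url, directory) -> local path
        self._resolved: Optional[Path] = None

    def resolve(self) -> Path:
        if self._resolved is None:
            if self._ref.local_filename:
                # Already in the workspace: plain join, no network access.
                self._resolved = self._directory / self._ref.local_filename
            elif self._ref.url:
                # Remote only: download on first access, then remember the result.
                self._resolved = self._fetch(self._ref.url, self._directory)
            else:
                raise ValueError("file reference has neither local_filename nor url")
        return self._resolved
```

With something like this, building the page list would only create LazyPath objects, and resolve() would be called when a page is actually rendered. The explicit ValueError also replaces the TypeError from pathlib in the traceback above with a message that names the real problem.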
BTW, that's also how most OCR-D processors handle this. They rely on Workspace.download_file, which for non-local files will automatically download from the URL and store in the workspace (without actually changing the METS but with a reproducible local path, so subsequent attempts will use the local copy).
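The "reproducible local path" idea could look roughly like the sketch below. This is not the actual Workspace.download_file implementation: the function names and the path layout (fileGrp directory plus file ID plus the URL's extension) are assumptions chosen for illustration, and only the standard library is used.

```python
import os
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def local_target(directory: Path, file_grp: str, file_id: str, url: str) -> Path:
    """Derive a deterministic path inside the workspace for a remote file."""
    suffix = Path(urlparse(url).path).suffix  # keep the extension, e.g. ".tif"
    return directory / file_grp / (file_id + suffix)


def download_once(directory: Path, file_grp: str, file_id: str, url: str) -> Path:
    """Download url unless a previous call already stored it locally."""
    target = local_target(directory, file_grp, file_id, url)
    if target.exists():
        return target  # an earlier download is reused, no network access
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an aborted download never
    # leaves a truncated file under the final name.
    part = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(part, "wb") as out:
        shutil.copyfileobj(response, out)
    os.replace(part, target)
    return target
```

Because the target path depends only on the file group, the file ID and the URL, the METS does not need to be rewritten: the next lookup computes the same path and finds the local copy.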
See the support_remote_images branch for progress.
One additional feature wish: A graceful way to handle failing downloads, e.g. showing just a placeholder image instead of crashing outright. This does happen in our collection for files in the PRESENTATION fileGrp, which references files by file:// URLs that are not actually usable outside the network:
<mets:fileGrp USE="PRESENTATION">
  <mets:file ID="FILE_0001_PRESENTATION" MIMETYPE="image/tiff">
    <mets:FLocat xmlns:xlink="http://www.w3.org/1999/xlink" LOCTYPE="URL" xlink:href="file:///goobi/tiff001/sbb/PPN680203753/00000001.tif"/>
  </mets:file>
I know that we should fix that on our side, but that is not trivial to do and we're probably not the only ones (mis)using mets:FLocat like this.
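For the placeholder wish, something along these lines might be enough. It is only a sketch: path_or_placeholder, the fetch callable and the placeholder path are hypothetical names, not existing browse-ocrd API.

```python
import logging
from pathlib import Path
from typing import Callable, Tuple

log = logging.getLogger(__name__)


def path_or_placeholder(fetch: Callable[[], Path],
                        placeholder: Path) -> Tuple[Path, bool]:
    """Return (path, True) on success or (placeholder, False) if fetching fails."""
    try:
        return fetch(), True
    except (OSError, ValueError) as error:
        # urllib.error.URLError and HTTPError are subclasses of OSError, so
        # unreachable hosts and unusable file:// URLs both end up here.
        log.warning("could not fetch image, showing placeholder: %s", error)
        return placeholder, False
```

The boolean lets the UI mark such pages as missing rather than silently showing the placeholder as if it were the real image.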