base file group other than OCR-D-IMG
I have a METS here which does contain a fileGrp OCR-D-IMG, but not comprising all physical pages. This gives me:
INFO ocrd.resolver.workspace_from_nothing - Writing METS to /tmp/ocrd-core-ttugo0kk/mets.xml
Traceback (most recent call last):
File "ocrd_browser/ui/window.py", line 88, in _open
self.page_list.set_document(self.document)
File "ocrd_browser/ui/page_browser.py", line 39, in set_document
self.model = PageListStore(self.document)
File "ocrd_browser/ui/page_store.py", line 56, in __init__
file = str(file_lookup[page_id])
KeyError: 'f00037100714864
So I dug into ocrd_browser.ui.page_store and thought it might be sufficient to just check page_id in file_lookup before appending a row to the Gtk list. But this raises bigger questions:
- Why should the initial view be restricted to pages contained in OCR-D-IMG at all? This could easily just be empty. With practical library systems, the initial image fileGrp could realistically be called MAX, ORIGINAL or something else instead. My understanding of this program is that it should try to present a view of all physical pages (at least initially, before selecting a fileGrp explicitly). So how about presenting all structMap entries sorted by their @ORDER (if present) or @ID with the first fptr that shows up?
- How do you change to a different fileGrp? ui.view.base has a View.use_file_group property fixed to OCR-D-IMG.
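To make the first suggestion concrete, here is a minimal sketch of listing all physical pages ordered by @ORDER (falling back to @ID) together with the first fptr of each page. It uses only the standard library on a made-up METS fragment; the function name `physical_pages`, the sample IDs and the sort rule for pages without a numeric @ORDER are assumptions for illustration, not code from ocrd_browser.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}

# Hypothetical minimal METS fragment, for illustration only
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="P2" ORDER="2"><mets:fptr FILEID="MAX_0002"/></mets:div>
      <mets:div TYPE="page" ID="P1" ORDER="1"><mets:fptr FILEID="MAX_0001"/><mets:fptr FILEID="BIN_0001"/></mets:div>
      <mets:div TYPE="page" ID="P3" ORDER="3"/>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def physical_pages(mets_xml):
    """Return (page_id, first_file_id_or_None) tuples in display order."""
    root = ET.fromstring(mets_xml)
    pages = root.findall(
        ".//mets:structMap[@TYPE='PHYSICAL']//mets:div[@TYPE='page']", NS
    )

    def sort_key(div):
        order = div.get("ORDER")
        # Pages with a numeric @ORDER sort first, numerically;
        # all others are sorted after them by @ID (assumed rule).
        if order is not None and order.isdigit():
            return (0, int(order), "")
        return (1, 0, div.get("ID", ""))

    result = []
    for div in sorted(pages, key=sort_key):
        fptr = div.find("mets:fptr", NS)  # first fptr that shows up, if any
        file_id = fptr.get("FILEID") if fptr is not None else None
        result.append((div.get("ID"), file_id))
    return result
```

A page without any fptr (P3 in the sample) still appears in the result, with `None` as its file, so it can be shown as an empty entry instead of raising a KeyError.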
That's a valid question, and I had the problem myself (original fileGrp not named OCR-D-IMG).
Solutions to your questions:
- The fileGrp to display with PageListStore should not be hardcoded to OCR-D-IMG, but should be selectable like in ViewImages. As a default it should try:
  - The first of a (configurable) list of preferred fileGroups to display as images (OCR-D-IMG, MAX, ORIGINAL) which have a mime-type matching image/*.
  - The first fileGroup (sorted by ???? maybe string length, because derived images usually have more complex name than the original?) which has a mime-type matching image/*.
  - If there is no match according to the page_id in file_lookup logic you described, display a "missing image" icon.
- View.use_file_group is overridden in ViewXml and ViewImages. OCR-D-IMG is just the default value for all possible views. The use_file_group implementations in these views are actually quite robust and are taking the user selection and availability of the selected fileGrp into account. I think the way to go is to base PageListStore on the same implementation.
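The default selection from the first point could be sketched roughly as follows. This is only an illustration under assumptions: the input is a plain dict from fileGrp name to the mime-types of its files, and `pick_default_file_group` and `PREFERRED_GROUPS` are hypothetical names, not part of ocrd_browser.

```python
from fnmatch import fnmatch

# Assumed default; intended to be configurable
PREFERRED_GROUPS = ["OCR-D-IMG", "MAX", "ORIGINAL"]


def pick_default_file_group(mimetypes_by_group, preferred=PREFERRED_GROUPS):
    """Pick a fileGrp to show initially, or None if no group has images.

    mimetypes_by_group maps each fileGrp name to the mime-types of its files.
    """
    # Only groups containing at least one image/* file are candidates
    image_groups = [
        group
        for group, mimetypes in mimetypes_by_group.items()
        if any(fnmatch(m, "image/*") for m in mimetypes)
    ]

    # 1. The first preferred group that actually contains images
    for group in preferred:
        if group in image_groups:
            return group

    # 2. Otherwise the group with the shortest name (ties broken
    #    alphabetically), assuming derived groups have longer names
    if image_groups:
        return min(image_groups, key=lambda g: (len(g), g))

    return None
```

For example, with groups `OCR-D-IMG-BIN` (image/png) and `MAX` (image/jpeg) the sketch returns `MAX`, because it is on the preferred list; with `OCR-D-IMG-BIN` and `DEFAULT` it falls back to the shorter name `DEFAULT`.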
What do you think?
2: Oh, I see! Yes, sounds reasonable to base the initial view on that as well.
1: Yes, this would be very intuitive behaviour and easy to use IMHO. Or (instead of the second criterion) one could even start with an empty view if the first criterion (fixed/configured list of preferred groups) does not yield any images.
I hope I have fixed most of the points now, except the selectable file group, which is quite difficult to implement (but absolutely worth it) and is now handled in #9.
> Why should the initial view be restricted to pages contained in OCR-D-IMG
The initial view file_group is now determined by the algorithm outlined in my comment, point 1., the (quite dirty) implementation is here
> So how about presenting all structMap entries sorted by their @ORDER (if present) or @ID with the first fptr that shows up?
The page browser now uses all page_ids from ocrd_models.ocrd_mets.OcrdMets.physical_pages (but without taking @ORDER into account) to determine which pages "exist". It then tries to find matching image files from a given file_group to display. If no image is found for the page, a "missing-image" icon is displayed.
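In simplified form, the behaviour described above amounts to something like the following sketch. It is not the actual PageListStore code: `build_rows`, the plain-dict `file_lookup` and the `MISSING_IMAGE` placeholder name are assumptions for illustration.

```python
MISSING_IMAGE = "missing-image"  # hypothetical placeholder icon name


def build_rows(page_ids, file_lookup, placeholder=MISSING_IMAGE):
    """Build one (page_id, image) row per physical page.

    file_lookup maps page_id to the image file of the chosen file_group.
    Pages without an image get the placeholder instead of a KeyError.
    """
    rows = []
    for page_id in page_ids:
        image = file_lookup.get(page_id)  # None if the group lacks this page
        rows.append((page_id, str(image) if image is not None else placeholder))
    return rows
```

Because the list of pages is taken from the physical pages and not from the file group, a fileGrp that covers only some pages no longer breaks the list.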
> I hope I have fixed most of the points now, except the selectable file group, which is quite difficult to implement (but absolutely worth it) and now handled in #9
Great work!
> The initial view file_group is now determined by the algorithm outlined in my comment, point 1., the (quite dirty) implementation is here
Wow, you even have a heuristic for the length of the candidate fileGrps in there!
> The page browser now uses all page_ids from ocrd_models.ocrd_mets.OcrdMets.physical_pages (but without taking @ORDER into account) to determine which pages "exist". It then tries to find matching image files from a given file_group to display. If no image is found for the page a "missing-image" icon is displayed.
Works perfectly, many thanks!