hnesk/browse-ocrd

base file group other than OCR-D-IMG

Closed this issue · 5 comments

I have a METS here which does contain a fileGrp OCR-D-IMG, but not comprising all physical pages. This gives me:

INFO ocrd.resolver.workspace_from_nothing - Writing METS to /tmp/ocrd-core-ttugo0kk/mets.xml
Traceback (most recent call last):
  File "ocrd_browser/ui/window.py", line 88, in _open
    self.page_list.set_document(self.document)
  File "ocrd_browser/ui/page_browser.py", line 39, in set_document
    self.model = PageListStore(self.document)
  File "ocrd_browser/ui/page_store.py", line 56, in __init__
    file = str(file_lookup[page_id])
KeyError: 'f00037100714864

So I dug into ocrd_browser.ui.page_store and thought it might be sufficient to just check page_id in file_lookup before appending a row to the Gtk list. But this raises bigger questions:

  1. Why should the initial view be restricted to pages contained in OCR-D-IMG at all? This could easily just be empty. With practical library systems, the initial image fileGrp could realistically be called MAX, ORIGINAL or something else instead. My understanding of this program is that it should try to present a view of all physical pages (at least initially, before selecting a fileGrp explicitly). So how about presenting all structMap entries sorted by their @ORDER (if present) or @ID with the first fptr that shows up?

  2. How do you change to a different fileGrp? ui.view.base has a View.use_file_group property fixed to OCR-D-IMG.
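To make the suggestion in point 1 concrete, here is a rough sketch of listing all physical pages from the structMap, sorted by @ORDER where present and by @ID otherwise, each with the first fptr that shows up. It uses only the standard library rather than ocrd_models; the function name `pages_in_order` and the sample METS fragment are made up for illustration and are not part of browse-ocrd.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

METS_NS = {"mets": "http://www.loc.gov/METS/"}

# Hypothetical, minimal METS fragment for illustration only.
SAMPLE_METS = """\
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="P2" ORDER="2"><mets:fptr FILEID="IMG_2"/></mets:div>
      <mets:div TYPE="page" ID="P1" ORDER="1">
        <mets:fptr FILEID="IMG_1"/>
        <mets:fptr FILEID="BIN_1"/>
      </mets:div>
      <mets:div TYPE="page" ID="P3"/>
    </mets:div>
  </mets:structMap>
</mets:mets>
"""


def pages_in_order(mets_xml: str) -> List[Tuple[str, Optional[str]]]:
    """Return (page_id, first_file_id) pairs for all physical pages.

    Pages with a numeric @ORDER come first, sorted by that number;
    pages without one follow, sorted by @ID. A page without any fptr
    gets None as its file id.
    """
    root = ET.fromstring(mets_xml)
    path = ".//mets:structMap[@TYPE='PHYSICAL']//mets:div[@TYPE='page']"
    entries = []
    for div in root.iterfind(path, METS_NS):
        page_id = div.get("ID", "")
        order = div.get("ORDER", "")
        fptr = div.find("mets:fptr", METS_NS)  # first fptr only
        file_id = fptr.get("FILEID") if fptr is not None else None
        if order.isdigit():
            sort_key = (0, int(order), page_id)
        else:
            sort_key = (1, 0, page_id)
        entries.append((sort_key, page_id, file_id))
    entries.sort(key=lambda entry: entry[0])
    return [(page_id, file_id) for _, page_id, file_id in entries]


if __name__ == "__main__":
    for page_id, file_id in pages_in_order(SAMPLE_METS):
        print(page_id, file_id)
```

Placing pages without @ORDER at the end is just one possible choice; a real implementation would have to decide how to mix ordered and unordered pages.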

hnesk commented

That's a valid question, and I had the problem myself (original fileGrp not named OCR-D-IMG).
Solutions to your questions:

  1. The fileGrp to display with PageListStore should not be hardcoded to OCR-D-IMG, but should be selectable like in ViewImages. As a default it should try:
  • The first of a (configurable) list of preferred fileGroups to display as images (OCR-D-IMG, MAX, ORIGINAL) which has a mime-type matching image/*.
  • The first fileGroup (sorted by ???? maybe string length, because derived images usually have more complex names than the original?) which has a mime-type matching image/*.
  • If there is no match according to the page_id in file_lookup-logic you described, display a "missing image"-icon
  2. View.use_file_group is overridden in ViewXml and ViewImages. OCR-D-IMG is just the default value for all possible views. The use_file_group implementations in these views are actually quite robust and take the user selection and the availability of the selected fileGrp into account. I think the way to go is to base PageListStore on the same implementation.
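A minimal sketch of the default selection described in point 1, assuming the METS has already been reduced to a plain mapping from fileGrp name to the mime-types of its files. The names `pick_default_image_group` and `PREFERRED_IMAGE_GROUPS` are hypothetical and do not exist in browse-ocrd; the string-length ordering is only the guess mentioned above.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical preference list, using the names mentioned in this thread.
PREFERRED_IMAGE_GROUPS = ("OCR-D-IMG", "MAX", "ORIGINAL")


def pick_default_image_group(
    mimetypes_by_group: Dict[str, List[str]],
    preferred: Sequence[str] = PREFERRED_IMAGE_GROUPS,
) -> Optional[str]:
    """Return the fileGrp to show initially, or None if no group holds images."""
    image_groups = [
        group
        for group, mimetypes in mimetypes_by_group.items()
        if any(mimetype.startswith("image/") for mimetype in mimetypes)
    ]
    # 1. The first preferred group that actually contains images.
    for group in preferred:
        if group in image_groups:
            return group
    # 2. Otherwise the image group with the shortest name, on the guess
    #    that derived groups have longer names (ties broken alphabetically).
    if image_groups:
        return min(image_groups, key=lambda group: (len(group), group))
    # 3. No image group at all: the caller shows "missing image" icons.
    return None


if __name__ == "__main__":
    groups = {
        "OCR-D-SEG-PAGE": ["application/vnd.prima.page+xml"],
        "OCR-D-IMG-BIN": ["image/png"],
        "MAX": ["image/tiff"],
    }
    print(pick_default_image_group(groups))
```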

What do you think?

2: Oh, I see! Yes, sounds reasonable to base the initial view on that as well.

1: Yes, this would be very intuitive behaviour and easy to use IMHO. Or (instead of the second criterion) one could even start with an empty view if the first criterion (fixed/configured list of preferred groups) does not yield any images.

hnesk commented

I hope I have fixed most of the points now, except the selectable file group, which is quite difficult to implement (but absolutely worth it) and now handled in #9

> Why should the initial view be restricted to pages contained in OCR-D-IMG

The initial view file_group is now determined by the algorithm outlined in my comment, point 1., the (quite dirty) implementation is here

> So how about presenting all structMap entries sorted by their @ORDER (if present) or @ID with the first fptr that shows up?

The page browser now uses all page_ids from ocrd_models.ocrd_mets.OcrdMets.physical_pages (but without taking @ORDER into account) to determine which pages "exist". It then tries to find matching image files from a given file_group to display. If no image is found for the page a "missing-image" icon is displayed.
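Roughly, the lookup described above boils down to the following sketch. It is not the actual PageListStore code: `build_page_rows`, `MISSING_IMAGE` and the plain dict standing in for file_lookup are simplified, hypothetical stand-ins.

```python
from typing import Dict, List, Tuple

# Placeholder icon name; the real icon handling lives in the Gtk code.
MISSING_IMAGE = "missing-image"


def build_page_rows(
    page_ids: List[str], file_lookup: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Pair every physical page with its image file, or with a placeholder."""
    rows = []
    for page_id in page_ids:
        # dict.get avoids the KeyError from the original traceback when
        # the chosen file group has no file for this page.
        rows.append((page_id, file_lookup.get(page_id, MISSING_IMAGE)))
    return rows


if __name__ == "__main__":
    lookup = {"PHYS_0001": "OCR-D-IMG/0001.tif"}
    print(build_page_rows(["PHYS_0001", "PHYS_0002"], lookup))
```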

> I hope I have fixed most of the points now, except the selectable file group, which is quite difficult to implement (but absolutely worth it) and now handled in #9

Great work!

> The initial view file_group is now determined by the algorithm outlined in my comment, point 1., the (quite dirty) implementation is here

Wow, you even have a heuristic for the length of the candidate fileGrps in there!

> The page browser now uses all page_ids from ocrd_models.ocrd_mets.OcrdMets.physical_pages (but without taking @ORDER into account) to determine which pages "exist". It then tries to find matching image files from a given file_group to display. If no image is found for the page a "missing-image" icon is displayed.

Works perfectly, many thanks!

hnesk commented

I will close this now, for the rest see #9