OCR-D/ocrd_all

docker: mixing preinstalled and user-downloaded resources


In https://ocr-d.de/en/models we have this paragraph:

To download models to ./models in the host FS and /usr/local/share/ocrd-resources in Docker:

docker run --user $(id -u) \
  --volume $PWD/models:/usr/local/share/ocrd-resources \
  ocrd/all \
  ocrd resmgr download ocrd-tesserocr-recognize eng.traineddata\; \
  ocrd resmgr download ocrd-calamari-recognize default\; \
  ...

To run processors, as usual do:

docker run --user $(id -u) --workdir /data \
  --volume $PWD/data:/data \
  --volume $PWD/models:/usr/local/share/ocrd-resources \
  ocrd/all ocrd-tesserocr-recognize -I IN -O OUT -P model eng

This principle applies to all ocrd/* Docker images, e.g. you can replace ocrd/all above with ocrd/tesserocr as well.

@kba raised the question whether that is still correct for ocrd-tesserocr-*:
https://github.com/OCR-D/ocrd-website/pull/348/files/daf4fce2135f58b4f4ba43ef30e375e161e89651#r1159502451

ah, I did not see the issue linked directly from the review discussion.

Copying my follow-ups:

sigh this question keeps being asked...

TESSDATA = $(VIRTUAL_ENV)/share/tessdata/

cd $(VIRTUAL_ENV)/build/tesseract && $(CURDIR)/tesseract/configure --prefix="$(VIRTUAL_ENV)" $(TESSERACT_CONFIG)

ocrd_all/Dockerfile

Lines 32 to 33 in b36cec8

ENV PREFIX=/usr/local
ENV VIRTUAL_ENV $PREFIX

so – no, /usr/local/share/ocrd-resources will be ignored by ocrd_tesserocr when installed via ocrd_all: since VIRTUAL_ENV equals PREFIX, i.e. /usr/local, the module resource location is /usr/local/share/tessdata, as @kba wrote.
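This can be verified from within the container (the printed path is the expectation following from the Makefile and Dockerfile lines above, not a captured log):

docker run --rm ocrd/all ocrd-tesserocr-recognize --dump-module-dir
# expected: /usr/local/share/tessdata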

So we do still have a problem with our model volume logic here. Modules like ocrd_tesserocr or workflow-configuration (ocrd-page-transform ...) want to have their stuff under /usr/local/share/XYZ, while others use /usr/local/share/ocrd-resources.

Everything is ok for the preinstalled resources (tool json files for bashlib processors, preset files for ocrd-page-transform, minimal models for Tesseract). But as soon as you want to install additional models persistently, we cannot offer anything ATM.

(So this is not just about the right kind of recipe covering the volume mapping, but about accommodating a single module location inside the Docker image – because it is prebuilt – with the persistent updates we usually do via the data location...)

kba commented

So we do still have a problem with our model volume logic here. Modules like ocrd_tesserocr or workflow-configuration (ocrd-page-transform ...) want to have their stuff under /usr/local/share/XYZ, while others use /usr/local/share/ocrd-resources.

Everything is ok for the preinstalled resources (tool json files for bashlib processors, preset files for ocrd-page-transform, minimal models for Tesseract). But as soon as you want to install additional models persistently, we cannot offer anything ATM.

(So this is not just about the right kind of recipe covering the volume mapping, but about accommodating a single module location inside the Docker image – because it is prebuilt – with the persistent updates we usually do via the data location...)

For a native installation on a contiguous file system, this is not an issue: ocrd resmgr inspects the resource location priority from the processor's ocrd-tool.json and, if applicable, its --dump-module-dir, and puts the files in the right place.
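For example, a module can restrict its processors to the module location via the resource_locations field in its ocrd-tool.json – an illustrative excerpt, not quoted verbatim from any module:

  "ocrd-tesserocr-recognize": {
    "resource_locations": ["module"]
  }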

For running docker containers and volume mapping, there are two possible solutions AFAICS:

  1. Provide additional --volume options for every processor that uses module resources
  2. Symlink the module locations inside the docker container back to the data location mounted, i.e. /usr/local/share/ocrd-resources

The advantage of 1. is that it requires no change in ocrd_all and allows full flexibility. The drawbacks are that users must know about this issue, it is error-prone and makes the calls more complicated.
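For illustration, a run with option 1 would look roughly like this (the extra tessdata mount is a sketch based on the module path established above):

docker run --user $(id -u) --workdir /data \
  --volume $PWD/data:/data \
  --volume $PWD/models:/usr/local/share/ocrd-resources \
  --volume $PWD/tessdata:/usr/local/share/tessdata \
  ocrd/all ocrd-tesserocr-recognize -I IN -O OUT -P model eng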

The advantage of 2. is that it is transparent to the users: they only need one --volume for all the resources, as currently documented. The drawback is that we need to implement this in the container build process, which makes it slightly more complex, and it blurs the distinction between resource locations.

I would prefer 2. unless there are fundamental issues with that approach I'm not seeing.

I would do it like this:

  • Maintain or generate a list of processors that prefer the module location
  • After building the image, iterate over those processors ocrd-p:
    • mv $(ocrd-p --dump-module-dir) /usr/local/share/ocrd-resources/ocrd-p
    • ln -s /usr/local/share/ocrd-resources/ocrd-p $(ocrd-p --dump-module-dir)

Now, if you mount with --volume $PWD/models:/usr/local/share/ocrd-resources, and run ocrd resmgr download ocrd-p some-resource.ext, it will be available from the moduledir location but persisted in the data location.
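A minimal shell sketch of those two steps (the processor list is a placeholder; error handling omitted):

mkdir -p /usr/local/share/ocrd-resources
for exe in ocrd-tesserocr-recognize ocrd-page-transform; do  # placeholder list
  moduledir=$($exe --dump-module-dir)
  mv "$moduledir" /usr/local/share/ocrd-resources/$exe
  ln -s /usr/local/share/ocrd-resources/$exe "$moduledir"
done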

  1. Provide additional --volume options for every processor that uses module resources

No, even that does not work, as I said above. If you mount a volume for these paths, then the preinstalled resources will become hidden (the minimal default models from install-tesseract in ocrd_all/tesseract's case).
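The masking effect is easy to demonstrate (a sketch; the empty host directory is hypothetical):

mkdir -p empty
docker run --rm --volume $PWD/empty:/usr/local/share/tessdata ocrd/all \
  ls /usr/local/share/tessdata
# prints nothing – the preinstalled traineddata files are hidden behind the bind-mount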

2. Symlink the module locations inside the docker container back to the data location mounted, i.e. /usr/local/share/ocrd-resources
[...]
I would do it like this:
[...]

Yes, but you would have to do this step after the models volume is already mounted, so the mv will effectively go from container to host storage – because in a docker run --volume $PWD/models:/usr/local/share/ocrd-resources, everything under /usr/local/share/ocrd-resources inside the container gets masked.

So this would have to be some sort of a post-installation / run-time script. (Perhaps we could define an ENTRYPOINT for this: a shell script that checks if /usr/local/share/ocrd-resources has been mounted, then iterates all the module directories, if the symlink is already in place skips it, otherwise swaps the preinstalled resources for symlinks. And in the end, it delegates to the CMD.)

@kba Do I understand correctly that the suggestions provided here relate to https://github.com/OCR-D/ocrd_all, and that we can move the issue to that repo? The suggestion by @bertsky sounds like something that should happen automatically and therefore does not have to be documented in a Setup Guide/User Guide.

So this would have to be some sort of a post-installation / run-time script. (Perhaps we could define an ENTRYPOINT for this: a shell script that checks if /usr/local/share/ocrd-resources has been mounted, then iterates all the module directories, if the symlink is already in place skips it, otherwise swaps the preinstalled resources for symlinks. And in the end, it delegates to the CMD.)

Or what do you think?

So our final idea was as follows:

Symlink the module locations inside the docker container back to the data location mounted, i.e. /usr/local/share/ocrd-resources
do this step after the models volume is already mounted, so the mv will effectively go from container to host storage – because in a docker run --volume $PWD/models:/usr/local/share/ocrd-resources, everything under /usr/local/share/ocrd-resources inside the container gets masked.
define an ENTRYPOINT for this: a shell script that

  • checks if /usr/local/share/ocrd-resources has been mounted
  • then iterates all the module directories:
    DATADIR=/usr/local/share/ocrd-resources/$EXECUTABLE
    if [[ ! -d $DATADIR ]]; then
      # make a persistent copy
      cp -r $MODULEDIR $DATADIR
    fi
    # make sure downloaded models get persisted, too
    ln -fs $DATADIR $MODULEDIR
    
  • in the end, it delegates to the CMD
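Spelled out as a full script, the idea would be roughly the following sketch (the processor list and the mountpoint check are illustrative; note the module directory must be removed before symlinking, otherwise ln would place the link inside it):

#!/bin/sh
set -e
DATAROOT=/usr/local/share/ocrd-resources
if mountpoint -q "$DATAROOT"; then
  for exe in ocrd-tesserocr-recognize; do  # illustrative list of module-location processors
    MODULEDIR=$($exe --dump-module-dir)
    DATADIR=$DATAROOT/$exe
    if [ ! -L "$MODULEDIR" ]; then
      if [ ! -d "$DATADIR" ]; then
        # make a persistent copy of the preinstalled resources
        cp -r "$MODULEDIR" "$DATADIR"
      fi
      # make sure downloaded models get persisted, too
      rm -r "$MODULEDIR"
      ln -s "$DATADIR" "$MODULEDIR"
    fi
  done
fi
# in the end, delegate to the CMD
exec "$@"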

Alas, it does not work like this. Module directories frequently contain much more than just resource files (hence our file type filter in ocrd_utils.list_all_resources). More importantly, modules contain more than one processor – so the symlinking step (even with --no-target-directory) will clash. Also, the whole procedure is slow (even when using ocrd-all-tool.json), because each processor will have to be executed for --dump-module-dir.

  1. Provide additional --volume options for every processor that uses module resources

No, even that does not work, as I said above. If you mount a volume for these paths, then the preinstalled resources will become hidden (the minimal default models from install-tesseract in ocrd_all/tesseract's case).

But if you use a named volume, then by default (unless using the nocopy option), the container-internal files will in fact be copied to the host volume!

I have tested this – it works beautifully:

docker run -v tessdata:/usr/local/share/tessdata ocrd/all:maximum-cuda ocrd resmgr download ocrd-tesserocr-recognize frak2021.traineddata
docker run -v tessdata:/usr/local/share/tessdata ocrd/all:maximum-cuda ocrd-tesserocr-recognize -P model frak2021 ...

So one gets both persistence and the preinstalled internal content. One can even interact with the filesystem on the host; it's just under a path that Docker chooses, in this case:

docker volume inspect tessdata | jq '.[] | .Mountpoint'

(which yields /data/docker/volumes/tessdata/_data).

So all we need to do is:

  1. add a RUN step in the Dockerfile which iterates all processors and aggregates their --dump-module-dir, as described here
  2. add a RUN step in the Dockerfile which iterates all these module directories and moves them into $XDG_DATA_HOME/ocrd-resources/EXECUTABLE but then also symlinks in the reverse direction
  3. change our documentation to use a named volume, e.g. -v ocrd-models:/usr/local/share/ocrd-resources instead of a bind-mount. (For /data, we can of course keep the bind-mount.)

(Care must be taken during step 2, though, to make sure that for each module, only one executable gets swapped like this; the others simply get symlinked from the first executable. Of course, if our ocrd-tool.json had specs on which executable needs which resources, it would be cleaner, but this should suffice.)
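A sketch of such a RUN step (executable discovery, the XDG_DATA_HOME value and the de-duplication are illustrative):

RUN set -e; \
    mkdir -p $XDG_DATA_HOME/ocrd-resources; \
    for exe in /usr/local/bin/ocrd-*; do \
        name=$(basename $exe); \
        moduledir=$($name --dump-module-dir); \
        datadir=$XDG_DATA_HOME/ocrd-resources/$name; \
        if [ -L "$moduledir" ]; then \
            ln -s "$(readlink "$moduledir")" "$datadir"; \
        else \
            mv "$moduledir" "$datadir"; \
            ln -s "$datadir" "$moduledir"; \
        fi; \
    done

(The -L branch covers the case where another executable of the same module has already been swapped: its data directory simply becomes a symlink to the first one.)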

But I doubt we should really mount every processor's module directory (i.e. usually their module's distribution directory) like this. Because of editable mode, that would even move all source files to the named volume!

The only actual case where we have a processor that has downloadable resources, but can only use the module location, is ocrd-tesserocr-recognize. So we are indeed talking about an exception for now.

Therefore, I propose going even simpler: setting TESSDATA = $(XDG_DATA_HOME)/ocrd-resources/ocrd-tesserocr-recognize instead of

TESSDATA = $(VIRTUAL_ENV)/share/tessdata/

Thus, when running with -v ocrd-models:/usr/local/share/ocrd-resources, nothing further needs to be done.

Thus, when running with -v ocrd-models:/usr/local/share/ocrd-resources, nothing further needs to be done.

Having implemented that in #380 and documented it in OCR-D/ocrd-website#357, let's go one step further: Why not just symlink the internal /usr/local/share/ocrd-resources to the simpler /models, so users have less effort typing (and remembering) these calls (i.e. it would simply become -v $PWD:/data -v ocrd-models:/models)?
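In the Dockerfile, that could be as simple as the following sketch (assuming /models does not yet exist in the image, and moving the preinstalled content first so it stays reachable for the named-volume copy-up):

RUN mv /usr/local/share/ocrd-resources /models && \
    ln -s /models /usr/local/share/ocrd-resources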

let's go one step further: Why not just symlink the internal /usr/local/share/ocrd-resources to the simpler /models

done (in both PRs).