docker: mixing preinstalled and user-downloaded resources
In https://ocr-d.de/en/models we have this paragraph:

> To download models to `./models` in the host FS and `/usr/local/share/ocrd-resources` in Docker:
>
> ```sh
> docker run --user $(id -u) \
>   --volume $PWD/models:/usr/local/share/ocrd-resources \
>   ocrd/all \
>   ocrd resmgr download ocrd-tesserocr-recognize eng.traineddata\; \
>   ocrd resmgr download ocrd-calamari-recognize default\; \
>   ...
> ```
>
> To run processors, as usual do:
>
> ```sh
> docker run --user $(id -u) --workdir /data \
>   --volume $PWD/data:/data \
>   --volume $PWD/models:/usr/local/share/ocrd-resources \
>   ocrd/all ocrd-tesserocr-recognize -I IN -O OUT -P model eng
> ```
>
> This principle applies to all `ocrd/*` Docker images, e.g. you can replace `ocrd/all` above with `ocrd/tesserocr` as well.
@kba came up with the question whether that is still correct for `ocrd-tesserocr-*`:
https://github.com/OCR-D/ocrd-website/pull/348/files/daf4fce2135f58b4f4ba43ef30e375e161e89651#r1159502451
ah, did not see the issue linked directly from the review discussion.
Copying my follow-ups:
So we do still have a problem with our model volume logic here. Modules like ocrd_tesserocr or workflow-configuration (`ocrd-page-transform` ...) want to have their stuff under /usr/local/share/XYZ, while others use /usr/local/share/ocrd-resources.

Everything is ok for the preinstalled resources (tool json files for bashlib processors, preset files for ocrd-page-transform, minimal models for Tesseract). But as soon as you want to install additional models persistently, we cannot offer anything ATM.

(So this is not just about the right kind of recipe covering the volume mapping, but accommodating a single `module` location inside the Docker image – because it is prebuilt – with persistent updates we usually do via the `data` location...)
For a native installation based on a contiguous file system, this is not an issue: `ocrd resmgr` inspects the resource location priority from the processor's ocrd-tool.json (and, if applicable, its `--dump-module-dir`) and puts the files in the right place.
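For illustration, on a native install the same kind of download command ends up in different places depending on the processor's declared resource location. The paths shown below are the usual defaults and serve only as an example; the exact target depends on each processor's ocrd-tool.json:

```sh
# most processors: data location under XDG_DATA_HOME (default ~/.local/share)
ocrd resmgr download ocrd-calamari-recognize default
#  -> ~/.local/share/ocrd-resources/ocrd-calamari-recognize/default/...

# ocrd-tesserocr-recognize: module location, i.e. the tessdata directory
ocrd resmgr download ocrd-tesserocr-recognize eng.traineddata
#  -> $(ocrd-tesserocr-recognize --dump-module-dir)/eng.traineddata
```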
For running Docker containers with volume mapping, there are two possible solutions AFAICS:

1. Provide additional `--volume` options for every processor that uses `module` resources
2. Symlink the `module` locations inside the Docker container back to the mounted `data` location, i.e. /usr/local/share/ocrd-resources

The advantage of 1. is that it requires no change in ocrd_all and allows full flexibility. The drawbacks are that users must know about this issue, it is error-prone, and it makes the calls more complicated.

The advantage of 2. is that it is transparent to the users: they only need one `--volume` for all the resources, as is currently documented. The drawback is that we need to implement this in the container build process, which makes it slightly more complex, and it breaks the distinction between resource locations.
I would prefer 2. unless there are fundamental issues with that approach I'm not seeing.
I would do it like this:

- Maintain or generate a list of processors that prefer the `module` location
- After building the image, iterate over those processors `ocrd-p`:

  ```sh
  mv $(ocrd-p --dump-module-dir) /usr/local/share/ocrd-resources/ocrd-p
  ln -s /usr/local/share/ocrd-resources/ocrd-p $(ocrd-p --dump-module-dir)
  ```

Now, if you mount with `--volume $PWD/models:/usr/local/share/ocrd-resources` and run `ocrd resmgr download ocrd-p some-resource.ext`, it will be available from the `moduledir` location but persisted in the `data` location.
> 1. Provide additional `--volume` options for every processor that uses `module` resources

No, even that does not work, like I said above. If you mount a volume for these paths, then the preinstalled resources will become hidden (minimum default models from `install-tesseract` in ocrd_all/tesseract's case).
> 2. Symlink the `module` locations inside the Docker container back to the mounted `data` location, i.e. /usr/local/share/ocrd-resources

[...]

> I would do it like this:

[...]

Yes, but you would have to do this step after the models volume is already mounted, so the `mv` will effectively go from container to host storage – because in a `docker run --volume $PWD/models:/usr/local/share/ocrd-resources`, everything under /usr/local/share/ocrd-resources inside the container gets masked.

So this would have to be some sort of post-installation / run-time script. (Perhaps we could define an ENTRYPOINT for this: a shell script that checks if /usr/local/share/ocrd-resources has been mounted, then iterates all the module directories; if the symlink is already in place it skips it, otherwise it swaps the preinstalled resources for symlinks. And in the end, it delegates to the CMD.)
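A minimal sketch of what such an ENTRYPOINT script could look like. The module directory list is only a placeholder here (in practice it would have to be generated during the image build), and it assumes mountpoint(1) is available in the image:

```sh
#!/bin/sh
# hypothetical entrypoint sketch – directory list and paths are assumptions
MODULE_DIRS="/usr/local/share/tessdata /usr/local/share/workflow-configuration"
RESDIR=/usr/local/share/ocrd-resources

# only act if a volume has been mounted over the resource directory
if mountpoint -q "$RESDIR"; then
    for moduledir in $MODULE_DIRS; do
        name=$(basename "$moduledir")
        # already swapped for a symlink on a previous run – nothing to do
        if [ -L "$moduledir" ]; then
            continue
        fi
        # persist the preinstalled resources on the volume, then symlink back
        mkdir -p "$RESDIR/$name"
        cp -a "$moduledir/." "$RESDIR/$name/"
        rm -rf "$moduledir"
        ln -s "$RESDIR/$name" "$moduledir"
    done
fi

# delegate to the CMD
exec "$@"
```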
@kba Do I understand correctly that the suggestions provided here are related to https://github.com/OCR-D/ocrd_all and we can move the issue to that repo? The suggestion by @bertsky sounds like something that should happen automatically and therefore does not have to be documented in a Setup Guide/User Guide:

> So this would have to be some sort of post-installation / run-time script. (Perhaps we could define an ENTRYPOINT for this: a shell script that checks if /usr/local/share/ocrd-resources has been mounted, then iterates all the module directories; if the symlink is already in place it skips it, otherwise it swaps the preinstalled resources for symlinks. And in the end, it delegates to the CMD.)

Or what do you think?
So our final idea was as follows:

- Symlink the `module` locations inside the Docker container back to the mounted `data` location, i.e. /usr/local/share/ocrd-resources
- Do this step after the models volume is already mounted, so the `mv` will effectively go from container to host storage – because in a `docker run --volume $PWD/models:/usr/local/share/ocrd-resources`, everything under /usr/local/share/ocrd-resources inside the container gets masked.
- Define an ENTRYPOINT for this: a shell script that
  - checks if /usr/local/share/ocrd-resources has been mounted
  - then iterates all the module directories:

    ```sh
    DATADIR=/usr/local/share/ocrd-resources/$EXECUTABLE
    if [[ ! -d $DATADIR ]]; then
        # make a persistent copy
        cp -r $MODULEDIR $DATADIR
    fi
    # make sure downloaded models get persisted, too
    ln -fs $DATADIR $MODULEDIR
    ```

  - in the end, delegates to the CMD
Alas, it does not work like this. Module directories frequently contain much more than just resource files (hence our file type filter in `ocrd_utils.list_all_resources`). More importantly, modules contain more than one processor – so the symlinking step (even with `--no-target-directory`) will clash. Also, the whole procedure is slow (even when using ocrd-all-tool.json), because each processor would have to be executed for `--dump-module-dir`.
> 1. Provide additional `--volume` options for every processor that uses `module` resources
>
> No, even that does not work, like I said above. If you mount a volume for these paths, then the preinstalled resources will become hidden (minimum default models from `install-tesseract` in ocrd_all/tesseract's case).
But if you use a named volume, then by default (unless using the `nocopy` option), the container-internal files will in fact be copied to the host volume!
I have tested this – it works beautifully:

```sh
docker run -v tessdata:/usr/local/share/tessdata ocrd/all:maximum-cuda ocrd resmgr download ocrd-tesserocr-recognize frak2021.traineddata
docker run -v tessdata:/usr/local/share/tessdata ocrd/all:maximum-cuda ocrd-tesserocr-recognize -P model frak2021 ...
```
So one gets persistence and internal content. One can even interact with the filesystem on the host – it's just under a path which Docker chooses, in this case:

```sh
docker volume inspect tessdata | jq '.[] | .Mountpoint'
```

(which yields `/data/docker/volumes/tessdata/_data`).
So all we need to do is:

1. Add a `RUN` step in the Dockerfile which iterates all processors and aggregates their `--dump-module-dir`, as described here
2. Add a `RUN` step in the Dockerfile which iterates all these module directories and moves them into `$XDG_DATA_HOME/ocrd-resources/EXECUTABLE`, but then also symlinks in the reverse direction (see the sketch below)
3. Change our documentation to use a named volume, e.g. `-v ocrd-models:/usr/local/share/ocrd-resources`, instead of a bind-mount. (For /data, we can of course keep the bind-mount.)

(Care must be taken during 2. though to make sure that for each module, only one executable gets swapped like this; the others simply get symlinked from the first executable. Of course, if our ocrd-tool.json had specs on which executable needs which resources, it would be cleaner, but this should suffice.)
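A rough sketch of what the shell behind those two `RUN` steps could look like. It assumes the executables live under /usr/local/bin, report their module directory via `--dump-module-dir`, and that `XDG_DATA_HOME` is set in the image; the loop and file names are illustrative, not the actual ocrd_all recipe:

```sh
# 1. aggregate the module directories of all installed processors
for exe in /usr/local/bin/ocrd-*; do
    exe=$(basename "$exe")
    moduledir=$("$exe" --dump-module-dir 2>/dev/null) || continue
    echo "$exe $moduledir"
done | sort -u -k2,2 > /tmp/moduledirs.txt   # keep only one executable per module dir

# 2. move each module dir into the data location and symlink it back
#    (symlinking further executables of the same module to the first one's
#     datadir is omitted here for brevity)
while read -r exe moduledir; do
    datadir="$XDG_DATA_HOME/ocrd-resources/$exe"
    mkdir -p "$(dirname "$datadir")"
    mv "$moduledir" "$datadir"
    ln -s "$datadir" "$moduledir"
done < /tmp/moduledirs.txt
```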
But I doubt we should really mount every processor's module directory (i.e. usually their module's distribution directory) like this. Because of editable mode, that would even move all source files to the named volume!

The only actual case where we have a processor that has downloadable resources but can only use the module location is ocrd-tesserocr-recognize. So we are indeed talking about an exception for now.
Therefore, I propose going even simpler: setting `TESSDATA = $(XDG_DATA_HOME)/ocrd-resources/ocrd-tesserocr-recognize` instead of

Line 799 in 8a68597

Thus, when running with `-v ocrd-models:/usr/local/share/ocrd-resources`, nothing further needs to be done.
> Thus, when running with `-v ocrd-models:/usr/local/share/ocrd-resources`, nothing further needs to be done.
Having implemented that in #380 and documented it in OCR-D/ocrd-website#357, let's go one step further: why not just symlink the internal /usr/local/share/ocrd-resources to the simpler /models, so users have less effort typing (and remembering) these calls (i.e. it would simply become `-v $PWD:/data -v ocrd-models:/models`)?
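A minimal sketch of what that could look like during the image build; whether the actual change in the linked PRs is implemented exactly this way is an assumption here:

```sh
# sketch: make the canonical resource path a symlink to the shorter /models,
# so that `-v ocrd-models:/models` also covers /usr/local/share/ocrd-resources
mkdir -p /models
if [ -d /usr/local/share/ocrd-resources ] && [ ! -L /usr/local/share/ocrd-resources ]; then
    # preserve anything already preinstalled under the canonical path
    cp -a /usr/local/share/ocrd-resources/. /models/
    rm -rf /usr/local/share/ocrd-resources
fi
ln -s /models /usr/local/share/ocrd-resources
```

With that in place, downloading and running would only need the shorter mount, e.g. `docker run -v $PWD:/data -v ocrd-models:/models ...`.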
> let's go one step further: Why not just symlink the internal /usr/local/share/ocrd-resources to the simpler /models
done (in both PRs).