OCR-D/ocrd_all

Conflicting and outdated requirements for calamari-ocr

Closed this issue · 12 comments

ocrd_calamari requires calamari-ocr == 1.0.*, while ocrd_cis requires calamari_ocr == 0.3.5. That causes a conflict which is currently only avoided by using a sub venv headless-tf2 for ocrd_calamari. I'd like to get rid of that sub venv as it does not seem to be necessary otherwise.

In addition, both requirements are really outdated. The latest release is 2.1.4. It is desirable to use that.

ocrd_calamari will be updated to support Calamari 2.1.x.

It will then still need TensorFlow 2, other processors need TensorFlow 1, so you can't get around it.

ocrd_calamari requires calamari-ocr == 1.0.*, while ocrd_cis requires calamari_ocr == 0.3.5.

The latter is an artefact of the very early days (before ocrd_calamari even existed) and will likely be removed soon. See cisocrgroup/ocrd_cis#88

That causes a conflict which is currently only avoided by using a sub venv headless-tf2 for ocrd_calamari.

That interpretation of the situation is wrong. We need headless-tf2 because you cannot have TF2 and TF1 in the same venv. And besides ocrd_calamari, also ocrd_pc_segmentation and ocrd_anybaseocr depend on TF2.

In addition, both requirements are really outdated. The latest release is 2.1.4. It is desirable to use that.

As Mike already pointed out, this is already planned – see OCR-D/ocrd_calamari#61 – but is not an ocrd_all issue.

Can we close?

That interpretation of the situation is wrong. We need headless-tf2 because you cannot have TF2 and TF1 in the same venv. And besides ocrd_calamari, also ocrd_pc_segmentation and ocrd_anybaseocr depend on TF2.

TF2 can simply be installed in the default venv. There is no need for an extra headless-tf2 venv as soon as the conflict is fixed, for example by removing ocrd_cis.

The issue can be closed after removing that outdated submodule.

Maybe we can agree that submodules which use recent software versions should ideally be installed in the default venv.

Additional venvs should be avoided where possible. They cannot be avoided for submodules which use incompatible old software versions, so for example headless-tf1 is needed as long as submodules depend on Tensorflow 1.x

If ocrd_cis requires calamari 0.3.5 then it should probably go to the environment supporting TF 1.x.

for example by removing ocrd_cis.

The issue can be closed after removing that outdated submodule

What are you talking about? ocrd_cis is one of the most important modules we have.

TF2 can simply be installed in the default venv

TF in the main venv has brought us into mutual dependency hell with other modules in the past (numpy IIRC), probably does today, and certainly will in the future. Abandoning this long held precaution makes no sense, because you don't even save up much space (the large packages are TF and its dependents, not the stuff in the main venv).

Same is true for headless-torch14 BTW (which I saw you attempt to remove in some other branch).

@mikegerber

If ocrd_cis requires calamari 0.3.5 then it should probably go to the environment supporting TF 1.x.

Yes, but that old version does not depend on TF though (only protobuf). So my take is still that this should be removed from ocrd_cis soon.

What are you talking about? ocrd_cis is one of the most important modules we have.

I misunderstood your "will likely be removed soon". Then let's wait until for the updates of ocrd_cis and ocrd_calamari which hopefully will fix the conflict and allow closing this issue.

Then let's wait until for the updates of ocrd_cis and ocrd_calamari which hopefully will fix the conflict and allow closing this issue.

I don't see any conflict. They reside in different venvs.

If you hope to achieve all-in-one (fat) builds without sub-venvs, then let me reiterate that this is fruitless: there will always be some conflicts around the corner in this evolving federated system. We'll need to manage sub-venvs as long as we have no other isolation mechanism (think: networking).

If you hope to achieve all-in-one (fat) builds without sub-venvs, then let me reiterate that this is fruitless: there will always be some conflicts around the corner in this evolving federated system. We'll need to manage sub-venvs as long as we have no other isolation mechanism (think: networking).

100% agree with this. You'll always have conflicting dependencies and it's best to have a solution that works with this fact - separate venvs or containers.

For the current ocrd_all submodules it is sufficient to have two separate venvs (one main venv, one sub-venv). That works without dependency conflicts for different versions of Ubuntu / Debian and Python3 and minimizes build time and storage requirements: https://github.com/stweil/ocrd_all/actions/runs/2029562843.