OCR-D/ocrd_all

ocrd_all - Release v2023-06-14 - issue with GPU

Closed this issue · 14 comments

Hi,

I have just installed the ocrd_all release v2023-06-14, and it looks like I have an issue with GPU/CUDA.

Hint: I use Ubuntu 22.04.1

In detail:
I have downloaded the latest version with:

cd ~/ocrd_all
git pull

Then I ran:

sudo make deps-cuda

I have created a new VENV like this:

cd ~
python3.8 -m venv ocrd-3.8

(Remark: Python 3.8 was already available using "deadsnakes" repo)

Next I have done the "main make" like this:

source ~/ocrd-3.8/bin/activate
cd ~/ocrd_all
make all CUDA_VERSION=11.8

--> I have used the "CUDA_VERSION" parameter because on my system CUDA version 11.6 is somehow the default, and I have seen that CUDA version 11.8 was installed by sudo make deps-cuda.
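
(For reference, a quick way to check which CUDA toolkit a system defaults to; these are standard NVIDIA/driver tools, not part of ocrd_all, and the paths are just what is typical on Ubuntu:)

nvcc --version           # toolkit version found on PATH, if any
nvidia-smi               # driver version and the highest CUDA version it supports
ls -ld /usr/local/cuda*  # locally installed toolkit trees and the cuda symlink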

This has run successfully with one remark:
Somewhere in between I have seen this error message:

...
Synchronizing submodule url for 'ocrd_fileformat/repo/ocr-fileformat/vendor/xsd-validator'
if git submodule status --recursive ocrd_fileformat | grep -qv '^ '; then \
        sem -q --will-cite --fg --id ocrd_all_git git submodule update --init --recursive  ocrd_fileformat && \
        touch ocrd_fileformat; fi
fatal: failed to recurse into submodule 'ocrd_fileformat'
Submodule path 'ocrd_fileformat': checked out '4e7e0de68e2a0dcd9b238f64d1657beda0d74da7'
Submodule path 'ocrd_fileformat/repo/ocr-fileformat': checked out 'f550411669a7c807800d2e9f5649e10871c7f172'
Submodule path 'ocrd_fileformat/repo/ocr-fileformat/vendor/page-to-alto/repo/assets': checked out '4b3ba753bfd005457221880282c3c0a2afe1de98'
Submodule path 'ocrd_fileformat/repo/ocr-fileformat/vendor/page-to-alto/repo/page-alto-resources': checked out '9e0222ed51ea8dc5bff4c4b07855e6b09796ae00'
Submodule path 'ocrd_fileformat/repo/ocr-fileformat/vendor/textract2page': checked out '0a1f2b78760237e4fba3298b873fc33c905929b1'
Submodule 'vendor/textract2page' (https://github.com/slub/textract2page.git) registered for path 'ocrd_fileformat/repo/ocr-fileformat/vendor/textract2page'
...

--> Nevertheless, this make run has completed successfully
(but maybe I have overlooked more "hidden" errors like this).

If I now do make test-cuda (in this VENV) I get this error:
(I get the same error without using parameter "CUDA_VERSION")

(ocrd-3.8) gputest@linuxgputest2:~/ocrd_all$ make test-cuda  CUDA_VERSION=11.8
. /home/gputest/ocrd-3.8/bin/activate && python3 -c "from shapely.geometry import Polygon; import torch; torch.randn(10).cuda()"
. /home/gputest/ocrd-3.8/bin/activate && python3 -c "import torch, sys; sys.exit(0 if torch.cuda.is_available() else 1)"
. /home/gputest/ocrd-3.8/bin/activate && python3 -c "import tensorflow as tf, sys; sys.exit(0 if tf.test.is_gpu_available() else 1)"
2023-06-20 12:00:00.743298: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-20 12:00:01.271903: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
WARNING:tensorflow:From <string>:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2023-06-20 12:00:02.009878: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-06-20 12:00:02.033547: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-06-20 12:00:02.033760: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-06-20 12:00:02.432537: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-06-20 12:00:02.432752: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-06-20 12:00:02.432912: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-06-20 12:00:02.433072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /device:GPU:0 with 6137 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3050, pci bus id: 0000:01:00.0, compute capability: 8.6
. /home/gputest/ocrd-3.8/sub-venv/headless-tf1/bin/activate && python3 -c "import tensorflow as tf, sys; sys.exit(0 if tf.test.is_gpu_available() else 1)"
2023-06-20 12:00:02.878494: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.
2023-06-20 12:00:03.468269: I tensorflow/core/platform/profile_utils/cpu_utils.cc:109] CPU Frequency: 2899885000 Hz
2023-06-20 12:00:03.468695: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x41f9530 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-06-20 12:00:03.468709: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2023-06-20 12:00:03.469928: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2023-06-20 12:00:03.519608: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1082] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-06-20 12:00:03.519827: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x32a10e0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-06-20 12:00:03.519841: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce RTX 3050, Compute Capability 8.6
2023-06-20 12:00:03.519990: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1082] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-06-20 12:00:03.520106: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1669] Found device 0 with properties:
name: NVIDIA GeForce RTX 3050 major: 8 minor: 6 memoryClockRate(GHz): 1.777
pciBusID: 0000:01:00.0
2023-06-20 12:00:03.520124: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2023-06-20 12:00:03.521351: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: /home/gputest/ocrd-3.8/sub-venv/headless-tf1/lib/python3.8/site-packages/tensorflow_core/python/../../nvidia/cublas/lib/libcublas.so.11: undefined symbol: cublasLt_for_cublas_HSS, version libcublasLt.so.11; LD_LIBRARY_PATH: /usr/local/cuda-11.6/lib64:
2023-06-20 12:00:03.523105: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2023-06-20 12:00:03.523309: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2023-06-20 12:00:03.526449: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11
2023-06-20 12:00:03.527013: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2023-06-20 12:00:03.527138: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2023-06-20 12:00:03.527162: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1692] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2023-06-20 12:00:03.527190: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1209] Device interconnect StreamExecutor with strength 1 edge matrix:
2023-06-20 12:00:03.527196: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1215]      0
2023-06-20 12:00:03.527199: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1228] 0:   N
make: *** [Makefile:741: test-cuda] Error 1

At least my standard test workflow works fine (but it does not use the GPU).

Release v2023-03-26

that's before the recent fixes in #362 and OCR-D/core#1041

Please use the most recent version.

Also, IMO the report should go to ocrd_all repo, not here. (There's a test-cuda target there, too.)

Sorry for using the wrong repo ...
Concerning the "most recent version":
I did a git pull today.
So maybe the most recent version is not yet published?

ok, I guess you did use the current version after all. From the title of the issue it sounded like an older checkout. (Current tip is d8cdeec. The most recent tag is v2023-06-14. I have no idea where your v2023-03-26 comes from...)

In detail: I have downloaded the latest version with:

cd ~/ocrd_all
git pull

Then I ran:

sudo make deps-cuda

That's not the correct procedure after an update, though. You first have to make sure your submodules are up to date. Doing make all would have implied that, but deps-cuda only implies an update of core – and under sudo privileges, this will likely chown parts of .git. Therefore the README recommends doing make ocrd (without sudo) before make deps-cuda.
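
In other words, the rough order after an update would be (just a sketch; make modules instead of make ocrd also works, as discussed further below):

cd ~/ocrd_all
git pull
make ocrd                   # updates core (make modules would update all submodules), no sudo
sudo make deps-cuda         # system-wide CUDA/cuDNN dependencies
make all CUDA_VERSION=11.8  # full build in the active venv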

Can you check with git submodule status and find .git -user 0?

I have used the "CUDA_VERSION" parameter because on my system CUDA version 11.6 is somehow the default, and I have seen that CUDA version 11.8 was installed by sudo make deps-cuda.

Yes, that's the best way to do it. The version identifier recipe in ocrd_detectron2 picks whatever matches first, unless this override is used. Alas, at the moment we cannot guarantee that ocrd_kraken and ocrd_typegroups_classifier (which also depend on PyTorch) do not overwrite it with their own version. The best way to check is make test-cuda.
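
Besides make test-cuda, a quick sanity check of which CUDA build actually ended up in the venv could look like this (a sketch only; torch.version.cuda and tf.sysconfig.get_build_info() are standard attributes of recent Torch/TF releases, and the venv path is the one used in this thread):

. ~/ocrd-3.8/bin/activate
python3 -c "import torch; print('torch built for CUDA', torch.version.cuda)"
python3 -c "import tensorflow as tf; print('TF built for CUDA', tf.sysconfig.get_build_info().get('cuda_version'))"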

This has run successfully with one remark: Somewhere in between I have seen this error message:

...
Synchronizing submodule url for 'ocrd_fileformat/repo/ocr-fileformat/vendor/xsd-validator'
if git submodule status --recursive ocrd_fileformat | grep -qv '^ '; then \
        sem -q --will-cite --fg --id ocrd_all_git git submodule update --init --recursive  ocrd_fileformat && \
        touch ocrd_fileformat; fi
fatal: failed to recurse into submodule 'ocrd_fileformat'

Sounds like an issue with your checkout. It could be the ownership problem alluded to above, or some previous failure. Strange though that despite git's complaint, it does not exit with error and does in fact recurse...

make test-cuda CUDA_VERSION=11.8

That variable is only relevant at install time – it has no effect on the test itself.

2023-06-20 12:00:03.521351: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: /home/gputest/ocrd-3.8/sub-venv/headless-tf1/lib/python3.8/site-packages/tensorflow_core/python/../../nvidia/cublas/lib/libcublas.so.11: undefined symbol: cublasLt_for_cublas_HSS, version libcublasLt.so.11; LD_LIBRARY_PATH: /usr/local/cuda-11.6/lib64:

ok, so apparently somewhere in your environment you have a dynamic linker override variable LD_LIBRARY_PATH, which points to CUDA 11.6, so our 11.8 (installed by deps-cuda via ld.so.conf rules) has no chance. This was a deliberate choice BTW: we don't want to be intrusive (and setting LD_LIBRARY_PATH is as intrusive as it gets).

Could you please check where that variable is coming from? (I.e. bashrc or profile or venv..., and who installed that, e.g. CUDA installer script, or manual)
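
(A simple way to track that down, checking the usual suspects; some of these files may not exist on your system:)

echo "$LD_LIBRARY_PATH"
grep -n LD_LIBRARY_PATH ~/.bashrc ~/.profile ~/.bash_profile /etc/environment /etc/profile.d/*.sh 2>/dev/null
grep -n LD_LIBRARY_PATH ~/ocrd-3.8/bin/activate 2>/dev/null   # in case the venv itself sets it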

Many thanks @bertsky for your detailed comments.
And I am very sorry for any confusion I might have created by putting the wrong release name in the title. This was just a stupid copy-and-paste error.
Of course, I have used v2023-06-14.
I have updated the issue title and my first comment accordingly.

I will follow your advice and come back with further feedback ...

git submodule status creates this list:

 2c4b1ffc123e867cc5e5203970996bfb05075397 cor-asv-ann (v0.1.2-99-g2c4b1ff)
-076e04ef882bbed0b5b70e6a6a461940b82bb404 cor-asv-fst
 670862493408008441963a739ef650c6d3fa122d core (v2.33.0-790-g670862493)
 35be58cb9456b0893bc46640b234912148621fb6 dinglehopper (remotes/origin/HEAD)
 a7ffdda68a4c9c4e0b0494e7b0f865d92297ac30 docstruct (heads/master)
 706433c5049c63c6e16fee5f71d81a7e507abe06 eynollah (v0.2.0-7-g706433c)
 9615db1920cb8e15a38427333b41cdbee8baf4b6 format-converters (heads/master)
 cf7c60f898039d765984a7eb8704e7e0fbe6c88d nmalign (v0.0.3-7-gcf7c60f)
 5978a1fef1b5b863f71e0a9abd1ff8668876c661 ocrd_anybaseocr (v1.9.0-2-g5978a1f)
 3a029ca512cec911aa32f7156c831c0cca75543f ocrd_calamari (v1.0.5-11-g3a029ca)
 a0ea0a2a4aeea99414c08ae543585b994f9ab0d5 ocrd_cis (v0.0.10-149-ga0ea0a2)
 04bf4c6d325ca383671e463543ffe132f3b70f19 ocrd_detectron2 (v0.1.7-17-g04bf4c6)
 a95f8e77886c9860101392d088742ca0af277945 ocrd_doxa (v0.0.2)
 4e7e0de68e2a0dcd9b238f64d1657beda0d74da7 ocrd_fileformat (v0.5.0-15-g4e7e0de)
 105697f589839cc14d8a1e3be939598e2be1b06f ocrd_im6convert (v0.0.5)
 9e3f5a06b8efb706f8f1ac1c172fa5809ad6bab9 ocrd_keraslm (0.3.1-33-g9e3f5a0)
 b13dd8a932b7dfbfe5019698e87542f5f767e2bd ocrd_kraken (v0.3.0-21-gb13dd8a)
 0f64f07635875bc75a53365e425870858b0d388a ocrd_neat (v0.0.1)
-a6e556ec182bb18b755bfd818e7f72326b5819fa ocrd_ocropy
 6bcbb4bbb6847e581bdb84aa1c2c32b632d083f2 ocrd_olahd_client (v0.0.2)
 dbef5340432a0a138f6cd07e3e321a2fa5e658e2 ocrd_olena (v1.3.0)
 4f4a330c97208635e7b304cfce4db9e937fefd2b ocrd_pagetopdf (v1.0.0-12-g4f4a330)
-ead3fdd19c9dceb69499d8e2267e71b9cd3bcd2c ocrd_pc_segmentation
 c898d6ce2de46abc06d1f88b4b919b768d073c41 ocrd_repair_inconsistencies (heads/master)
 3c63e21b168b83bbb02caf4ce212db94447a5f4b ocrd_segment (v0.1.21-9-g3c63e21)
 09d1e13cdaf056c8542a7adbbc9b9927e2a54d2b ocrd_tesserocr (v0.2.2-454-g09d1e13)
 a78a85f57f27a28f01dd125e67d0e7676a1c7566 ocrd_typegroups_classifier (v0.5.0)
 2cd800d9eccbc084751558a87972ac22ee60e87a ocrd_wrap (v0.1.8)
-474a1cc0ebf2086c596b60c050a9e1af658ff380 opencv-python
 010ec99d2a666c363efb7e50c1eb2423857ff092 sbb_binarization (v0.1.0)
 1569e5080810f4652b720bcd344026a9b236ec50 tesseract (5.3.0-46-g1569e508)
 e184c62becd1c3c87c0546c9df506d639de8478d tesserocr (v2.1.2-127-ge184c62)
 5aff777c761cae1b6f9d954fb80f9b212e8fab92 workflow-configuration (remotes/origin/HEAD)

and find .git -user 0 is empty

gputest@linuxgputest2:~/ocrd_all$ find .git -user 0
gputest@linuxgputest2:~/ocrd_all$

Concerning LD_LIBRARY_PATH you are right - it points to version 11.6:

gputest@linuxgputest2:~/ocrd_all$ echo $LD_LIBRARY_PATH
/usr/local/cuda-11.6/lib64:
gputest@linuxgputest2:~/ocrd_all$

Yes, it is set in .bashrc:

gputest@linuxgputest2:~$ grep LD_LIBRARY_PATH .bashrc
export LD_LIBRARY_PATH=/usr/local/cuda-11.6/lib64:$LD_LIBRARY_PATH
gputest@linuxgputest2:~$

Well, of course I could change this, but I have no idea where the 11.8 version was installed to - see:

gputest@linuxgputest2:/usr/local$ ls -ld cuda*
lrwxrwxrwx  1 root root   22 Mar 29  2022 cuda -> /etc/alternatives/cuda
lrwxrwxrwx  1 root root   25 Mar 29  2022 cuda-11 -> /etc/alternatives/cuda-11
drwxr-xr-x 16 root root 4096 Mar 29  2022 cuda-11.6

In both folders, cuda and cuda-11, I can only find version 11.6, e.g.:

gputest@linuxgputest2:/etc/alternatives/cuda-11/lib64$ ll  libnppidei*
lrwxrwxrwx 1 root root       16 Mar  9  2022 libnppidei.so -> libnppidei.so.11
lrwxrwxrwx 1 root root       22 Mar  9  2022 libnppidei.so.11 -> libnppidei.so.11.6.3.9
-rw-r--r-- 1 root root  9659544 Mar  9  2022 libnppidei.so.11.6.3.9
-rw-r--r-- 1 root root 10209110 Mar  9  2022 libnppidei_static.a

A find also does not provide a hint where I can find version 11.8:

gputest@linuxgputest2:/$ find . -name "libnvjpeg.so.11.8*" 2>&1 | grep -v "Permission denied"
gputest@linuxgputest2:/$

--> So, please tell me where I can find version 11.8.

Anyway, I will re-install now, following your advice from above, @bertsky.

Hmm, not good ...
I simply did another git pull in the directory ~/ocrd_all, which somewhat surprisingly brought in a few new files.
("Surprisingly", because I had assumed that I would only get the data of the newest official release (here "v2023-06-14") and not any "random" new data - looks like my assumption is wrong?!)

remote: Enumerating objects: 33, done.
remote: Counting objects: 100% (33/33), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 33 (delta 17), reused 26 (delta 13), pack-reused 0
Unpacking objects: 100% (33/33), 21.29 KiB | 1.42 MiB/s, done.
From https://github.com/OCR-D/ocrd_all
   d8cdeec..8a68597  master     -> origin/master
Updating d8cdeec..8a68597
Fast-forward
 .github/workflows/makeall.yml |  5 ++---
 CHANGELOG.md                  |  9 +++++++++
 Dockerfile                    | 13 +++++++++++++
 Makefile                      | 11 ++++++++---
 4 files changed, 32 insertions(+), 6 deletions(-)

Then I just called make ocrd - and this produced the following error
(mainly I see at the beginning:
ERROR: Could not install packages due to an OSError: [Errno 13] Permission denied: '/home/gputest/ocrd_all/venv/lib/python3.8/site-packages/PIL' - but please also check the rest of the message):

...
Successfully built ocrd-utils atomicwrites
Installing collected packages: Pillow, numpy, frozendict, atomicwrites, ocrd-utils
ERROR: Could not install packages due to an OSError: [Errno 13] Permission denied: '/home/gputest/ocrd_all/venv/lib/python3.8/site-packages/PIL'
Check the permissions.

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Processing /home/gputest/ocrd_all/core/ocrd_models
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'error'
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [6 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/home/gputest/ocrd_all/core/ocrd_models/setup.py", line 4, in <module>
          from ocrd_utils import VERSION
      ModuleNotFoundError: No module named 'ocrd_utils'
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Processing /home/gputest/ocrd_all/core/ocrd_modelfactory
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'error'
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [6 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/home/gputest/ocrd_all/core/ocrd_modelfactory/setup.py", line 4, in <module>
          from ocrd_utils import VERSION
      ModuleNotFoundError: No module named 'ocrd_utils'
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Processing /home/gputest/ocrd_all/core/ocrd_validators
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'error'
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [6 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/home/gputest/ocrd_all/core/ocrd_validators/setup.py", line 4, in <module>
          from ocrd_utils import VERSION
      ModuleNotFoundError: No module named 'ocrd_utils'
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Processing /home/gputest/ocrd_all/core/ocrd_network
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'error'
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [6 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/home/gputest/ocrd_all/core/ocrd_network/setup.py", line 3, in <module>
          from ocrd_utils import VERSION
      ModuleNotFoundError: No module named 'ocrd_utils'
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Processing /home/gputest/ocrd_all/core/ocrd
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'error'
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [6 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/home/gputest/ocrd_all/core/ocrd/setup.py", line 3, in <module>
          from ocrd_utils import VERSION
      ModuleNotFoundError: No module named 'ocrd_utils'
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
make[1]: *** [Makefile:120: install] Error 1
make[1]: Leaving directory '/home/gputest/ocrd_all/core'
make: *** [Makefile:230: /home/gputest/ocrd_all/venv/bin/ocrd] Error 2

--> it looks like the PIL folder (or file?) simply does not exist here:

gputest@linuxgputest2:~/ocrd_all$ ls /home/gputest/ocrd_all/venv/lib/python3.8/site-packages/
_distutils_hack                            nvidia_cuda_nvrtc_cu117-11.7.50.dist-info     nvidia_curand_cu11-2022.4.8.dist-info      nvidia_pyindex                  setuptools
distutils-precedence.pth                   nvidia_cuda_runtime_cu11-2022.4.25.dist-info  nvidia_curand_cu117-10.2.10.50.dist-info   nvidia_pyindex-1.0.9.dist-info  setuptools-68.0.0.dist-info
nvidia                                     nvidia_cuda_runtime_cu117-11.7.60.dist-info   nvidia_cusolver_cu11-2022.4.8.dist-info    pip                             wheel
nvidia_cublas_cu11-2022.4.8.dist-info      nvidia_cudnn_cu11-8.6.0.163.dist-info         nvidia_cusolver_cu117-11.3.5.50.dist-info  pip-23.1.2.dist-info            wheel-0.40.0.dist-info
nvidia_cublas_cu117-11.10.1.25.dist-info   nvidia_cufft_cu11-2022.4.8.dist-info          nvidia_cusparse_cu11-2022.4.8.dist-info    pkg_resources
nvidia_cuda_nvrtc_cu11-2022.4.8.dist-info  nvidia_cufft_cu117-10.7.2.50.dist-info        nvidia_cusparse_cu117-11.7.3.50.dist-info  pkg_resources-0.0.0.dist-info

--> Any recommendation?

Concerning LD_LIBRARY_PATH you are right - it points to version 11.6:
Well, of course I could change this, but I have no idea where the 11.8 version was installed to - see:

Have a look at the deps-cuda target in core/Makefile: it will

  • create a /conda with nvcc
  • make this available system-wide via /etc/profile.d (i.e. login shells)
  • install CUDA runtime libraries (including cuDNN) via nvidia-pyindex into ocrd_all's venv
  • reuse these paths system-wide via ld.so.conf

So all you need to do AFAICS (while using ocrd_all) is to suppress your LD_LIBRARY_PATH envvar (either by setting it to empty in your current shell or by commenting out the setting in .bashrc).
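
For example (a hypothetical session; the ldconfig check only confirms which libcublas the dynamic linker would resolve once the override is gone):

unset LD_LIBRARY_PATH          # current shell only
# or: comment out the export line in ~/.bashrc and log in again
ldconfig -p | grep libcublas   # should now list the 11.8 copies made visible via ld.so.conf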

Another git pull in the directory ~/ocrd_all, which somewhat surprisingly brought in a few new files.
("Surprisingly", because I had assumed that I would only get the data of the newest official release (here "v2023-06-14") and not any "random" new data - looks like my assumption is wrong?!)

There have been merges with additional improvements, but no new release yet (which I guess is a normal dev cycle, so I'm surprised you're surprised...).

Permission denied: '/home/gputest/ocrd_all/venv/lib/python3.8/site-packages/PIL'

That was what I had suspected. It is only strange that the find -user 0 did not catch it.

So please sudo chown -R uid:gid ~/ocrd_all to fix what went wrong last time.
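
For example, with uid:gid taken from your own account (just one way to spell it):

sudo chown -R "$(id -un):$(id -gn)" ~/ocrd_all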

(There's no need to re-do sudo make deps-cuda BTW.)

but please check also the rest of the message
any recommendation?

Looks like follow-up errors. To be on the safe side, do a make clean before the next make all.
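
I.e. roughly (keeping the CUDA_VERSION override used above):

cd ~/ocrd_all
make clean
make all CUDA_VERSION=11.8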

Looks like I made it :-)
make test-cuda indicates that everything is fine.
So, many thanks @bertsky for your support. I will close this issue now.

Splendid. So to recap:

  • after updating (git pull), do a make modules before any sudo action
  • if your environment installs CUDA via the dynamic-linker override LD_LIBRARY_PATH, that variable needs to be suppressed

BTW, to be on the safe side, consider running make test-workflow (i.e. coverage test) afterwards.

Concerning "recap":

  • I did a make ocrd before sudo make deps-cuda
  • suppressed LD_LIBRARY_PATH - correct

Concerning make test-workflow: I have my own test workflow, which runs fine (of course this only tests the basic modules I use).
Now I have called make test-workflow and got this error:

2023-06-23 11:27:29.540 INFO ocrd.cli.resmgr - Use in parameters as 'default-2021-03-09'
+ ocrd-sbb-binarize -I OCR-D-IMG -O OCR-D-BIN -P model default-2021-03-09
2023-06-23 11:27:38.592 INFO processor.SbbBinarize - INPUT FILE 0 / PHYS_0001
2023-06-23 11:27:39.011 INFO processor.SbbBinarize - Binarizing on 'page' level in page 'PHYS_0001'
2023-06-23 11:27:39.052 INFO processor.SbbBinarize.__init__ - Predicting with model /home/gputest/.local/share/ocrd-resources/ocrd-sbb-binarize/default-2021-03-09/saved_model_2021_03_09/ [1/1]
2023-06-23 11:27:40.975 ERROR ocrd.processor.helpers.run_processor - Failure in processor 'ocrd-sbb-binarize'
Traceback (most recent call last):
  File "/home/gputest/ocrd-3.8/lib/python3.8/site-packages/ocrd/processor/helpers.py", line 128, in run_processor
    processor.process()
  File "/home/gputest/ocrd-3.8/lib/python3.8/site-packages/sbb_binarize/ocrd_cli.py", line 113, in process
    bin_image = cv2pil(self.binarizer.run(image=pil2cv(page_image)))
  File "/home/gputest/ocrd-3.8/lib/python3.8/site-packages/sbb_binarize/sbb_binarize.py", line 244, in run
    res = self.predict(model, image)
  File "/home/gputest/ocrd-3.8/lib/python3.8/site-packages/sbb_binarize/sbb_binarize.py", line 157, in predict
    label_p_pred = model.predict(img_patch.reshape(1, img_patch.shape[0], img_patch.shape[1], img_patch.shape[2]),
  File "/home/gputest/ocrd-3.8/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/gputest/ocrd-3.8/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.UnimplementedError: Graph execution error:

Detected at node 'model_2/conv1/Conv2D' defined at (most recent call last):
    File "/home/gputest/ocrd-3.8/bin/ocrd-sbb-binarize", line 8, in <module>
      sys.exit(cli())

Should I create another issue for this? (or re-open this one?)

Concerning "recap":

* I did a `make ocrd` before `sudo make deps-cuda`

yes, that's sufficient (for that task). In general, make modules will ensure all updates are done. (And this is needed for sudo make deps-ubuntu, since that will also depend on all modules.)

Should I create another issue for this? (or re-open this one?)

Yes, please do. (This is new.)

https://github.com/qurator-spk/sbb_binarization/ would be the best fit IMO.

Please also explain what version of the model you have installed (e.g. find /home/gputest/.local/share/ocrd-resources/ocrd-sbb-binarize/default-2021-03-09/saved_model_2021_03_09/ -exec md5sum {} \;)

@bertsky:
On another machine I have tried to do the same fresh installation of ocrd_all.
This time I called make modules (instead of make ocrd) before the sudo step.
This make modules produced the following error:

Submodule path 'ocrd_cis': checked out 'a0ea0a2a4aeea99414c08ae543585b994f9ab0d5'
From https://github.com/cisocrgroup/ocrd_cis
 * branch            a0ea0a2a4aeea99414c08ae543585b994f9ab0d5 -> FETCH_HEAD
sem -q --will-cite --fg --id ocrd_all_git git submodule sync  ocrd_detectron2
Synchronizing submodule url for 'ocrd_detectron2'
if git submodule status  ocrd_detectron2 | grep -qv '^ '; then \
        sem -q --will-cite --fg --id ocrd_all_git git submodule update --init   ocrd_detectron2 && \
        touch ocrd_detectron2; fi
error: Your local changes to the following files would be overwritten by checkout:
        ocrd_detectron2/segment.py
Please commit your changes or stash them before you switch branches.
Aborting
fatal: Unable to checkout '04bf4c6d325ca383671e463543ffe132f3b70f19' in submodule path 'ocrd_detectron2'
make: *** [Makefile:189: ocrd_detectron2] Error 1

--> Should I try make ocrd instead? (Or maybe you want to investigate this?)

error: Your local changes to the following files would be overwritten by checkout:
ocrd_detectron2/segment.py
Please commit your changes or stash them before you switch branches.
Aborting

Looks like you instrumented the code...

Yes, you are right.
git -C ocrd_detectron2 reset --hard has helped.
(I only got this error again: #381, which I have ignored.)
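
(For future updates, a quick way to spot submodules with local changes before running make modules; these are standard git commands:)

git submodule status | grep '^+'            # '+' marks a submodule checked out at a commit other than the recorded one
git submodule foreach 'git status --short'  # lists uncommitted local changes per submodule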