matifali/dockerdl

FROM matifali/dockerdl-base:11.7.1 still picks image with CUDA 12.1.1

realalexgalenko opened this issue · 5 comments

Describe the bug

The TF image is still using CUDA 12.1.1. It looks like this line did not do what was expected:

FROM matifali/dockerdl-base:11.7.1

I also tried 11.8.0, per your earlier commit. Maybe this line forces the use of 12.1.1:

matrix.CUDA_VER == '12.1.1'?

in docker-publish-base.yml

To Reproduce
Pull the latest tf image and bash into it. The startup banner still says CUDA 12.1.1 is being used:

==========
== CUDA ==
==========

CUDA Version 12.1.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Expected behavior
CUDA Version 11.7.1 is used
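
For anyone who wants to script this check instead of reading the banner by eye, here is a minimal sketch. It assumes the banner format shown above ("CUDA Version X.Y.Z" on its own line); the helper names `banner_cuda_version` and `tag_matches_banner` are made up for illustration and are not part of this repo.

```python
import re

# Matches the version line of the startup banner, e.g. "CUDA Version 12.1.1".
# The nvidia-smi header ("CUDA Version: 12.1") has a colon and is not matched.
BANNER_RE = re.compile(r"^CUDA Version (\d+\.\d+\.\d+)\s*$", re.MULTILINE)


def banner_cuda_version(output: str):
    """Return the CUDA version from the startup banner, or None if absent."""
    match = BANNER_RE.search(output)
    return match.group(1) if match else None


def tag_matches_banner(image_tag: str, output: str) -> bool:
    """Compare the version part of an image tag with the banner version."""
    expected = image_tag.rsplit(":", 1)[-1]
    return banner_cuda_version(output) == expected


# Banner text copied from this report.
banner = """
==========
== CUDA ==
==========

CUDA Version 12.1.1
"""

print(banner_cuda_version(banner))  # 12.1.1
print(tag_matches_banner("matifali/dockerdl-base:11.7.1", banner))  # False
```

The container's captured stdout could be passed in as `output`; a `False` result for the `11.7.1` tag is exactly the mismatch described here.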

Can you try again after you do

docker pull matifali/dockerdl:tf

Here is the full output:

(base) jupyter@testing:~/tf-docker-image$ docker pull matifali/dockerdl:tf
tf: Pulling from matifali/dockerdl
Digest: sha256:db9214408f8752c6b919f777ef8abb5d42a2af41c21b9d7425f7df9086e30fc6
Status: Image is up to date for matifali/dockerdl:tf
docker.io/matifali/dockerdl:tf
(base) jupyter@testing:~/tf-docker-image$ sudo docker run --rm --runtime=nvidia --gpus all matifali/dockerdl:tf python3 -c 'import tensorflow as tf; devices = [d.device_type for d in tf.config.list_physical_devices()]; print("Available devices: %s", devices)'

==========
== CUDA ==
==========

CUDA Version 12.1.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

2023-07-14 21:57:04.212337: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-07-14 21:57:04.260491: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-07-14 21:57:04.261116: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-14 21:57:05.372528: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-07-14 21:57:08.703336: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-14 21:57:08.735702: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1960] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
Available devices: %s ['CPU']
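
Side note on the literal `%s` in the last line: it is unrelated to the CUDA problem. `print()` does not interpolate its arguments, it just prints them separated by spaces. A small sketch of the difference, where `format_devices` is a hypothetical helper used only for illustration:

```python
def format_devices(devices):
    """Build the message with the device list interpolated into the text."""
    return f"Available devices: {devices}"


devices = ["CPU"]  # value taken from the output above

# As in the command above: two separate arguments, so "%s" is printed verbatim.
print("Available devices: %s", devices)  # Available devices: %s ['CPU']

# With an f-string the list is substituted into the message.
print(format_devices(devices))  # Available devices: ['CPU']
```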

Here is a small image built from the base (FROM matifali/dockerdl-base:11.7.1), showing that it is not using the proper CUDA version:

(base) jupyter@testing:~/tf-docker-image$ sudo docker build --no-cache --progress=plain -t my-tf-image .
Sending build context to Docker daemon  3.584kB
Step 1/1 : FROM matifali/dockerdl-base:11.7.1
 ---> 3c5700350377
Successfully built 3c5700350377
Successfully tagged my-tf-image:latest
(base) jupyter@testing:~/tf-docker-image$ sudo docker run --rm --runtime=nvidia --gpus all my-tf-image nvidia-smi

==========
== CUDA ==
==========

CUDA Version 12.1.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Fri Jul 14 22:03:00 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
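
A check that does not depend on the banner or on `nvidia-smi` is to look at the image's environment. `docker image inspect --format '{{json .Config.Env}}' matifali/dockerdl-base:11.7.1` prints the environment as a JSON list of `KEY=VALUE` strings. The sketch below parses that output; it assumes the image sets `CUDA_VERSION` the way NVIDIA's CUDA base images do (not verified for this image), and the sample string is a hypothetical stand-in, not real inspect output.

```python
import json


def cuda_version_from_env(env_json: str):
    """Return the CUDA_VERSION value from a JSON list of KEY=VALUE strings."""
    for entry in json.loads(env_json):
        key, _, value = entry.partition("=")
        if key == "CUDA_VERSION":
            return value
    return None  # variable not set in this image


# Hypothetical sample of what the inspect command might print.
sample = '["PATH=/usr/local/cuda/bin:/usr/bin", "CUDA_VERSION=12.1.1"]'

print(cuda_version_from_env(sample))  # 12.1.1
```

If the `11.7.1` and `12.1.1` tags report the same value here (or the same image ID from `docker image inspect --format '{{.Id}}'`), that would suggest both tags were published from the same build.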

Hi @realalexgalenko, thanks for your patience. You should be able to use the GPU now when you do

docker pull matifali/dockerdl:tf
docker run --rm --gpus all matifali/dockerdl:tf python3 -c 'import tensorflow as tf; devices = [d.device_type for d in tf.config.list_physical_devices()]; print(f"Available devices: {devices}")'

Yep. Worked as expected. Thank you!