canonical/notebook-operators

Tensorflow-cuda notebook not being able to use CUDA libraries

Closed this issue · 2 comments

Bug Description

The tensorflow-cuda notebook is not able to use the CUDA libraries,
even though it is able to detect the GPU (see nvidia-smi output below).

The pytorch-cuda notebook, by contrast, is able to use the CUDA libraries.
See output below.

To Reproduce

Create a notebook with the tensorflow-cuda image and assign GPUs to it (issue)
To compare, create a notebook with the pytorch-cuda image and assign GPUs to it (works)
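
The in-notebook checks used in both cases are the ones shown in the log output further down; a minimal version of each (assuming the default Python kernel of the respective notebook) is:

# In the tensorflow-cuda notebook: TensorFlow should list at least one GPU.
import tensorflow as tf
print("Num GPUs Available:", len(tf.config.list_physical_devices('GPU')))

# In the pytorch-cuda notebook: PyTorch should report CUDA as available.
import torch
print(torch.cuda.is_available())
print(torch.cuda.device_count())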

Environment

Microk8s
Kubeflow 1.7/stable
AirGapped

NVIDIA GPU Operator
NVIDIA MIG slicing an A30 GPU into 4 instances

Notebook images

  • charmedkubeflow/jupyter-tensorflow-cuda-full:v1.7.0-20.04-1_ad4e002
  • charmedkubeflow/jupyter-pytorch-cuda-full:v1.7.0-20.04-1_ad4e002

Relevant Log Output

# =============== Tensorflow CUDA notebook / Terminal
(base) bash-5.0$ nvidia-smi
Fri Apr 12 23:14:42 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A30                     Off | 00000000:9B:00.0 Off |                   On |
| N/A   44C    P0              56W / 165W |                  N/A |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    3   0   0  |              12MiB /  5952MiB  | 14      0 |  1   0    1    0    0 |
|                  |               0MiB /  8191MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+



# =============== Tensorflow CUDA notebook / Jupyter Notebook
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))  

2024-04-12 23:15:32.768124: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-04-12 23:15:32.773788: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2024-04-12 23:15:32.773811: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Num GPUs Available:  0
2024-04-12 23:15:35.005762: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2024-04-12 23:15:35.005840: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory
2024-04-12 23:15:35.005884: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory
2024-04-12 23:15:35.005931: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory
2024-04-12 23:15:35.005974: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcurand.so.10'; dlerror: libcurand.so.10: cannot open shared object file: No such file or directory
2024-04-12 23:15:35.006017: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory
2024-04-12 23:15:35.006056: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory
2024-04-12 23:15:35.006098: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
2024-04-12 23:15:35.006107: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
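
The dlerror warnings above indicate that the CUDA runtime libraries cannot be resolved inside the tensorflow-cuda image. A quick diagnostic sketch (not part of the original report; the library names are copied from the warnings above) to confirm what the image can actually load:

# Run inside the tensorflow-cuda notebook to check whether the CUDA libraries
# that TensorFlow complains about are resolvable from the container at all.
import ctypes
import os

print("LD_LIBRARY_PATH:", os.environ.get("LD_LIBRARY_PATH"))

for lib in ["libcudart.so.11.0", "libcublas.so.11", "libcudnn.so.8"]:
    try:
        ctypes.CDLL(lib)
        print(lib, "loaded OK")
    except OSError as err:
        print(lib, "NOT found:", err)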

Additional Context

The pytorch-cuda notebook is able to use the CUDA libraries.

# =============== Pytorch CUDA notebook / Terminal
(base) bash-5.0$ nvidia-smi
Fri Apr 12 23:19:52 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A30                     Off | 00000000:9B:00.0 Off |                   On |
| N/A   43C    P0              56W / 165W |                  N/A |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    4   0   0  |              15MiB /  5952MiB  | 14      0 |  1   0    1    0    0 |
|                  |               0MiB /  8191MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+


# =============== Pytorch CUDA notebook / Jupyter Notebook
import torch
print(torch.cuda.is_available())
print(torch.cuda.device_count())
print(torch.cuda.device(0))
print(torch.cuda.get_device_name(0))

True
1
<torch.cuda.device object at 0x7f45399ada00>
NVIDIA A30 MIG 1g.6gb
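
For comparison between the two images, it can also help to print which CUDA toolkit and cuDNN versions each framework was built against; the snippet below is an illustrative check for the pytorch-cuda notebook and is not output captured in the original report:

# In the pytorch-cuda notebook: versions PyTorch was built against.
import torch
print(torch.version.cuda)               # CUDA toolkit version bundled with this PyTorch build
print(torch.backends.cudnn.version())   # cuDNN version PyTorch links against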

Thank you for reporting your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5570.

This message was autogenerated

The image in the above issue is a rock, which we no longer support, since we can't bundle CUDA in our OCI images.

We haven't managed to reproduce this issue on the latest 1.9 with the upstream notebook images, though, so I'll close this one. @gustavosr98 @natalytvinova

But if you bump into this again, please open a new one with:

  1. the exact notebook image you are using
  2. the nodes and the GPU Operator version they are using
  3. the GPU model