DDMAL/Rodan

Rodan production hardware problem

Related issue and description in #1227.

The good news is that we probably don't need to upgrade our TensorFlow version.

I tested a few configurations and I think the problem is the generic kernel on Ubuntu. As I recalled, Docker DNS resolution does not work the way we need with the kvm (compact) kernel on Ubuntu 20.04 on the default vGPU instance, and since our old disk is running Ubuntu, we had to manually install and force the generic kernel. Unfortunately, the generic kernel does not work with the current vGPU driver, which might be designed for the kvm (compact) kernel.
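
A quick way to confirm which kernel flavor a host is actually running while we switch back and forth (a trivial sketch; run it on the host itself, not inside the container; Ubuntu encodes the flavor as the -generic / -kvm suffix of the release string):

import platform

# Equivalent of uname -r, e.g. "5.4.0-200-generic" vs "5.4.0-1040-kvm"
release = platform.release()
print("Running kernel:", release)
print("Kernel flavor:", release.rsplit("-", 1)[-1])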

On Ubuntu 20.04 with the kvm kernel, we pulled the latest GPU Docker image, ran the container, and got a shell inside:

>>> import tensorflow as tf
2024-11-25 12:44:57.484456: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
>>> tf.config.list_physical_devices('GPU')
2024-11-25 12:45:04.234789: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2024-11-25 12:45:04.251141: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-11-25 12:45:04.251784: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:00:05.0 name: GRID V100D-8C computeCapability: 7.0
coreClock: 1.38GHz coreCount: 80 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 836.37GiB/s
2024-11-25 12:45:04.251837: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2024-11-25 12:45:04.260772: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2024-11-25 12:45:04.260853: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2024-11-25 12:45:04.263562: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2024-11-25 12:45:04.264215: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2024-11-25 12:45:04.266341: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2024-11-25 12:45:04.268310: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2024-11-25 12:45:04.268689: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2024-11-25 12:45:04.268889: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-11-25 12:45:04.269533: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-11-25 12:45:04.270033: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

This is working as expected.
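
For repeat checks on whichever server we end up with, here is the same test as a small non-interactive sketch (it only uses the tf.config call from the session above; the script name and exit-code convention are my own):

# check_gpu.py: print the visible GPUs and exit non-zero if there are none,
# so it can be dropped into a deployment or health-check script.
import sys
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
print("Visible GPUs:", gpus)
sys.exit(0 if gpus else 1)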

Plan:

  1. I will try to force a DKMS rebuild of the driver on the current production server and see whether that solves the problem.
  2. I recommend going back to Debian 12 as the OS, since we tested the complete setup with two instances over the summer.
  3. We need to find a way to migrate our data efficiently. We might keep the current disk as an extra server used only for data storage. As I recall, on Debian 12 a Docker volume can be backed by a mount from another server, so this is quite flexible; see the sketch after this list.
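
As a sketch of point 3, this is roughly what an NFS-backed volume looks like through the Docker SDK for Python (the server name, export path, and volume name are placeholders, and it assumes the storage server exposes the data over NFS; docker volume create --driver local with the same options does the same thing from the CLI):

import docker

client = docker.from_env()

# Volume whose data lives on an NFS export on the storage server.
# "storage.example.org" and "/export/rodan-data" are placeholders.
volume = client.volumes.create(
    name="rodan_user_data",
    driver="local",
    driver_opts={
        "type": "nfs",
        "o": "addr=storage.example.org,rw,nfsvers=4",
        "device": ":/export/rodan-data",
    },
)
print("Created volume:", volume.name)

The same driver and driver_opts pair can be declared under the volumes section of a compose file, which is probably where it would actually live for Rodan.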

Warning: I will test each step separately before deploying the whole thing, so Rodan prod will keep running in the meantime without vGPU computing (but with all other jobs). Fully deploying the new server will then take around a week or two, probably during the end-of-year break or in January.