Incompatibility between OpenFL and Nvidia Edge devices with L4T image
Closed this issue · 4 comments
Issue Summary:
I am currently working on federated learning tasks using OpenFL on NVIDIA Jetson devices. However, I am facing an issue where model training does not utilize the GPU, despite using a version of TensorFlow that is compatible with OpenFL. Specifically, after downgrading TensorFlow to an OpenFL-compatible version, training still does not use the GPU resources.
The root cause seems to be related to a mismatch between the versions of TensorFlow and CUDA. TensorFlow relies on specific versions of CUDA and cuDNN to enable GPU acceleration during training. When these versions are mismatched or incompatible, TensorFlow will not be able to use the GPU, even if a compatible version of TensorFlow is installed.
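For reference, a minimal way to compare the CUDA/cuDNN versions a TensorFlow wheel was built against with what the container actually provides is tf.sysconfig.get_build_info(); this is just a sketch assuming a TensorFlow 2.x install (the CUDA keys may be absent on CPU-only builds):
import tensorflow as tf

build = tf.sysconfig.get_build_info()  # dict-like; CUDA/cuDNN keys present on GPU builds
print("TF version:         ", tf.__version__)
print("Built against CUDA: ", build.get("cuda_version"))
print("Built against cuDNN:", build.get("cudnn_version"))
print("GPUs visible to TF: ", tf.config.list_physical_devices("GPU"))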
The challenge I am facing is that the Jetson devices are running an L4T-based Docker image, which comes with a pre-installed version of CUDA. This version is tightly integrated with the operating system and the NVIDIA hardware. Downgrading or changing the CUDA version is not a viable solution, as it could break compatibility with the existing system and cause instability. The pre-configured L4T container and its version of CUDA cannot be modified easily, making it difficult to align the required versions of TensorFlow and CUDA.
I am seeking advice or potential solutions to resolve this issue without the need to downgrade or modify the CUDA version on the Jetson devices.
The CUDA version in the container:
root@ubuntu:/app# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:08:11_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
The TensorFlow version pre-installed in the L4T container:
tensorflow 2.14.0+nv23.11
With "tensorflow 2.14.0+nv23.11", running OpenFL produces the following error:
Segmentation fault (core dumped)
So I downgraded TensorFlow to 2.13.0, and with that version I was able to start training. However, the training does not run on the GPU, most likely due to an incompatibility between TensorFlow and CUDA.
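A small placement check like the sketch below (a toy matmul, not my real training code) is what I used to confirm whether ops actually land on the GPU after the downgrade:
import tensorflow as tf

tf.debugging.set_log_device_placement(True)  # print the device chosen for each op
print("GPUs:", tf.config.list_physical_devices("GPU"))

# Toy matmul: the placement log should mention /device:GPU:0 if CUDA is usable
a = tf.random.normal([1024, 1024])
b = tf.random.normal([1024, 1024])
print(tf.matmul(a, b).shape)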
Hi @HubgitCCL, in an attempt to isolate the issue, have you tried running your training script with tensorflow 2.14.0+nv23.11 independently of OpenFL (on a single machine, using a single dataset shard)?
Hi @teoparvanov, thank you for your advice. I tried running my OpenFL code with TensorFlow 2.14 and it succeeded,
then I tried to run the following command:
import tensorflow as tf; print('GPU Available: ', tf.config.list_physical_devices('GPU'))
in the container on the device, and got output like:
double free or corruption (out)
which indicates that the incompatibility has nothing to do with OpenFL.
Thanks for your help.
Hi @teoparvanov, I figured out that some of OpenFL's dependencies were causing the incompatibility issues. To resolve this, I ran the following commands:
pip install --no-deps openfl
pip install --no-deps click
pip install --no-deps rich
pip install --no-deps dynaconf
pip install --no-deps tqdm
pip install --no-deps tensorboardx
After doing this, I was able to train using the GPU. Thanks so much for your kind assistance, @teoparvanov!
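For anyone hitting the same problem, a rough post-install sanity check could look like the snippet below (just a sketch; the package names are the ones from the pip commands above):
from importlib.metadata import version, PackageNotFoundError  # Python 3.8+

# Confirm the --no-deps installs actually resolved
for pkg in ("openfl", "click", "rich", "dynaconf", "tqdm", "tensorboardX"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "NOT INSTALLED")

# Confirm TensorFlow still sees the GPU
import tensorflow as tf
print("GPUs visible to TF:", tf.config.list_physical_devices("GPU"))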
Thanks for the update, @HubgitCCL, I'm glad that you were able to run your experiment with GPU acceleration. From our side, we have started a comprehensive effort to update the TensorFlow-based task runner dependencies. This will take some time, but I expect us to start making tangible progress over the upcoming 1.7 and 1.8 releases of OpenFL.
CC: @tanwarsh @kta-intel