ctlearn-project/ctlearn

libcublas.so ImportError on a Linux GPU environment

Closed this issue · 5 comments

riwim commented

Hi all, I'm currently working on my master's thesis on gamma/hadron separation with multiple images from CTA.

Following the install instructions for a GPU environment on a Linux system, I ran into some problems getting a working setup.

python run_model.py my_config.yml --mode train

fails with an import error:

ImportError: libcublas.so.10.0: cannot open shared object file: No such file or directory.

Inspired by tensorflow/tensorflow#26182 (comment), I changed

# conda env create -f environment-gpu.yml
name: ctlearn
channels:
    - anaconda
dependencies:
    - python=3.7.3
    - matplotlib
    - numpy
    - pandas
    - pip
    - pyyaml
    - scikit-learn
    - pip:
        # TensorFlow-GPU v1.13.1
        - https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.13.1-cp37-cp37m-linux_x86_64.whl

in environment-gpu.yml to

# conda env create -f environment-gpu.yml
name: ctlearn
channels:
    - anaconda
dependencies:
    - python=3.7.3
    - matplotlib
    - numpy
    - pandas
    - pip
    - pyyaml
    - scikit-learn
    - tensorflow
    - cudatoolkit
    - cudnn
    # - pip:
        # TensorFlow-GPU v1.13.1
        # - https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.13.1-cp37-cp37m-linux_x86_64.whl

With these changes I can now execute run_model.py.
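For reference, here is a version-pinned variant of the same file, so that conda does not silently resolve a different TensorFlow/CUDA combination than the one CTLearn was benchmarked with. The exact pins below are assumptions and should be adjusted to whatever the anaconda channel can actually resolve on your system:

# conda env create -f environment-gpu.yml
name: ctlearn
channels:
    - anaconda
dependencies:
    - python=3.7.3
    - matplotlib
    - numpy
    - pandas
    - pip
    - pyyaml
    - scikit-learn
    - tensorflow-gpu=1.13.1  # conda build, bundles matching CUDA libraries
    - cudatoolkit=10.0
    - cudnn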

Hi @riwim. Thanks for opening this issue. What you report is a known problem with TF 1.13.1, as you noticed. A way to fix it without upgrading your TF version is to downgrade CUDA from 10.1 to 10.0 on your machine. As of now, we have benchmarked CTLearn with TF 1.13.1 (i.e. we are sure the code behaves as expected with that version), which is why we recommend that users run with that specific version. We'll hopefully be benchmarking with TF 1.14 soon, so we can safely upgrade TF and get rid of this inconvenient incompatibility with CUDA 10.1.

riwim commented

Ok, I see.

I'm trying to get the driver situation straight:
According to

nvidia-smi
Tue Jul  9 16:36:10 2019       
NVIDIA-SMI 396.82                 Driver Version: 396.82 

our Linux x86_64 Tesla system runs NVIDIA driver version 396.82 (I can't change this). Table 1 (CUDA Toolkit and Compatible Driver Versions) tells me that CUDA 9.2 (9.2.88) is the last version compatible with Linux x86_64 driver versions >= 396.26. Table 3 (CUDA Application Compatibility Support Matrix) tells me that, because of the CUDA compatibility platform for Tesla systems, 396.26+ driver versions are also compatible with CUDA 10.1, but not with CUDA 10.0.
So because of the TF 1.13.1 ↮ CUDA 10.1 and NVIDIA 396.82 ↮ CUDA 10.0 incompatibilities, I can't use your specific configuration right now.
Is this true? Do you see any workaround?

Assuming upgrading the NVIDIA driver is off the table, an alternative I can see is to install CTLearn 0.3.0 and run some benchmarking using TF 1.14 instead of TF 1.13. If there were no breaking changes to the API in TF 1.14, you should be able to train the benchmark models and compare their performance to see if it matches the 0.3.0 benchmarks run with TF 1.13. I should warn you that there's some significant refactoring going on in CTLearn/DL1-Data-Handler these days, so you may want to wait for new releases of both packages, depending on the scope and timeline of your project. Feel free to contact me by email if you want to discuss those details in private.
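For anyone trying that route, here is a minimal sketch of what a TF 1.14 environment file could look like. The package names and pins are assumptions (in particular, whether the anaconda channel provides a tensorflow-gpu 1.14 build that matches your driver/CUDA constraints needs to be checked):

# conda env create -f environment-gpu-tf114.yml (hypothetical file name)
name: ctlearn-tf114
channels:
    - anaconda
dependencies:
    - python=3.7
    - matplotlib
    - numpy
    - pandas
    - pip
    - pyyaml
    - scikit-learn
    - tensorflow-gpu=1.14  # assumed to be available from the channel
    - cudatoolkit          # let conda pick the build TF 1.14 was compiled against
    - cudnn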

Hey,
I also faced the same issue with the libcublas file while trying to get familiar with CTLearn for GSoC 2020. After upgrading to TensorFlow 1.14, running a model gives the error
module 'dl1_data_handler.transforms' has no attribute 'ConvertShowerPrimaryIDToClassLabel'
Any idea?

Hi @sahilyadav27. We recently renamed the transformation in dl1-data-handler. Depending on the versions, you should use either ShowerPrimaryIDToParticleType or ConvertShowerPrimaryIDToClassLabel in the config file.
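For reference, a sketch of how the transform is typically listed in the config file. The surrounding keys are placeholders taken from a generic CTLearn config and may differ between versions; only the transform name itself is what needs to change:

# excerpt from my_config.yml (section layout may vary between CTLearn versions)
Data:
    transforms:
        # newer dl1-data-handler releases:
        - name: 'ShowerPrimaryIDToParticleType'
        # older releases instead use:
        # - name: 'ConvertShowerPrimaryIDToClassLabel'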