Core dump when CUDA_VISIBLE_DEVICES set to 0.
jrounds opened this issue · 5 comments
Hi,
Host is redhat os with two nvidia GPUs.
I built an environment like this, and everything works (CPU only):
conda create -y --name py38_drjit_test python==3.8
conda activate py38_drjit_test
python3 -m pip install --upgrade pip
python3 -m pip install drjit
python3 -c "import drjit; print(drjit.__version__)" #outputs 0.4.3
Then I added TensorFlow/CUDA according to the published instructions:
# Installing tensorflow based on latest instructions
conda install -y -c conda-forge cudatoolkit=11.8.0
python3 -m pip install nvidia-cudnn-cu11==8.6.0.163 tensorflow==2.13.0
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
echo 'export LD_LIBRARY_PATH=$CUDNN_PATH/lib:$CONDA_PREFIX/lib/:$LD_LIBRARY_PATH' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
Still good; this all works as expected:
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))" # two devices
python3 -c "import drjit; print(drjit.__version__)" #outputs 0.4.3
Then I started experimenting with CUDA_VISIBLE_DEVICES (I actually started with the last case below, but am showing the ones that work first).
This works (both GPUs):
export CUDA_VISIBLE_DEVICES=0,1
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))" # two devices
python3 -c "import drjit; print(drjit.__version__)" #outputs 0.4.3
This works (last GPU):
export CUDA_VISIBLE_DEVICES=1
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))" # one device
python3 -c "import drjit; print(drjit.__version__)"
This core dumps (repeatable in any ordering of these combinations):
export CUDA_VISIBLE_DEVICES=0
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))" # one device
python3 -c "import drjit; print(drjit.__version__)"
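The three cases above can be condensed into one loop that records which device masks crash the import (a hedged sketch; it assumes drjit is installed, and the `|| echo` keeps the loop going past a failing case):

```shell
# Try each CUDA_VISIBLE_DEVICES variant and report whether
# importing drjit survives under that mask.
for devs in "0,1" "1" "0"; do
  echo "CUDA_VISIBLE_DEVICES=$devs"
  CUDA_VISIBLE_DEVICES=$devs python3 -c "import drjit; print(drjit.__version__)" \
    || echo "import failed (exit status $?)"
done
```

A segfault shows up here as a non-zero exit status (139 for SIGSEGV on Linux) rather than a silent core dump.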
Actual output
(py38_drjit_test) [host]$ export CUDA_VISIBLE_DEVICES=0
(py38_drjit_test) [host]$ python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))" # one device
2023-08-30 11:00:34.838109: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-08-30 11:00:34.880618: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-30 11:00:35.631433: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
(py38_drjit_test) [host]$ python3 -c "import drjit; print(drjit.__version__)"
Segmentation fault (core dumped)
So setting CUDA_VISIBLE_DEVICES to the first of the two GPUs results in a core dump on import of drjit, but setting it to the last one does not?
Any advice on what to consider to work through this?
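One way to narrow this down (a sketch I'm suggesting, not something from the thread) is to query the CUDA driver directly via `ctypes`, bypassing both TensorFlow and drjit, to confirm the driver itself can enumerate GPU 0 under the mask:

```python
import ctypes

def cuda_device_count():
    """Ask the CUDA driver API how many devices are visible under the
    current CUDA_VISIBLE_DEVICES mask. Returns None if the driver is
    unavailable, so the check degrades gracefully on non-GPU machines."""
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return None  # no NVIDIA driver on this machine
    if libcuda.cuInit(0) != 0:  # CUresult other than CUDA_SUCCESS (0)
        return None
    count = ctypes.c_int(0)
    if libcuda.cuDeviceGetCount(ctypes.byref(count)) != 0:
        return None
    return count.value

print(cuda_device_count())
```

Running this with `CUDA_VISIBLE_DEVICES=0` would show whether the driver-level enumeration already misbehaves before any framework is involved.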
Hello @jrounds,
What are the models of the two GPUs, is there anything different between them?
What is your NVIDIA driver version?
I routinely select GPU 0 or 1 using CUDA_VISIBLE_DEVICES and haven't encountered this crash.
Just checking, does the presence of TensorFlow in the Conda env influence the crash? E.g. if you just create a Python virtualenv and `pip install drjit`, can you reproduce the crash?
We started speculating that there may be something running on GPU 0 that matters, but I didn't investigate that. A core dump is a less-than-ideal message, and it may be user error, though I don't see what it could be. nvidia-smi certainly isn't giving a clue.
GPUs: 2x Quadro RTX 6000
Driver: 535.104.05
I am going to close this because I have an effective workaround, and I am not convinced it isn't our machine.
If you feel like this is a problem with DrJit, the error message would likely be much more informative using a debug build (https://drjit.readthedocs.io/en/latest/firststeps-cpp.html#first-steps-in-c).
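Short of a full debug build, the standard-library `faulthandler` module can sometimes surface a Python-level traceback at the point of a segfault (a general technique, not specific to drjit):

```python
import faulthandler

# Enabling the handler before the crashing import makes the interpreter
# dump a traceback on SIGSEGV instead of dying silently. Equivalent to:
#   python3 -X faulthandler -c "import drjit"
faulthandler.enable()
print(faulthandler.is_enabled())  # prints: True
```

The traceback at least shows which import triggered the fault, even if the crash itself is inside native code.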
Glad to hear it! Thank you for reporting back.