Core dump when CUDA_VISIBLE_DEVICES set to 0.
jrounds opened this issue · 5 comments
Hi,
Host is redhat os with two nvidia GPUs.
I built an environment like this, and everything works (CPU only):
conda create -y --name py38_drjit_test python==3.8
conda activate py38_drjit_test
python3 -m pip install --upgrade pip
python3 -m pip install drjit
python3 -c "import drjit; print(drjit.__version__)" #outputs 0.4.3
Then I added TensorFlow/CUDA according to the published instructions:
# Installing tensorflow based on latest instructions
conda install -y -c conda-forge cudatoolkit=11.8.0
python3 -m pip install nvidia-cudnn-cu11==8.6.0.163 tensorflow==2.13.0
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
echo 'export LD_LIBRARY_PATH=$CUDNN_PATH/lib:$CONDA_PREFIX/lib/:$LD_LIBRARY_PATH' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
Still good; this all works as expected:
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))" # two devices
python3 -c "import drjit; print(drjit.__version__)" #outputs 0.4.3
Then I started experimenting with CUDA_VISIBLE_DEVICES (I actually started with the last case below, but am showing the ones that work first).
This works (both GPUs):
export CUDA_VISIBLE_DEVICES=0,1
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))" # two devices
python3 -c "import drjit; print(drjit.__version__)" #outputs 0.4.3
This works (last GPU):
export CUDA_VISIBLE_DEVICES=1
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))" # one device
python3 -c "import drjit; print(drjit.__version__)"
This core dumps (repeatable in any ordering of these combinations):
export CUDA_VISIBLE_DEVICES=0
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))" # one device
python3 -c "import drjit; print(drjit.__version__)"
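The three cases above can be condensed into one loop that records which device masks crash the import (a hedged sketch; it assumes drjit is installed, and the `|| echo` keeps the loop going past a failing case):

```shell
# Try each CUDA_VISIBLE_DEVICES variant and report whether
# importing drjit survives under that mask.
for devs in "0,1" "1" "0"; do
  echo "CUDA_VISIBLE_DEVICES=$devs"
  CUDA_VISIBLE_DEVICES=$devs python3 -c "import drjit; print(drjit.__version__)" \
    || echo "import failed (exit status $?)"
done
```

A segfault shows up here as a non-zero exit status (139 for SIGSEGV on Linux) rather than a silent core dump.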
Actual output
(py38_drjit_test) [host]$ export CUDA_VISIBLE_DEVICES=0
(py38_drjit_test) [host]$ python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))" # one device
2023-08-30 11:00:34.838109: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-08-30 11:00:34.880618: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-30 11:00:35.631433: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
(py38_drjit_test) [host]$ python3 -c "import drjit; print(drjit.__version__)"
Segmentation fault (core dumped)
So setting CUDA_VISIBLE_DEVICES to the first of the two GPUs results in a core dump on import of drjit, but setting it to the last one does not?
Any advice on what to consider to work through this?
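One way to narrow this down (a sketch I'm suggesting, not something from the thread) is to query the CUDA driver directly via `ctypes`, bypassing both TensorFlow and drjit, to confirm the driver itself can enumerate GPU 0 under the mask:

```python
import ctypes

def cuda_device_count():
    """Ask the CUDA driver API how many devices are visible under the
    current CUDA_VISIBLE_DEVICES mask. Returns None if the driver is
    unavailable, so the check degrades gracefully on non-GPU machines."""
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return None  # no NVIDIA driver on this machine
    if libcuda.cuInit(0) != 0:  # CUresult other than CUDA_SUCCESS (0)
        return None
    count = ctypes.c_int(0)
    if libcuda.cuDeviceGetCount(ctypes.byref(count)) != 0:
        return None
    return count.value

print(cuda_device_count())
```

Running this with `CUDA_VISIBLE_DEVICES=0` would show whether the driver-level enumeration already misbehaves before any framework is involved.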
Hello @jrounds,
What are the models of the two GPUs, is there anything different between them?
What is your NVIDIA driver version?
I routinely select GPU 0 or 1 using CUDA_VISIBLE_DEVICES and haven't encountered this crash.
Just checking, does the presence of TensorFlow in the Conda env influence the crash? E.g. if you just create a Python virtualenv and `pip install drjit`, can you reproduce the crash?
We started speculating that there may be something running on GPU 0 that matters, but I didn't investigate that. A core dump is a less-than-ideal message, and it may be user error, though I don't see what it could be. nvidia-smi certainly isn't giving a clue.
GPUs: 2x Quadro RTX 6000
Driver: 535.104.05
I am going to close this because I have an effective workaround, and I am not convinced it isn't our machine.
If you feel like this is a problem with DrJit, the error message would likely be much more informative using a debug build (https://drjit.readthedocs.io/en/latest/firststeps-cpp.html#first-steps-in-c).
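Short of a full debug build, the standard-library `faulthandler` module can sometimes surface a Python-level traceback at the point of a segfault (a general technique, not specific to drjit):

```python
import faulthandler

# Enabling the handler before the crashing import makes the interpreter
# dump a traceback on SIGSEGV instead of dying silently. Equivalent to:
#   python3 -X faulthandler -c "import drjit"
faulthandler.enable()
print(faulthandler.is_enabled())  # prints: True
```

The traceback at least shows which import triggered the fault, even if the crash itself is inside native code.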
Glad to hear it! Thank you for reporting back.