Segmentation fault when using 'mixed'

Version

nvidia-dali-cuda110 1.35.0

Describe the bug.

When running a DALI pipeline with device="mixed", or eager mode with device="gpu", I get a segmentation fault. I'm running inside a Docker container, in a conda env. If I use device="cpu" the code works without issue. I also tried this in a different conda env with CUDA 12, with the same result.

Sidenote: the eager "gpu" device is broken as it tries to check for a ._mixed_ops property that doesn't exist anymore I guess:

DALI/dali/python/nvidia/dali/_utils/eager_utils.py

Line 605 in 717d704

if op_name in _ops._mixed_ops:

I just removed that check and set device = "mixed" for my experiment

Minimum reproducible example

import nvidia.dali as dali
import nvidia.dali.fn as fn
from nvidia.dali.experimental import eager
from nvidia.dali import tensors
from nvidia.dali import pipeline_def
import numpy as np

pipe = None
def load_image_dali(image_file: str, use_eager=True) -> np.ndarray:
    if use_eager:
        img_data = np.fromfile(image_file, dtype=np.uint8)
        sample = tensors.TensorCPU(img_data)
        img_data_list = tensors.TensorListCPU([sample])
        images = eager.decoders.image(img_data_list, device="gpu")
    else:
        batch_size = 1

        @pipeline_def()
        def image_decoder_pipeline(device="mixed"):
            img_data = fn.external_source(name="img_data")
            return fn.decoders.image(img_data, device=device, use_fast_idct=False)

        global pipe
        if pipe is None:
            pipe = image_decoder_pipeline(device="mixed", batch_size=batch_size, num_threads=1, device_id=0,
                                          prefetch_queue_depth=1)
            pipe.build()
        img_data = np.fromfile(image_file, dtype=np.uint8)
        (images,) = pipe.run(img_data=[img_data])

    if isinstance(images, dali.tensors.TensorListGPU):
        images = images.as_cpu()

    image = np.asanyarray(images[0])

    return image

def main():
    # Image is from: https://upload.wikimedia.org/wikipedia/commons/b/b4/JPEG_example_JPG_RIP_100.jpg
    # But I tried with a few jpgs, same issue
    image_file = "JPEG_example_JPG_RIP_100.jpg"
    image = load_image_dali(image_file, use_eager=False)
    print(image.shape)

if __name__ == "__main__":
    main()

Relevant log output

# With device="mixed"
Segmentation fault (core dumped)

# With device="cpu"
(234, 313, 3)

Other/Misc.

# nvcc  --version output:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0

# nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10G         Off  | 00000000:00:1E.0 Off |                    0 |
|  0%   22C    P0    50W / 300W |      0MiB / 23028MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Check for duplicates

I have searched the open bugs/issues and have found no duplicates for this bug report

Hi @usbhub,
Thank you for reporting the issue. It looks like we had gaps in the test coverage of the experimental eager operators; we will fix that.

As for the segfault, we are able to run both versions of the test on GPU on our end (with the fix for _mixed_ops). We tried it in several environments, with drivers 525.147.05 and 545.29.06.

Can you test whether anything GPU-related in DALI works, and whether you can run any CUDA code in your environment?

This is the simplest pipeline that does a copy to GPU:

from nvidia.dali import fn, pipeline_def, types
import numpy as np


def main():
    @pipeline_def(batch_size=1, num_threads=1, device_id=0)
    def pipe():
        constant = types.Constant(np.full((2, 2), 42))
        return constant.gpu()

    p = pipe()
    p.build()
    print(p.run()[0])

if __name__ == "__main__":
    main()
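
For the other half of the question (whether any CUDA code runs at all, independent of DALI), a minimal sketch is to call the CUDA driver API directly through ctypes. It assumes the driver library is reachable as libcuda.so.1, which is the usual name on Linux but may differ in your image; the helper name cuda_driver_version is just for illustration.

```python
import ctypes


def cuda_driver_version(lib_name="libcuda.so.1"):
    """Return the CUDA driver version as an int (e.g. 12000 for 12.0), or None.

    None means the driver library could not be loaded or a driver call failed.
    lib_name is an assumption about how the driver library is named on this system.
    """
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError:
        return None  # library not found or not loadable

    # cuInit must succeed before most other driver API calls; 0 is CUDA_SUCCESS.
    if libcuda.cuInit(0) != 0:
        return None

    version = ctypes.c_int(0)
    if libcuda.cuDriverGetVersion(ctypes.byref(version)) != 0:
        return None
    return version.value


if __name__ == "__main__":
    print("CUDA driver version:", cuda_driver_version())
```

If this prints None while nvidia-smi works on the host, the container most likely does not see the driver libraries at all, which would be a different problem from a crash inside one specific operator.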

Can you share what docker image and the environment inside it you use, so we can do a full repro?
Something like:

docker run --gpus all --rm -ti ubuntu:22.04

apt update && apt install -y vim wget python3-pip
pip install --extra-index-url https://pypi.nvidia.com/ --upgrade nvidia-dali-cuda110 numpy
wget https://upload.wikimedia.org/wikipedia/commons/b/b4/JPEG_example_JPG_RIP_100.jpg
python3 test.py

Thank you for the quick response,

That basic pipeline you sent works and prints

TensorListGPU(
    [[[42 42]
      [42 42]]],
    dtype=DALIDataType.INT32,
    num_samples=1,
    shape=[(2, 2)])

The Docker image I'm testing on is custom, though I also tried the base it was derived from, nvidia/cuda:11.7.1-cudnn8-devel-ubuntu22.04, and that worked. So either the base got updated after I built my custom image in a way that fixed the issue, or something I did in my custom image broke it. I also tried outside of the conda env and hit the same issue, so I think it is Docker/library related and not something in conda. Is there any way I can narrow down what might be causing the issue? Maybe checking library versions for mismatches?
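
One way to look for mismatches, as a rough sketch: on Linux, /proc/self/maps lists every shared library mapped into the current process, so comparing that list between the working base image and the failing custom image can show a library being picked up from an unexpected path. The GPU_HINTS substrings and both helper names are illustrative assumptions, not anything DALI provides.

```python
import re
from pathlib import Path

# Substrings used to pick out GPU-related libraries (illustrative, adjust as needed).
GPU_HINTS = ("cuda", "nvidia", "nvcuvid", "nvjpeg", "dali")


def mapped_libraries(maps_text):
    """Return the unique shared-object paths found in /proc/<pid>/maps text."""
    paths = set()
    for line in maps_text.splitlines():
        fields = line.split(maxsplit=5)
        # The sixth field, when present, is the file backing the mapping.
        if len(fields) == 6:
            path = fields[5].strip()
            if re.search(r"\.so(\.\d+)*$", path):
                paths.add(path)
    return sorted(paths)


def gpu_related(paths):
    """Keep only the paths that contain one of the GPU_HINTS substrings."""
    return [p for p in paths if any(hint in p.lower() for hint in GPU_HINTS)]


if __name__ == "__main__":
    # In the failing environment, import nvidia.dali (and build the pipeline)
    # before this point so that its libraries are already mapped.
    maps = Path("/proc/self/maps")
    if maps.exists():
        for path in gpu_related(mapped_libraries(maps.read_text())):
            print(path)
```

Since the crash kills the process, the dump has to happen before the call that segfaults; libraries that are only loaded by that call will not show up.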

Hi, after digging a bit I've found the issue. I was installing the decord library inside the Docker image, and building it requires the NVIDIA Video Codec SDK (build error for reference below). I had downloaded the SDK and put its libnvcuvid.so and libnvidia-encode.so inside the CUDA folder of my image. This was causing the issue, possibly due to a mismatch with the driver. After removing those files everything seems to run as expected. Sorry for the mixup, this is not a DALI issue.
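
For anyone hitting the same thing, a small sketch that looks for copies of these two libraries on disk. The directories in SEARCH_DIRS are assumptions about a typical Ubuntu-based CUDA image and may need adjusting; find_driver_libs is a hypothetical helper name.

```python
from pathlib import Path

# Libraries that normally come from the driver, not from the CUDA toolkit.
DRIVER_LIBS = ("libnvcuvid.so", "libnvidia-encode.so")

# Assumed locations; adjust to wherever your image keeps CUDA and driver libraries.
SEARCH_DIRS = ("/usr/local/cuda/lib64", "/usr/lib/x86_64-linux-gnu", "/usr/lib64")


def find_driver_libs(search_dirs):
    """Map each driver library name to the matching files under search_dirs."""
    found = {name: [] for name in DRIVER_LIBS}
    for directory in map(Path, search_dirs):
        if not directory.is_dir():
            continue  # skip locations that do not exist in this image
        for name in DRIVER_LIBS:
            # Match both the bare name and versioned names such as libnvcuvid.so.1.
            found[name].extend(sorted(str(p) for p in directory.glob(name + "*")))
    return found


if __name__ == "__main__":
    for name, paths in find_driver_libs(SEARCH_DIRS).items():
        print(name)
        for path in paths:
            print("   ", path)
```

A copy sitting in the CUDA toolkit folder next to a differently versioned one in the driver library folder is the situation described above.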

This is unrelated to this library, but do you have any thoughts on how to build a library that requires the video codec inside Docker without causing issues? After some googling I saw that you can install these packages: sudo apt install libnvidia-decode-525 libnvidia-encode-525, but you need to know the driver version, which you wouldn't know ahead of time in a Docker image. Any suggestions are appreciated, thank you.

Decord build error:

-- Found CUDA_NVCUVID_LIBRARY=CUDA_NVCUVID_LIBRARY-NOTFOUND
CMake Error at cmake/modules/CUDA.cmake:33 (message):
  Cannot find libnvcuvid, you may need to manually register and download at
  https://developer.nvidia.com/nvidia-video-codec-sdk.  Then copy libnvcuvid
  to cuda_toolkit_root/lib64/
Call Stack (most recent call first):
  CMakeLists.txt:92 (include)

I've found that you need to explicitly give the 'video' capability to get libnvcuvid.so when running the docker, for example like this:

docker run -it --rm --runtime nvidia -e NVIDIA_DRIVER_CAPABILITIES=compute,utility,video nvidia/cuda:11.7.1-cudnn8-devel-ubuntu22.04 sh -c 'ldconfig -p | grep cuvid'

Refs:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/docker-specialized.html#driver-capabilities
blakeblackshear/frigate#5166 (comment)
NVIDIA/nvidia-docker#1001 (comment)
It seems from this answer you can also set it during build, though I haven't tested this yet: https://stackoverflow.com/a/77348905

I think this can be resolved, thank you for the help.

Hi @usbhub,

Yes, exactly that. Exposing the video capability inside Docker is the way to go. You can also set ENV NVIDIA_DRIVER_CAPABILITIES video,compute,utility in the Dockerfile at build time. Both ways should work (the second means the user doesn't have to remember how to run the container, so it is a bit more deployment-friendly).
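
Either way, a quick check from inside the running container can confirm the setting took effect. This is only a sketch: has_video_capability is a hypothetical helper, and the lookup relies on the standard library's ctypes.util.find_library, which on Linux consults the same linker cache as ldconfig -p.

```python
import os
from ctypes.util import find_library


def has_video_capability(value):
    """True if an NVIDIA_DRIVER_CAPABILITIES value grants the video capability."""
    caps = {cap.strip() for cap in (value or "").split(",")}
    # "all" enables every driver capability, including video.
    return "video" in caps or "all" in caps


if __name__ == "__main__":
    value = os.environ.get("NVIDIA_DRIVER_CAPABILITIES")
    print("NVIDIA_DRIVER_CAPABILITIES =", value)
    print("video capability granted:", has_video_capability(value))
    # None here means the linker cannot resolve libnvcuvid in this container.
    print("libnvcuvid resolved as:", find_library("nvcuvid"))
```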