NVIDIA/DALI

S3 reader always allocates a GPU (and crashes if none is available)

Closed · 4 comments

Version

1.38.0

Describe the bug.

Hi,

I'm testing the S3 reader and I noticed that it always requires an available GPU to work, i.e., on a machine without GPUs it crashes with this error:

dlopen libnvidia-ml.so failed!. Please install GPU dirver[/opt/dali/dali/util/nvml_wrap.cc:69] nvmlInitChecked failed:
[...]

Would it be possible to make it also work in contexts where no GPU is available, similar to how the regular file reader operates?

Thanks!

Minimum reproducible example

No response

Relevant log output

No response

Other/Misc.

No response

Check for duplicates

  • I have searched the open bugs/issues and have found no duplicates for this bug report

Hi @fversaci,

Thank you for reaching out. This is not the expected behavior. Can you provide a minimal repro that we can run on our end, so we are on the same page?

Sure, here's a minimal file reader. When run in a Docker container with no GPUs, this is the behaviour:

python3 file_read.py /imagenet/train   # this works
python3 file_read.py s3://imagenet/train   # this crashes at start-up, looking for a GPU
######
# file_read.py
######
# dali
from nvidia.dali.pipeline import pipeline_def
from nvidia.dali.plugin.base_iterator import LastBatchPolicy
from nvidia.dali.plugin.pytorch import DALIGenericIterator
import nvidia.dali.fn as fn
import nvidia.dali.types as types

# misc
from clize import run
from tqdm import trange
import math
import os

global_rank = int(os.getenv("RANK", default=0))
local_rank = int(os.getenv("LOCAL_RANK", default=0))
world_size = int(os.getenv("WORLD_SIZE", default=1))


def read_data(
    file_root=None,
    *,
    use_gpu=False,
    epochs=3,
):
    """Read images from filesystem, in a tight loop

    :param use_gpu: enable output to GPU (default: False)
    :param file_root: File root to read from
    """
    if use_gpu:
        device_id = local_rank
    else:
        device_id = types.CPU_ONLY_DEVICE_ID

    bs = 128
    file_reader = fn.readers.file(
        file_root=file_root,
        name="Reader",
        shard_id=global_rank,
        num_shards=world_size,
        # pad_last_batch=True,
        # speed up reading
        prefetch_queue_depth=2,
        dont_use_mmap=True,
        read_ahead=True,
    )

    # create dali pipeline
    @pipeline_def(
        batch_size=bs,
        num_threads=4,
        device_id=device_id,
        prefetch_queue_depth=2,
    )
    def get_dali_pipeline():
        images, labels = file_reader
        if device_id != types.CPU_ONLY_DEVICE_ID:
            images = images.gpu()
            labels = labels.gpu()
        return images, labels

    pl = get_dali_pipeline()
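    # On a machine without GPUs, the s3:// variant reportedly crashes during
    # start-up (pipeline construction/build), looking for a GPU.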
    pl.build()

    ########################################################################
    # DALI iterator
    ########################################################################
    # produce images
    shard_size = math.ceil(pl.epoch_size()["Reader"] / world_size)
    steps = math.ceil(shard_size / bs)
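    # e.g., assuming a full ImageNet-1k train split (1,281,167 images) and
    # world_size=1: shard_size = 1,281,167 and steps = ceil(1281167 / 128) = 10,010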
    for _ in range(epochs):
        # read data for current epoch
        for _ in trange(steps):
            pl.run()
        pl.reset()


# parse arguments
if __name__ == "__main__":
    run(read_data)
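
For reference, here is an even more reduced sketch of the same pattern (the file and pipeline names below are made up for illustration, and it assumes the crash shows up as soon as the CPU-only pipeline with an s3:// file_root is built and run):

######
# s3_cpu_min.py (illustrative sketch, not part of the original report)
######
from nvidia.dali.pipeline import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.types as types


@pipeline_def(batch_size=1, num_threads=1, device_id=types.CPU_ONLY_DEVICE_ID)
def s3_pipe():
    # CPU-only pipeline: no GPU should be required at any point
    images, labels = fn.readers.file(file_root="s3://imagenet/train", name="Reader")
    return images, labels


pl = s3_pipe()
pl.build()  # on a GPU-less machine, the GPU/NVML lookup reportedly fails around here
pl.run()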

Hi @fversaci,

Thank you for the repro.
It is indeed a bug. #5533 should fix it.
Please check the nightly build once it is merged.
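
(For reference: DALI nightly builds are published on NVIDIA's nightly pip index; assuming the CUDA 12 wheel, installation should look roughly like this:)

pip install --pre --extra-index-url https://developer.download.nvidia.com/compute/redist/nightly nvidia-dali-nightly-cuda120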

I confirm it's fixed, thanks!