S3 reader always allocates a GPU (and crashes if none is available)
Closed this issue · 4 comments
fversaci commented
Version
1.38.0
Describe the bug.
Hi,
I'm testing the S3 reader and I noticed that it always requires an available GPU to work, i.e., on a machine without GPUs it crashes with this error:
[/opt/dali/dali/util/nvml_wrap.cc:69] nvmlInitChecked failed: dlopen libnvidia-ml.so failed!. Please install GPU dirver
[...]
Would it be possible to make it work also in contexts where there's no GPU available, similar to how the regular file reader operates?
Thanks!
Minimum reproducible example
No response
Relevant log output
No response
Other/Misc.
No response
Check for duplicates
- I have searched the open bugs/issues and have found no duplicates for this bug report
JanuszL commented
Hi @fversaci,
Thank you for reaching out. This is not expected. Can you provide a minimal repro we can run on our end, so we are on the same page?
fversaci commented
Sure, here's a minimal file reader. When run in a Docker container with no GPUs, this is the behaviour:
python3 file_read.py /imagenet/train # this works
python3 file_read.py s3://imagenet/train # this crashes at start-up, looking for a GPU
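(Both invocations assume S3 credentials are already configured; DALI's S3 support goes through the AWS SDK, so presumably the standard credential environment variables need to be set before launching. The values below are placeholders, not part of the original setup:)
# assumed credential setup (placeholder values) -- not part of the original repro
import os
os.environ["AWS_ACCESS_KEY_ID"] = "<access-key>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<secret-key>"
os.environ["AWS_DEFAULT_REGION"] = "<region>"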
######
# file_read.py
######
# dali
from nvidia.dali.pipeline import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.types as types

# varia
from clize import run
from tqdm import trange
import math
import os
global_rank = int(os.getenv("RANK", default=0))
local_rank = int(os.getenv("LOCAL_RANK", default=0))
world_size = int(os.getenv("WORLD_SIZE", default=1))


def read_data(
    file_root=None,
    *,
    use_gpu=False,
    epochs=3,
):
    """Read images from filesystem, in a tight loop

    :param file_root: File root to read from
    :param use_gpu: enable output to GPU (default: False)
    """
    if use_gpu:
        device_id = local_rank
    else:
        device_id = types.CPU_ONLY_DEVICE_ID
    bs = 128
    file_reader = fn.readers.file(
        file_root=file_root,
        name="Reader",
        shard_id=global_rank,
        num_shards=world_size,
        # pad_last_batch=True,
        # speed up reading
        prefetch_queue_depth=2,
        dont_use_mmap=True,
        read_ahead=True,
    )

    # create dali pipeline
    @pipeline_def(
        batch_size=bs,
        num_threads=4,
        device_id=device_id,
        prefetch_queue_depth=2,
    )
    def get_dali_pipeline():
        images, labels = file_reader
        if device_id != types.CPU_ONLY_DEVICE_ID:
            images = images.gpu()
            labels = labels.gpu()
        return images, labels

    pl = get_dali_pipeline()
    pl.build()

    ########################################################################
    # DALI iterator
    ########################################################################
    # produce images
    shard_size = math.ceil(pl.epoch_size()["Reader"] / world_size)
    steps = math.ceil(shard_size / bs)
    for _ in range(epochs):
        # read data for current epoch
        for _ in trange(steps):
            pl.run()
        pl.reset()


# parse arguments
if __name__ == "__main__":
    run(read_data)
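For reference, the failure can presumably be reproduced without the iterator scaffolding as well; here is a condensed sketch of the script above (same assumptions: DALI 1.38.0, the s3://imagenet/train layout, no GPU visible):
######
# minimal_s3_crash.py (hypothetical name) -- condensed from file_read.py above
######
from nvidia.dali.pipeline import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.types as types


@pipeline_def(batch_size=1, num_threads=1, device_id=types.CPU_ONLY_DEVICE_ID)
def pipe():
    # same reader as above, pointed directly at the S3 bucket
    images, labels = fn.readers.file(file_root="s3://imagenet/train", name="Reader")
    return images, labels


p = pipe()
p.build()  # on a GPU-less machine the crash is reported at start-up, i.e. around here
p.run()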
JanuszL commented
fversaci commented
I confirm it's fixed, thanks!