NVIDIA/DALI

error using webdataset

CoinCheung opened this issue · 0 comments

Version

1.31.0

Describe the bug.

Building a pipeline that reads a tar file with fn.readers.webdataset fails with an "Underful sample detected" error.

Minimum reproducible example

Create the tar file and its index with these commands:

    $ tar cf sa_000199.coin.tar sa_000199/
    $ wds2idx sa_000199.coin.tar sa_000199.coin.idx
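One thing that may be worth ruling out is the member order inside the archive. The WebDataset convention, as generally described, treats files that share a name up to the first dot as one sample and expects them to be adjacent in the tar, whereas `tar cf` on a directory typically stores entries in whatever order the filesystem returns them. The sketch below is only an illustration under that assumption: `write_grouped_tar` and `sample_key` are hypothetical helper names, and storing the files without the directory prefix is a simplification chosen here, not a known requirement of the reader.

```python
import os
import tarfile


def sample_key(name):
    """Return the part of a file name before its first dot."""
    return os.path.basename(name).split(".", 1)[0]


def write_grouped_tar(src_dir, tar_path):
    """Write the regular files in src_dir to tar_path, grouped by sample key.

    Returns the member names in the order they were written.
    """
    names = [
        n for n in os.listdir(src_dir)
        if os.path.isfile(os.path.join(src_dir, n))
    ]
    # Sort by (key, full name) so all components of one sample are adjacent.
    names.sort(key=lambda n: (sample_key(n), n))
    with tarfile.open(tar_path, "w") as tar:
        for n in names:
            # arcname stores the file flat, without the source directory prefix.
            tar.add(os.path.join(src_dir, n), arcname=n)
    return names
```

For example, `write_grouped_tar("sa_000199", "sa_000199.coin.tar")` would replace the `tar cf` step, after which the index would need to be regenerated with `wds2idx`. With GNU tar, the `--sort=name` option may give a similar ordering without any Python.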

The relevant part of my pipeline looks like this:

    import os.path as osp
    import re

    from nvidia.dali import pipeline_def
    import nvidia.dali.fn as fn
    import nvidia.dali.types as types


    @pipeline_def
    def create_dali_pipeline_segment(shard_id, num_shards, dali_cpu=False,
                                     scales=[0.75, 2], cropsize=[1024, 1024],
                                     mean=[0.3257, 0.3690, 0.3223],
                                     std=[0.2112, 0.2148, 0.2115],
                                     ):

        saroot = '../../../datasets_share_to_all/SA-1B/raw/'
        ws_paths = [
            osp.join(saroot, 'sa_000199.coin.tar'),
            #  osp.join(saroot, 'sa_000026.tar'),
            #  osp.join(saroot, 'sa_000027.tar'),
        ]
        # index files sit next to the tar files, with the extension swapped
        ws_index_paths = [re.sub(r'tar$', 'idx', el) for el in ws_paths]
        # read only the jpg component; fail if a sample has no jpg
        images = fn.readers.webdataset(
            paths=ws_paths[0],
            #  index_paths=ws_index_paths,
            ext='jpg', missing_component_behavior="error",
            dtypes=[types.UINT8, ],
        )

        dali_device = 'cpu' if dali_cpu else 'gpu'
        decoder_device = 'cpu' if dali_cpu else 'mixed'
        # ask nvJPEG to preallocate memory for the biggest sample in ImageNet
        # for CPU and GPU to avoid reallocations at runtime
        device_memory_padding = 211025920 if decoder_device == 'mixed' else 0
        host_memory_padding = 140544512 if decoder_device == 'mixed' else 0
        # ask HW NVJPEG to allocate memory ahead for the biggest image in the
        # data set to avoid reallocations at runtime
        preallocate_width_hint = 5980 if decoder_device == 'mixed' else 0
        preallocate_height_hint = 6430 if decoder_device == 'mixed' else 0

        # decode and switch to gpu
        shape = fn.peek_image_shape(images)
        images = fn.decoders.image(images, device='mixed', output_type=types.RGB)
        #  shape = fn.shapes(images)
        images = images.gpu()

        # random resize
        scale = fn.random.uniform(range=(min(scales), max(scales)))
        new_size = shape[0:2] * scale
        images = fn.resize(images, size=new_size,
                           interp_type=types.DALIInterpType.INTERP_LINEAR, antialias=False)

        # return the resized images (the rest of the original pipeline is not shown)
        return images
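Since the reader is asked for `ext='jpg'` with `missing_component_behavior="error"`, any sample without a `.jpg` component would presumably abort construction. A rough way to look for such samples from Python is sketched below. It assumes the usual WebDataset grouping (consecutive members that share a path up to the first dot of the file name form one sample); whether DALI groups members in exactly this way is an assumption, and `member_key` and `find_samples_missing` are hypothetical helper names.

```python
import itertools
import posixpath
import tarfile


def member_key(name):
    """Return the member path up to the first dot of its file name."""
    head, tail = posixpath.split(name)
    return posixpath.join(head, tail.split(".", 1)[0])


def member_ext(name):
    """Return everything after the first dot of the file name ('' if none)."""
    tail = posixpath.basename(name)
    return tail.split(".", 1)[1] if "." in tail else ""


def find_samples_missing(tar_path, ext):
    """Return keys of consecutive member groups that contain no `.ext` file."""
    with tarfile.open(tar_path) as tar:
        # Only regular files are considered; directory entries are ignored.
        names = [m.name for m in tar if m.isfile()]
    missing = []
    # groupby merges only *adjacent* members, like a sequential tar reader.
    for key, group in itertools.groupby(names, key=member_key):
        if ext not in {member_ext(n) for n in group}:
            missing.append(key)
    return missing
```

Calling `find_samples_missing("sa_000199.coin.tar", "jpg")` returns the keys of groups that lack a `.jpg` member. A key that appears in the result even though its `.jpg` exists elsewhere in the archive would suggest that the components of that sample are not stored next to each other.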


Relevant log output

```shell
[/opt/dali/dali/operators/reader/loader/webdataset_loader.cc:373] Index file not provided, it may take some time to infer it from the tar file
[/opt/dali/dali/operators/reader/loader/webdataset_loader.cc:373] Index file not provided, it may take some time to infer it from the tar file
[/opt/dali/dali/operators/reader/loader/webdataset_loader.cc:373] Index file not provided, it may take some time to infer it from the tar file
[/opt/dali/dali/operators/reader/loader/webdataset_loader.cc:373] Index file not provided, it may take some time to infer it from the tar file
Traceback (most recent call last):
  File "/mnt/home/zzy/code/coin-ssl/BiSeNet/tools/pretrain_ddep.py", line 339, in <module>
    main()
  File "/mnt/home/zzy/code/coin-ssl/BiSeNet/tools/pretrain_ddep.py", line 335, in main
    train()
  File "/mnt/home/zzy/code/coin-ssl/BiSeNet/tools/pretrain_ddep.py", line 227, in train
    dl_train, total_iters = create_dali_loader(cfg, mode='train')
  File "/mnt/home/zzy/code/coin-ssl/BiSeNet/./lib/data/dali_loader_pretrain_webdataset.py", line 212, in create_dali_loader
    pipe.build()
  File "/opt/miniconda3/envs/py310/lib/python3.10/site-packages/nvidia/dali/pipeline.py", line 852, in build
    self._pipe.Build(self._generate_build_args())
RuntimeError: Critical error when building pipeline:
Error when constructing operator: readers__Webdataset encountered:
[/opt/dali/dali/operators/reader/loader/webdataset_loader.cc:465] Underful sample detected at tar file at "../../../datasets_share_to_all/OpenDataLab___SA-1B/raw/sa_000199.coin.tar"
Stacktrace (35 entries):
[frame 0]: /opt/miniconda3/envs/py310/lib/python3.10/site-packages/nvidia/dali/libdali_operators.so(+0x69e80e) [0x7f2579b2180e]
[frame 1]: /opt/miniconda3/envs/py310/lib/python3.10/site-packages/nvidia/dali/libdali_operators.so(+0x550e3c) [0x7f25799d3e3c]
[frame 2]: /opt/miniconda3/envs/py310/lib/python3.10/site-packages/nvidia/dali/libdali_operators.so(+0x3cca1a9) [0x7f257d14d1a9]
[frame 3]: /opt/miniconda3/envs/py310/lib/python3.10/site-packages/nvidia/dali/libdali_operators.so(+0x3f503c8) [0x7f257d3d33c8]
[frame 4]: /opt/miniconda3/envs/py310/lib/python3.10/site-packages/nvidia/dali/libdali_operators.so(std::_Function_handler<std::unique_ptr<dali::OperatorBase, std::default_delete<dali::OperatorBase> > (dali::OpSpec const&), std::unique_ptr<dali::OperatorBase, std::default_delete<dali::OperatorBase> > (*)(dali::OpSpec const&)>::_M_invoke(std::_Any_data const&, dali::OpSpec const&)+0xe) [0x7f2579b1e9de]
[frame 5]: /opt/miniconda3/envs/py310/lib/python3.10/site-packages/nvidia/dali/libdali.so(+0x1cde43) [0x7f2596b27e43]
[frame 6]: /opt/miniconda3/envs/py310/lib/python3.10/site-packages/nvidia/dali/libdali.so(dali::InstantiateOperator(dali::OpSpec const&)+0x273) [0x7f2596b26013]
[frame 7]: /opt/miniconda3/envs/py310/lib/python3.10/site-packages/nvidia/dali/libdali.so(dali::OpGraph::InstantiateOperators()+0xa8) [0x7f2596ac0618]
[frame 8]: /opt/miniconda3/envs/py310/lib/python3.10/site-packages/nvidia/dali/libdali.so(dali::Pipeline::Build(std::vector<dali::PipelineOutputDesc, std::allocator<dali::PipelineOutputDesc> >)+0x9c8) [0x7f2596b50ef8]
[frame 9]: /opt/miniconda3/envs/py310/lib/python3.10/site-packages/nvidia/dali/backend_impl.cpython-310-x86_64-linux-gnu.so(+0x4a043) [0x7f258bef3043]
[frame 10]: /opt/miniconda3/envs/py310/lib/python3.10/site-packages/nvidia/dali/backend_impl.cpython-310-x86_64-linux-gnu.so(+0x6a902) [0x7f258bf13902]
[frame 11]: /opt/miniconda3/envs/py310/lib/python3.10/site-packages/nvidia/dali/backend_impl.cpython-310-x86_64-linux-gnu.so(+0xbfd1a) [0x7f258bf68d1a]
[frame 12]: /opt/miniconda3/envs/py310/bin/python() [0x4fd907]
[frame 13]: /opt/miniconda3/envs/py310/bin/python(_PyObject_MakeTpCall+0x25b) [0x4f705b]
[frame 14]: /opt/miniconda3/envs/py310/bin/python() [0x5098bf]
[frame 15]: /opt/miniconda3/envs/py310/bin/python(_PyEval_EvalFrameDefault+0x4b26) [0x4f2856]
[frame 16]: /opt/miniconda3/envs/py310/bin/python(_PyFunction_Vectorcall+0x6f) [0x4fdd4f]
[frame 17]: /opt/miniconda3/envs/py310/bin/python(_PyEval_EvalFrameDefault+0x731) [0x4ee461]
[frame 18]: /opt/miniconda3/envs/py310/bin/python(_PyFunction_Vectorcall+0x6f) [0x4fdd4f]
[frame 19]: /opt/miniconda3/envs/py310/bin/python(_PyEval_EvalFrameDefault+0x13b3) [0x4ef0e3]
[frame 20]: /opt/miniconda3/envs/py310/bin/python(_PyFunction_Vectorcall+0x6f) [0x4fdd4f]
[frame 21]: /opt/miniconda3/envs/py310/bin/python(_PyEval_EvalFrameDefault+0x31f) [0x4ee04f]
[frame 22]: /opt/miniconda3/envs/py310/bin/python(_PyFunction_Vectorcall+0x6f) [0x4fdd4f]
[frame 23]: /opt/miniconda3/envs/py310/bin/python(_PyEval_EvalFrameDefault+0x31f) [0x4ee04f]
[frame 24]: /opt/miniconda3/envs/py310/bin/python() [0x5951c2]
[frame 25]: /opt/miniconda3/envs/py310/bin/python(PyEval_EvalCode+0x87) [0x595107]
[frame 26]: /opt/miniconda3/envs/py310/bin/python() [0x5c5ef7]
[frame 27]: /opt/miniconda3/envs/py310/bin/python() [0x5c1030]
[frame 28]: /opt/miniconda3/envs/py310/bin/python() [0x459781]
[frame 29]: /opt/miniconda3/envs/py310/bin/python(_PyRun_SimpleFileObject+0x19f) [0x5bb5bf]
[frame 30]: /opt/miniconda3/envs/py310/bin/python(_PyRun_AnyFileObject+0x43) [0x5bb323]
[frame 31]: /opt/miniconda3/envs/py310/bin/python(Py_RunMain+0x38d) [0x5b80dd]
[frame 32]: /opt/miniconda3/envs/py310/bin/python(Py_BytesMain+0x39) [0x5883f9]
[frame 33]: /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f265870c083]
[frame 34]: /opt/miniconda3/envs/py310/bin/python() [0x5882ae]

Current pipeline object is no longer valid.
```

Other/Misc.

No response

Check for duplicates

  • I have searched the open bugs/issues and have found no duplicates for this bug report