pytorch/data

in 'webdataset' the keys in the resulting dict have leading '.'

hhoeflin opened this issue · 7 comments

๐Ÿ› Describe the bug

When reading data from a tar-archive in 'webdataset'-format, the individual keys currently have the following form:

{'__key__': '/myfile', '.patch.jpeg': StreamWrapper</myfile.patch.jpeg,<ExFileObject name=None>>}

However, the leading '.' should be stripped from the key.

See, for example, the webdataset docs: https://webdataset.github.io/webdataset/creating/.

Versions

Versions of relevant libraries:
[pip3] flake8==6.0.0
[pip3] mypy==1.1.1
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.23.5
[pip3] pytorch-lightning==1.9.4
[pip3] torch==2.0.0
[pip3] torchdata==0.6.0
[pip3] torchmetrics==0.11.3
[pip3] torchvision==0.14.1
[pip3] triton==2.0.0
[conda] blas 2.116 mkl conda-forge
[conda] blas-devel 3.9.0 16_linux64_mkl conda-forge
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] libblas 3.9.0 16_linux64_mkl conda-forge
[conda] libcblas 3.9.0 16_linux64_mkl conda-forge
[conda] liblapack 3.9.0 16_linux64_mkl conda-forge
[conda] liblapacke 3.9.0 16_linux64_mkl conda-forge
[conda] mkl 2022.1.0 h84fe81f_915 conda-forge
[conda] mkl-devel 2022.1.0 ha770c72_916 conda-forge
[conda] mkl-include 2022.1.0 h84fe81f_915 conda-forge
[conda] numpy 1.23.5 pypi_0 pypi
[conda] pytorch-cuda 11.6 h867d48c_1 pytorch
[conda] pytorch-lightning 1.9.4 pypi_0 pypi
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torch 2.0.0 pypi_0 pypi
[conda] torchdata 0.6.0 pypi_0 pypi
[conda] torchmetrics 0.11.3 pypi_0 pypi
[conda] torchvision 0.14.1 py310_cu116 pytorch
[conda] triton 2.0.0 pypi_0 pypi

NivekT commented

The WebDataset DataPipe extracts the file extension from the path, and it treats everything after the first "." in the file name as part of the file extension.

For example,

pathsplit("/root/abc/some.tar.gz") #  => ".tar.gz" will be used as key in resulting Dict

As a result, "1.jpg" and "1.txt" would be grouped together in the same dictionary.
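The grouping described above can be sketched in plain Python. This is only an illustration and not the torchdata implementation: `split_at_first_dot` and `group_by_prefix` are hypothetical helper names, and the sketch assumes the split happens at the first "." of the base name.

```python
import posixpath
from collections import defaultdict


def split_at_first_dot(path):
    """Hypothetical helper: split a path at the first '.' of its base name."""
    dirname, basename = posixpath.split(path)
    stem, dot, rest = basename.partition(".")
    if not dot:
        # No '.' in the file name, so there is no suffix.
        return path, ""
    return posixpath.join(dirname, stem), dot + rest


def group_by_prefix(paths):
    """Group paths sharing a prefix; suffixes (leading '.' kept) become keys."""
    samples = defaultdict(dict)
    for path in paths:
        prefix, suffix = split_at_first_dot(path)
        samples[prefix][suffix] = path
    return dict(samples)


groups = group_by_prefix(["1.jpg", "1.txt", "/root/abc/some.tar.gz"])
print(groups)
```

Under these assumptions, "1.jpg" and "1.txt" land in one dictionary under the prefix "1", with keys ".jpg" and ".txt".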

Can you elaborate on what file names are within your archives? How is this impacting your use case?

Hi, thanks for the answer. The behaviour I was pointing out is that using '.tar.gz' as the key in the dict is inconsistent with the previous behaviour of webdataset, where the key would be 'tar.gz' (in the link above, see the example under 'Direct conversion of any dataset').

Here is also a short example. Run

touch a.jpg
touch a.png
tar cf wds_example.tar a.*

And then run the code

import webdataset as wds

dataset = wds.WebDataset("wds_example.tar")

for x in dataset:
    print(x)

yielding

{'__key__': 'a', '__url__': 'wds_example.tar', 'jpg': b'', 'png': b''}

without the leading '.'.

Of course, as the difference is small, it is not directly impacting a use case and is easily remedied.

An example where removing the leading dot makes sense is when we have files like

sample1.img.png
sample1.label.json
sample2.img.png
sample2.label.json

This would then give a dict with keys img.png and label.json instead of the (to me) unintuitive .img.png and .label.json.

NivekT commented

The DataPipe was written by @tmbdev in #367 who, I believe, also created the format. I'm inclined to think this is correct but I don't have a strong opinion on it.

@tmbdev would you like to chime in? Thanks!

tmbdev commented

I agree that without the "." it looks nicer. Still, this was a deliberate difference, since the absence of the "." in WebDataset ended up adding a bunch of special cases in various places.

We could add an option to let the user choose, also simplifying the porting from WebDataset.

I suppose it might not be too late to switch the default if people feel strongly about it.

If this was a deliberate choice, that is OK with me. Where was the issue with removing the leading '.'? Was that about automatic decoding and recognizing file extensions when the '.' is missing?

It would be nice to have an option to strip the '.' from the result, but then again that is a trivial function and can be done with a map. It would be nice to provide it, though.
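A minimal sketch of such a function, assuming each sample is a plain dict whose extension keys start with a single '.'. The name `strip_leading_dot` is hypothetical and not part of torchdata.

```python
def strip_leading_dot(sample):
    """Return a copy of the sample with one leading '.' removed from each key."""
    return {
        (key[1:] if key.startswith(".") else key): value
        for key, value in sample.items()
    }


# Example sample shaped like the one in the bug report (payload replaced by bytes).
sample = {"__key__": "/myfile", ".patch.jpeg": b""}
cleaned = strip_leading_dot(sample)
print(cleaned)
```

Keys without a leading '.', such as `__key__`, pass through unchanged. On a DataPipe this could presumably be applied with `.map(strip_leading_dot)`.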

Quick question about the docs:

The prefix is used for grouping files into samples,
    the suffix is used as key in the output dictionary.
    The suffix consists of all components after the last
    "." in the filename.

Should this be 'the suffix consists of all components after the first "." in the filename'?

Thanks!

NivekT commented

Should this be 'the suffix consists of all components after the first "." in the filename'?

This seems to accurately describe what the implementation is doing. We should either change the doc string or the implementation.
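The difference between the two wordings can be illustrated with a short sketch. Both function names are hypothetical; the first follows the docstring's "last '.'" wording via `posixpath.splitext`, and the second follows the "first '.'" behaviour discussed in this thread.

```python
import posixpath


def suffix_after_last_dot(path):
    """Suffix as the docstring words it: everything from the last '.' on."""
    return posixpath.splitext(path)[1]


def suffix_after_first_dot(path):
    """Suffix as discussed here: everything from the first '.' of the base name."""
    basename = posixpath.basename(path)
    _, dot, rest = basename.partition(".")
    return dot + rest


name = "sample1.img.png"
print(suffix_after_last_dot(name))   # .png
print(suffix_after_first_dot(name))  # .img.png
```

For names with a single '.', such as "a.jpg", the two agree. They differ only for multi-part extensions.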

NivekT commented

Please feel free to re-open if there is any additional request or concern.