in 'webdataset' the keys in the resulting dict have leading '.'
hhoeflin opened this issue · 7 comments
🐛 Describe the bug
When reading data from a tar-archive in 'webdataset'-format, the individual keys currently have the following form:
{'__key__': '/myfile', '.patch.jpeg': StreamWrapper</myfile.patch.jpeg,<ExFileObject name=None>>}
However, the leading '.' should be stripped from the key.
See, for example, the webdataset docs: https://webdataset.github.io/webdataset/creating/
Versions
Versions of relevant libraries:
[pip3] flake8==6.0.0
[pip3] mypy==1.1.1
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.23.5
[pip3] pytorch-lightning==1.9.4
[pip3] torch==2.0.0
[pip3] torchdata==0.6.0
[pip3] torchmetrics==0.11.3
[pip3] torchvision==0.14.1
[pip3] triton==2.0.0
[conda] blas 2.116 mkl conda-forge
[conda] blas-devel 3.9.0 16_linux64_mkl conda-forge
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] libblas 3.9.0 16_linux64_mkl conda-forge
[conda] libcblas 3.9.0 16_linux64_mkl conda-forge
[conda] liblapack 3.9.0 16_linux64_mkl conda-forge
[conda] liblapacke 3.9.0 16_linux64_mkl conda-forge
[conda] mkl 2022.1.0 h84fe81f_915 conda-forge
[conda] mkl-devel 2022.1.0 ha770c72_916 conda-forge
[conda] mkl-include 2022.1.0 h84fe81f_915 conda-forge
[conda] numpy 1.23.5 pypi_0 pypi
[conda] pytorch-cuda 11.6 h867d48c_1 pytorch
[conda] pytorch-lightning 1.9.4 pypi_0 pypi
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torch 2.0.0 pypi_0 pypi
[conda] torchdata 0.6.0 pypi_0 pypi
[conda] torchmetrics 0.11.3 pypi_0 pypi
[conda] torchvision 0.14.1 py310_cu116 pytorch
[conda] triton 2.0.0 pypi_0 pypi
The WebDataset DataPipe extracts the file extension from the path, and it treats everything after the first "." in the file name as part of the file extension.
For example,
pathsplit("/root/abc/some.tar.gz") # => ".tar.gz" will be used as key in resulting Dict
As a result, "1.jpg" and "1.txt" would be grouped together in the same dictionary.
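To make the splitting rule concrete, here is a minimal sketch of a split at the first "." of the file name. The function name `split_at_first_dot` is made up for illustration; this is not the torchdata source, only an approximation of the behaviour described above, assuming "/"-separated paths as found in tar archives.

```python
import posixpath


def split_at_first_dot(path):
    """Split `path` into (prefix, suffix) at the first '.' of its file name.

    Hypothetical helper, not part of torchdata or webdataset.
    """
    directory, filename = posixpath.split(path)
    # partition() splits at the FIRST '.', so "some.tar.gz" -> "some", ".", "tar.gz"
    stem, dot, rest = filename.partition(".")
    prefix = posixpath.join(directory, stem)
    suffix = dot + rest  # keeps the leading '.', e.g. ".tar.gz"
    return prefix, suffix


print(split_at_first_dot("/root/abc/some.tar.gz"))  # ('/root/abc/some', '.tar.gz')
print(split_at_first_dot("1.jpg"))                  # ('1', '.jpg')
print(split_at_first_dot("1.txt"))                  # ('1', '.txt')
```

Both "1.jpg" and "1.txt" yield the prefix "1", which is why they end up in the same dictionary.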
Can you elaborate on what file names are within your archives? How is this impacting your use case?
Hi, thanks for the answer. The behaviour I was pointing out is that using '.tar.gz' as the key in the dict is inconsistent with the previous behaviour of webdataset, where the key would be 'tar.gz' (in the link above, see the example under 'Direct conversion of any dataset').
Here is also a short example. Run
touch a.jpg
touch a.png
tar cf wds_example.tar a.*
And then run the code
import webdataset as wds
dataset = wds.WebDataset("wds_example.tar")
for x in dataset:
    print(x)
yielding
{'__key__': 'a', '__url__': 'wds_example.tar', 'jpg': b'', 'png': b''}
without the leading '.'.
Of course, as the difference is small, it is not directly impacting a use case and is easily remedied.
An example where removing the leading dot makes sense is when we have files like
sample1.img.png
sample1.label.json
sample2.img.png
sample2.label.json
then the resulting dicts would have the keys img.png and label.json instead of the (to me) unintuitive .img.png and .label.json.
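The grouping for these four files can be sketched in plain Python. This is only an illustration under the assumption that names are bare file names without directories; `group_by_prefix` and its `keep_dot` switch are hypothetical and do not exist in torchdata or webdataset.

```python
from collections import defaultdict


def group_by_prefix(filenames, keep_dot=True):
    """Group file names into samples, splitting each name at its first '.'.

    `keep_dot` is a hypothetical switch, not an existing library option.
    """
    samples = defaultdict(dict)
    for name in filenames:
        prefix, dot, rest = name.partition(".")
        key = dot + rest if keep_dot else rest
        samples[prefix][key] = name  # the value would be the file content in practice
    return dict(samples)


files = [
    "sample1.img.png",
    "sample1.label.json",
    "sample2.img.png",
    "sample2.label.json",
]

with_dot = group_by_prefix(files, keep_dot=True)
without_dot = group_by_prefix(files, keep_dot=False)

print(sorted(with_dot["sample1"]))     # ['.img.png', '.label.json']
print(sorted(without_dot["sample1"]))  # ['img.png', 'label.json']
```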
I agree that without the "." it looks nicer. Still, this was a deliberate difference, since the absence of the "." in WebDataset ended up adding a bunch of special cases in various places.
We could add an option to let the user choose, also simplifying the porting from WebDataset.
I suppose it might not be too late to switch the default if people feel strongly about it.
If this was a deliberate choice, that is OK with me. What was the issue with removing the leading '.'? Was it about automatic decoding and recognizing file extensions when the '.' is missing?
It would be nice to have an option to strip the '.' from the result, but then again that is a trivial function and can be done with a map. It would be nice to provide it, though.
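For reference, a minimal sketch of such a function, applied here with the built-in `map` to plain dicts. The name `strip_leading_dot` is made up, and the sample dict is modelled on the one from the bug report; passing the same function to a DataPipe's map step should work similarly, but that wiring is an assumption and is not shown.

```python
def strip_leading_dot(sample):
    """Return a copy of `sample` whose keys have one leading '.' removed.

    Keys without a leading '.', such as '__key__', are left unchanged.
    """
    return {
        (key[1:] if key.startswith(".") else key): value
        for key, value in sample.items()
    }


# Stand-in for the samples produced by the pipeline (values shortened to bytes).
samples = [{"__key__": "/myfile", ".patch.jpeg": b"raw bytes"}]

cleaned = list(map(strip_leading_dot, samples))
print(cleaned[0])  # {'__key__': '/myfile', 'patch.jpeg': b'raw bytes'}
```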
Quick question about the docs:
The prefix is used for grouping files into samples,
the suffix is used as key in the output dictionary.
The suffix consists of all components after the last
"." in the filename.
Should this be 'the suffix consists of all components after the first "." in the filename'?
Thanks!
Should this be 'the suffix consists of all components after the first "." in the filename'?
This seems to accurately describe what the implementation is doing. We should either change the doc string or the implementation.
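The difference between the two wordings is easy to see on one of the file names from earlier in the thread. This sketch uses only the standard library: `os.path.splitext` splits at the last ".", while `str.partition` splits at the first one. It illustrates the two rules and is not taken from the DataPipe code.

```python
import os.path

name = "sample1.label.json"

# "After the last dot": the rule the doc string currently states.
last_prefix, last_suffix = os.path.splitext(name)

# "After the first dot": the rule the implementation is described as following.
first_prefix, dot, rest = name.partition(".")
first_suffix = dot + rest

print(last_prefix, last_suffix)    # sample1.label .json
print(first_prefix, first_suffix)  # sample1 .label.json
```

With the last-dot rule, sample1.img.png and sample1.label.json would get different prefixes (sample1.img and sample1.label) and would not be grouped into one sample.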
Please feel free to re-open if there are any additional requests or concerns.