This is a proof-of-concept repository showing how `torch.utils.data.datapipes` can be used as a basis for `torchvision.datasets`.

- `pathlib.Path` should be a first-class citizen for paths.
- `dp.iter.LoadFilesFromDisk` should have a `mode` parameter. Forcing `rb` makes it cumbersome to read from plain text files. Maybe an `opener` parameter that defaults to `open` and respects `mode` would be even better.
- Files loaded with `get_file_binaries_from_pathnames`, which is used in `dp.iter.LoadFilesFromDisk`, are never closed.
- `dp.iter.RoutedDecoder` only accepts `(path, buffer)` inputs, which is not usable for us. Our datasets return a buffer as well as some additional information.
- It feels weird to call `dp.iter.LoadFilesFromDisk` for a single file, which is usually the case for our datasets.
- I'm aware that this is not possible if we are streaming archives, but if that is not the case, we should be able to read specific files from an archive. Some datasets contain metadata in a separate file that should be available as soon as we create the dataset, rather than based on luck when it is streamed with the other files.
- `dp.iter.Map` expects an `IterDataPipe` rather than a more general `Iterable` like the other datapipes.
- Instead of `ReadFilesFrom(Tar|Zip)` there should be a `ReadFilesFromArchive` that automatically detects the underlying archive type.
- `dp.iter.ReadFilesFrom(Tar|Zip)` should be split into `ListFilesIn(Tar|Zip)` and `LoadFilesFrom(Tar|Zip)`. Most datasets define some splits of the data so that only a part of the data has to be loaded at all. It would be a good idea to drop unused files before we load them.
- For some reason `dp.iter.ReadFilesFrom(Tar|Zip)` returns the files in reversed alphabetical order. This makes it awkward to align them with corresponding text files, which are usually read from top to bottom.

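To make the `mode` / `opener` suggestion and the point about unclosed files more concrete, here is a minimal sketch in plain Python. It is not the datapipes API: `load_files` is a hypothetical stand-in for a file-loading datapipe, and its `mode` and `opener` parameters are the ones proposed above, not existing ones.

```python
from pathlib import Path
from typing import IO, Callable, Iterable, Iterator, Tuple, Union


def load_files(
    paths: Iterable[Union[str, Path]],
    mode: str = "rb",
    opener: Callable[..., IO] = open,
) -> Iterator[Tuple[Path, IO]]:
    """Hypothetical loader: yields (path, file) pairs.

    `mode` and `opener` are the proposed parameters. Each file is closed
    as soon as the consumer requests the next item or closes the generator.
    """
    for path in paths:
        path = Path(path)  # accept str and pathlib.Path alike
        with opener(path, mode) as file:
            yield path, file


if __name__ == "__main__":
    # Read this script itself as plain text instead of forcing "rb".
    for path, file in load_files([__file__], mode="r"):
        print(path.name, len(file.readlines()))
```

Because `opener` only has to accept `(path, mode)` and return a context manager, something like `gzip.open` could be passed in the same way. The yielded file is only valid until the next item is requested, which is the price of closing it deterministically.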
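The three archive-related points (automatic type detection, separating listing from loading, and a predictable order) could look roughly like the sketch below. It uses only the standard library; `list_files_in_archive` and `load_files_from_archive` are hypothetical names chosen for illustration, and unlike a real datapipe this version needs random access to the archive, so it does not cover the streaming case.

```python
import tarfile
import zipfile
from pathlib import Path
from typing import Callable, Iterator, List, Tuple, Union


def list_files_in_archive(path: Union[str, Path]) -> List[str]:
    """Return the names of all regular files in a tar or zip archive, sorted."""
    path = str(path)
    if tarfile.is_tarfile(path):
        with tarfile.open(path) as archive:
            names = [m.name for m in archive.getmembers() if m.isfile()]
    elif zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            names = [i.filename for i in archive.infolist() if not i.is_dir()]
    else:
        raise ValueError(f"unsupported archive type: {path}")
    return sorted(names)  # alphabetical, independent of the storage order


def load_files_from_archive(
    path: Union[str, Path],
    keep: Callable[[str], bool] = lambda name: True,
) -> Iterator[Tuple[str, bytes]]:
    """Yield (name, content) only for members accepted by `keep`."""
    path = str(path)
    # Filter on names first so that unused files are never read.
    names = [name for name in list_files_in_archive(path) if keep(name)]
    if tarfile.is_tarfile(path):
        with tarfile.open(path) as archive:
            for name in names:
                yield name, archive.extractfile(name).read()
    else:
        with zipfile.ZipFile(path) as archive:
            for name in names:
                yield name, archive.read(name)
```

A dataset that only needs its training split could then call `load_files_from_archive(path, keep=lambda name: name.startswith("train/"))`, where the `train/` prefix is just an assumed layout.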
Legend:
- ✔️ : Fully working
- ⭕ : Working, but with a significant performance hit
- ❌ : Not working

For ⭕ and ❌, please check out the `README.md` in the corresponding folder for details.

| `torchvision.datasets.` | Status |
|---|---|
| `Caltech101` | ✔️ |
| `Caltech256` | ✔️ |
| `CelebA` | ✔️ |
| `CIFAR10` / `CIFAR100` | ✔️ |
| `CocoDetection` / `CocoCaptions` | ✔️ |
| `VOCDetection` / `VOCSegmentation` | ✔️ |
| `LSUN` | ❌ |
| `ImageNet` | ✔️ |
| `HMDB51` | ✔️ |

- So far, I think the best approach for datasets with related files is to have each individual datapipe yield a key for the data point as well as the data.
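
As a rough illustration of the key-based approach, the sketch below joins two streams of `(key, data)` pairs even when they arrive in different orders. It is plain Python, not a datapipe: `join_by_key` is a hypothetical helper, and the keys and payloads in the usage example are made up.

```python
from itertools import zip_longest
from typing import Any, Dict, Hashable, Iterable, Iterator, Tuple

_MISSING = object()  # sentinel for the shorter stream in zip_longest


def join_by_key(
    left: Iterable[Tuple[Hashable, Any]],
    right: Iterable[Tuple[Hashable, Any]],
) -> Iterator[Tuple[Hashable, Any, Any]]:
    """Yield (key, left_data, right_data) for keys present in both streams.

    Items are buffered until their partner arrives, so the two streams do
    not have to be ordered the same way. Items without a partner stay in
    the buffer and are dropped when both streams are exhausted.
    """
    pending_left: Dict[Hashable, Any] = {}
    pending_right: Dict[Hashable, Any] = {}
    for left_item, right_item in zip_longest(left, right, fillvalue=_MISSING):
        new_keys = []
        if left_item is not _MISSING:
            key, data = left_item
            pending_left[key] = data
            new_keys.append(key)
        if right_item is not _MISSING:
            key, data = right_item
            pending_right[key] = data
            new_keys.append(key)
        # Only keys that just arrived can complete a pair.
        for key in new_keys:
            if key in pending_left and key in pending_right:
                yield key, pending_left.pop(key), pending_right.pop(key)


if __name__ == "__main__":
    # Images arrive in reversed order, annotations from top to bottom.
    images = [("0002", "image 2"), ("0001", "image 1")]
    annotations = [("0001", "label 1"), ("0002", "label 2")]
    for key, image, annotation in join_by_key(images, annotations):
        print(key, image, annotation)
```

In a dataset, the key would typically be derived from the file name, for example the stem of the image path. The buffers grow with the distance between matching items, so this works best when the two streams are at least roughly aligned.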