pytorch/data

Additional basic functions beyond .map to allow for more functional programming

hhoeflin opened this issue · 1 comment

🚀 The feature

For IterDataPipe, .map maps a function over the items of an iterable, where the function has the form

f: Any -> Any

Other basic building blocks could be .pipe, .iter_map and .consume, where

  • .pipe would take f: Iterable -> Iterable
  • .iter_map would take f: Any -> Iterable
  • .consume would take f: Iterable -> Any
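To make the three signatures concrete, here is a minimal sketch on plain Python iterables. The Stream class is a hypothetical stand-in for an IterDataPipe and is not part of torchdata; the method names simply follow the proposal (with the last one spelled consume), and the wrapper is single-pass when the underlying source is.

```python
from typing import Any, Callable, Iterable, Iterator


class Stream:
    """Hypothetical stand-in for an IterDataPipe; not a torchdata class."""

    def __init__(self, source: Iterable[Any]) -> None:
        self._source = source

    def __iter__(self) -> Iterator[Any]:
        return iter(self._source)

    def map(self, fn: Callable[[Any], Any]) -> "Stream":
        # f: Any -> Any, applied to each item
        return Stream(fn(item) for item in self)

    def pipe(self, fn: Callable[[Iterable[Any]], Iterable[Any]]) -> "Stream":
        # f: Iterable -> Iterable, applied to the stream as a whole
        return Stream(fn(self))

    def iter_map(self, fn: Callable[[Any], Iterable[Any]]) -> "Stream":
        # f: Any -> Iterable, each item expands into zero or more items
        return Stream(out for item in self for out in fn(item))

    def consume(self, fn: Callable[[Iterable[Any]], Any]) -> Any:
        # f: Iterable -> Any, ends the pipeline and returns a plain value
        return fn(self)
```

In this sketch .pipe and .consume differ only in what they return: .pipe re-wraps the result so chaining can continue, while .consume hands back whatever the function produced.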

Motivation, pitch

Such an approach would allow for more flexible functional programming and would reduce most of the currently provided IterDataPipe classes to a simple function call. For example:

The Enumerator class would become

dp.pipe(enumerate)

This would immediately make it possible to use the itertools functions in this context.
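As a rough illustration of the itertools point, the sketch below uses a free function pipe (hypothetical, not a torchdata API) on plain iterables. Functions that take only the iterable can be passed directly; those with extra arguments need a lambda or functools.partial.

```python
import itertools
from functools import partial
from typing import Any, Callable, Iterable


def pipe(fn: Callable[[Iterable[Any]], Iterable[Any]], source: Iterable[Any]) -> Iterable[Any]:
    """Hypothetical helper: apply an Iterable -> Iterable function to a source."""
    return fn(source)


# accumulate takes the iterable as its only required argument
running_totals = list(pipe(itertools.accumulate, range(1, 6)))

# islice needs a stop value, so it is wrapped in a lambda
first_three = list(pipe(lambda it: itertools.islice(it, 3), range(1, 6)))

# partial fixes the leading argument; the source is chained after [0]
with_prefix = list(pipe(partial(itertools.chain, [0]), range(1, 3)))
```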

The TarArchiveLoader could become

def iter_from_tar_archive(fd):
    # <code to yield files from tar archive>
    ...

dp.iter_map(iter_from_tar_archive)
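One way the elided generator body might look, using only the standard-library tarfile module. This is a sketch under assumptions, not the actual TarArchiveLoader implementation: it assumes fd is a seekable binary file object, yields (name, bytes) pairs, and uses a hypothetical free function iter_map in place of the proposed method.

```python
import io
import tarfile
from typing import BinaryIO, Callable, Dict, Iterable, Iterator, Tuple


def iter_from_tar_archive(fd: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Yield (member name, content) for every regular file in the archive."""
    with tarfile.open(fileobj=fd, mode="r:*") as archive:
        for member in archive:
            if member.isfile():
                extracted = archive.extractfile(member)
                yield member.name, extracted.read()


def iter_map(fn: Callable[[BinaryIO], Iterable], source: Iterable[BinaryIO]) -> Iterator:
    """Hypothetical helper: expand each item of source via fn (Any -> Iterable)."""
    for item in source:
        yield from fn(item)


def make_tar(files: Dict[str, bytes]) -> io.BytesIO:
    """Build a small in-memory tar archive, only to demonstrate the sketch."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


archives = [make_tar({"a.txt": b"A"}), make_tar({"b.txt": b"B", "c.txt": b"C"})]
members = list(iter_map(iter_from_tar_archive, archives))
```

Here each archive expands into several items, which is exactly the Any -> Iterable shape that .iter_map is meant to cover.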

I believe that with this approach almost all of the provided classes could be written with less boilerplate as generator functions: essentially, the code inside __iter__ becomes a standalone generator function, possibly curried for convenience if it takes other parameters.
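To illustrate the "standalone generator function, possibly curried" idea, here is a sketch of a batching step. The batch function and its parameters are hypothetical and only mimic what a batching class's __iter__ might do; functools.partial supplies the currying.

```python
from functools import partial
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batch(source: Iterable[T], batch_size: int, drop_last: bool = False) -> Iterator[List[T]]:
    """Group consecutive items into lists of length batch_size."""
    current: List[T] = []
    for item in source:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []
    # Emit the final, shorter batch unless the caller asked to drop it
    if current and not drop_last:
        yield current


# Currying: fix the extra parameters, leaving an Iterable -> Iterable function
batch_by_two = partial(batch, batch_size=2)
strict_batch_by_two = partial(batch, batch_size=2, drop_last=True)
```

Both curried functions have the Iterable -> Iterable shape, so under the proposal they could be handed to .pipe directly, for example dp.pipe(batch_by_two).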

It would be great to hear whether this has been considered. Thanks!

Alternatives

The .pipe can already be written as

dp2 = IterableWrapper(enumerate(dp)) 

but I believe this is a lot less nice than the form proposed above:

dp.pipe(enumerate)

Additional context

No response

Just wanted to ping about this issue. It would be great to hear the development team's perspective. Even after looking into it more, it still appears to me that most of the functionality provided could be exposed as individual functions.

It would be great to know if I am missing something or have misunderstood the functionality of torchdata.

Thanks