pytorch/data

An iterator that can stream over stdin

erip opened this issue · 0 comments

erip commented

🚀 The feature

An IterDataPipe which can consume from stdin and automatically re-cycle each epoch.

Motivation, pitch

I'd like to push data augmentation and preprocessing upstream so model training/inference can operate directly on tokens streamed over stdin. This allows for tremendous flexibility without a user needing to hard-code a preprocessing pipeline in userland code. For an NLP use-case, I imagine something like...

paste <(cut -f1 train.tsv | spm_encode --model ...) \
      <(cut -f2 train.tsv) | \
      python train_with_stdin_iter.py --epochs 5

with some code similar to

import sys

from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper


def create_tensor(line):
    # vocab_lookup is a placeholder for whatever token-to-index mapping is in use.
    X, y = line.strip().split("\t")
    return vocab_lookup(X), int(y)


iter_dp = IterableWrapper(sys.stdin).map(create_tensor)
loader = DataLoader(iter_dp)

Alternatives

The preprocessed text could be written to a file which native torchdata constructs could operate on directly. This is fine, but requires a copy of the data to be written to disk.

Additional context

The current code doesn't work because sys.stdin is exhausted once it hits EOF, so iterating the pipe a second time yields nothing and the DataLoader only sees a single epoch's worth of data.