An iterator that can stream over stdin
erip opened this issue · 0 comments
🚀 The feature
An IterDataPipe which can consume from stdin and automatically re-cycle each epoch.
Motivation, pitch
I'd like to push data augmentation and preprocessing upstream so model training/inference can operate directly on tokens streamed over stdin. This allows for tremendous flexibility without a user needing to hard-code a preprocessing pipeline in userland code. For an NLP use-case, I imagine something like...
paste <(cut -f1 train.tsv | spm_encode --model ...) \
<(cut -f2 train.tsv) | \
python train_with_stdin_iter.py --epochs 5
with some code similar to
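import sys
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

def create_tensor(line):
    X, y = line.strip().split("\t")
    return vocab_lookup(X), int(y)

iter_dp = IterableWrapper(sys.stdin).map(create_tensor)
loader = DataLoader(iter_dp)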
Alternatives
The preprocessed text could be written to a file which native torchdata constructs could operate on directly. This is fine, but requires a copy of the data to be written to disk.
Additional context
The current code doesn't work because sys.stdin cannot be re-read once it reaches EOF, so the dataloader only sees a single epoch's worth of data.
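To make the problem concrete, here is a minimal sketch of one possible workaround, not an existing torchdata API. The class name `ReplayableStream` is hypothetical. It spools each line to a temporary file during the first pass and replays that copy on later passes. It assumes the first epoch is consumed in full and that loading happens in a single process. It still writes a copy of the data to disk, so it hides the cost described under Alternatives rather than removing it.

```python
import tempfile


class ReplayableStream:
    """Hypothetical helper: iterate over a one-shot text stream more than once.

    The first pass reads from `stream` and spools every line to a temporary
    file; later passes replay the spooled copy instead of the stream.
    """

    def __init__(self, stream):
        self._stream = stream
        # Anonymous temp file, removed automatically when it is closed.
        self._spool = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
        self._exhausted = False

    def __iter__(self):
        if not self._exhausted:
            # First epoch: pass lines through while saving a copy.
            for line in self._stream:
                self._spool.write(line)
                yield line
            self._exhausted = True
        else:
            # Later epochs: rewind the spooled copy and replay it.
            self._spool.flush()
            self._spool.seek(0)
            yield from self._spool
```

In the training script above, this could replace the bare `sys.stdin`, e.g. `IterableWrapper(ReplayableStream(sys.stdin), deepcopy=False).map(create_tensor)`, so that each new epoch iterates the spooled lines instead of an already exhausted pipe.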