cylondata/cylon

[Util][Python][PyTorch] Support a distributed data sampler

vibhatha opened this issue · 0 comments

At the moment, Cylon doesn't support a distributed data sampler for deep learning systems. PyTorch provides an abstraction that supports various sampling modes. A sampler is bound to a data loader, which uses it to shuffle the data for each epoch according to the chosen sampling criteria. This would be a useful feature for deep learning applications that use Cylon as the data source in an end-to-end, data-analytics-aware data engineering workload. It would also be very useful for feature engineering and exploratory data analytics.

To support this, we can start by providing a set of utils for PyTorch users. These can come under utils/pytorch/...
This shouldn't be a core component of Cylon, as it is not part of a data processing library.
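
As a rough illustration of what such a util might do, here is a minimal pure-Python sketch of the index partitioning idea behind a distributed sampler: every worker shuffles the same index list with a seed derived from the epoch, pads it so it divides evenly, and then takes a strided slice for its own rank. The class name `ShardedSampler` and its parameters are hypothetical and are not part of Cylon or PyTorch; the real `DistributedSampler` in the reference below is built on `torch` and differs in its details.

```python
import math
import random


class ShardedSampler:
    """Hypothetical sketch: yields the row indices one worker reads in an epoch."""

    def __init__(self, num_rows, num_replicas, rank, shuffle=True, seed=0):
        if not 0 <= rank < num_replicas:
            raise ValueError("rank must be in [0, num_replicas)")
        self.num_rows = num_rows
        self.num_replicas = num_replicas
        self.rank = rank
        self.shuffle = shuffle
        self.seed = seed
        self.epoch = 0
        # Every worker gets the same number of samples, so round up.
        self.num_samples = math.ceil(num_rows / num_replicas)
        self.total_size = self.num_samples * num_replicas

    def set_epoch(self, epoch):
        # Call once per epoch so all workers reshuffle consistently.
        self.epoch = epoch

    def __iter__(self):
        indices = list(range(self.num_rows))
        if self.shuffle:
            # Same seed on every worker gives the same permutation.
            random.Random(self.seed + self.epoch).shuffle(indices)
        if indices:
            # Pad by repeating indices until the list divides evenly.
            repeats = math.ceil(self.total_size / len(indices))
            indices = (indices * repeats)[: self.total_size]
        # Strided slice: rank r takes positions r, r + N, r + 2N, ...
        return iter(indices[self.rank : self.total_size : self.num_replicas])

    def __len__(self):
        return self.num_samples


# Example: 10 rows split across 4 workers for two epochs.
for epoch in range(2):
    for rank in range(4):
        sampler = ShardedSampler(num_rows=10, num_replicas=4, rank=rank)
        sampler.set_epoch(epoch)
        print(epoch, rank, list(sampler))
```

In this sketch the padding repeats a few rows so that all workers run the same number of steps; whether a Cylon util should pad or drop the remainder is a design choice left open here.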

References:

https://pytorch.org/docs/stable/_modules/torch/utils/data/distributed.html#DistributedSampler