/streaming

A Data Streaming Library for Efficient Neural Network Training

Primary language: Python. License: Apache License 2.0 (Apache-2.0).



👋 Welcome

Streaming is a PyTorch-compatible dataset that lets you stream training data from cloud-based object stores or read it from local disk. As a drop-in replacement for your PyTorch IterableDataset class, it's easy to get streaming:

dataloader = torch.utils.data.DataLoader(dataset=ImageStreamingDataset(remote='s3://...'))

See the quick start guide and the user guide for details on how to use the Streaming dataset.
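To make the idea of streaming concrete, here is a minimal conceptual sketch in plain Python. It is not the Streaming implementation and does not use the Streaming API: `ShardStream`, `copy_from`, and `demo` are hypothetical names, a local directory stands in for the object store, and shards are assumed to be newline-delimited text files. It only illustrates the general pattern of fetching a shard into a local cache on first use and then reading samples from that cache.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Iterator, List

# Hypothetical names throughout: ShardStream, copy_from, and demo are
# illustrations only and are not part of the Streaming API.


class ShardStream:
    """Yield samples shard by shard, fetching each shard into a local cache on first use."""

    def __init__(
        self,
        shard_names: List[str],
        fetch: Callable[[str, Path], None],
        cache_dir: Path,
    ) -> None:
        self.shard_names = shard_names
        self.fetch = fetch
        self.cache_dir = cache_dir

    def __iter__(self) -> Iterator[str]:
        for name in self.shard_names:
            local_path = self.cache_dir / name
            if not local_path.exists():
                # Fetch only when the shard is not in the local cache yet.
                self.fetch(name, local_path)
            with local_path.open() as shard:
                for line in shard:
                    yield line.rstrip("\n")


def copy_from(remote_dir: Path) -> Callable[[str, Path], None]:
    """Build a fetch function; a directory stands in for the object store."""

    def fetch(name: str, destination: Path) -> None:
        shutil.copy(remote_dir / name, destination)

    return fetch


def demo() -> List[str]:
    """Stream two tiny shards through a temporary cache and return the samples."""
    # Both directories are temporary and are deleted when the block exits,
    # so no data persists after iteration.
    with tempfile.TemporaryDirectory() as remote, tempfile.TemporaryDirectory() as cache:
        remote_dir, cache_dir = Path(remote), Path(cache)
        (remote_dir / "shard0.txt").write_text("a\nb\n")
        (remote_dir / "shard1.txt").write_text("c\n")
        stream = ShardStream(
            ["shard0.txt", "shard1.txt"], copy_from(remote_dir), cache_dir
        )
        return list(stream)


print(demo())
```

Because the cache lives in a temporary directory, nothing is left on disk once iteration finishes, which mirrors the "no persistent storage required" idea at toy scale.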

Key Benefits

  • High-performance, accurate streaming of training data from cloud storage
  • Efficiently train anywhere, independent of training data location
  • Cloud-native, no persistent storage required
  • Enhanced data security: data exists only ephemerally on the training cluster

🚀 Quickstart

💾 Installation

Streaming can be installed with pip:

pip install mosaicml-streaming
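After installing, you may want to confirm that the package is visible to your Python environment. The snippet below is one way to do that with the standard library only; `installed_version` is a hypothetical helper name, not something the package provides.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "mosaicml-streaming") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if the package is installed, otherwise a short notice.
print(installed_version() or "mosaicml-streaming is not installed")
```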

Examples

See our Examples section for end-to-end model training workflows that use Streaming datasets.

📚 Documentation

Getting started guides, examples, API reference, and other useful information can be found in our docs.

💫 Contributors

We welcome any contributions, pull requests, or issues!

To start contributing, see our Contributing page.

P.S.: We're hiring!

โœ๏ธ Citation

@misc{mosaicml2022streaming,
    author = {The Mosaic ML Team},
    title = {streaming},
    year = {2022},
    howpublished = {\url{https://github.com/mosaicml/streaming/}},
}