
AIStore: scalable storage for AI applications


AIStore is a lightweight object storage system that scales out linearly with each added storage node, with a special focus on petascale deep learning.


AIStore (AIS for short) is a built-from-scratch, lightweight storage stack tailored for AI apps. AIS consistently shows balanced I/O distribution and linear scalability across an arbitrary number of clustered servers, producing performance charts like the one below:

[Figure: balanced I/O distribution across a cluster comprising 120 HDDs]

The ability to scale linearly with each added disk was, and remains, one of the main incentives behind AIStore. Much of the development is also driven by the idea of offloading dataset transformation and other I/O-intensive stages of ETL pipelines.

Features

  • scale-out with no downtime and no limit on cluster size;
  • arbitrary number of extremely lightweight access points;
  • highly-available control and data planes, end-to-end data protection, self-healing, n-way mirroring, k/m erasure coding;
  • comprehensive native HTTP REST API to GET and PUT objects, create, destroy, list, transform, copy, and configure buckets, and more (a minimal Go sketch follows this list);
  • Amazon S3 API to run unmodified S3 clients and apps;
  • easy-to-use CLI based on auto-completions;
  • automated cluster rebalancing upon: any changes in cluster membership, drive failures and attachments, bucket renames;
  • ETL offload - the capability to run extract-transform-load workloads on (and by) the storage cluster, close to data; both offline (dataset-to-dataset) and inline transformations via user-defined containers and functions are supported.
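
To make the native REST API above concrete, here is a minimal Go sketch that PUTs and then GETs an object. It assumes a local AIS gateway listening at localhost:8080, the /v1/objects endpoint path, and a pre-created ais:// bucket named "mybucket" - the endpoint address, bucket, and object names are illustrative assumptions, not fixed values:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

func main() {
	// Assumption: an AIS gateway at localhost:8080 and an existing
	// ais:// bucket named "mybucket" (both names are hypothetical).
	url := "http://localhost:8080/v1/objects/mybucket/hello.txt"

	// PUT an object.
	req, err := http.NewRequest(http.MethodPut, url, strings.NewReader("hello, ais"))
	if err != nil {
		panic(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	resp.Body.Close()

	// GET the same object back and print its contents.
	resp, err = http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

Because the API is plain HTTP, the same two requests can be issued with curl or any S3/HTTP client; the Go standard library is used here only for illustration.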

Also, AIStore:

  • can be deployed on any commodity hardware - effectively, on any Linux with a disk;
  • supports single-command infrastructure and software deployment on Google Cloud Platform via ais-k8s GitHub repo;
  • supports Amazon S3, Google Cloud, and Microsoft Azure backends (and all S3, GCS, and Azure-compliant object storages);
  • can ad-hoc attach and "see" (read, write, list, cache, evict) datasets hosted by other AIS clusters;
  • natively supports reading, writing, and listing archives - objects formatted as TAR, TGZ, ZIP (see the archive-read sketch after this list);
  • provides unified global namespace across multiple backends:

[Figure: AIStore - unified global namespace across multiple backends]

  • can be deployed as LRU-based fast cache for remote buckets; can be populated on-demand and/or via prefetch and download APIs;
  • can be used as a standalone highly-available protected storage;
  • includes MapReduce extension for massively parallel resharding of very large datasets;
  • supports existing PyTorch- and TensorFlow-based training models.
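
To illustrate the archive support mentioned above, the following hedged Go sketch reads a single file from inside a TAR-formatted shard over plain HTTP. It assumes an archpath query parameter for addressing a file within an archive, along with the same illustrative gateway, bucket, and shard names as before; consult the AIS documentation for the exact parameter name supported by your version:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Assumption (hypothetical names): bucket "mybucket" holds a TAR shard
	// "shard-001.tar" that contains the file "img0001.jpg"; the archpath
	// query parameter selects that one file inside the archive.
	url := "http://localhost:8080/v1/objects/mybucket/shard-001.tar?archpath=img0001.jpg"

	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	data, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Printf("read %d bytes from inside the shard\n", len(data))
}
```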

AIS runs natively on Kubernetes and features an open format - thus, the freedom to copy or move your data out of AIS at any time using the familiar Linux tar(1), scp(1), rsync(1), and similar.

For the AIStore white paper and design philosophy, an introduction to large-scale deep learning, and the most recently added features, please see AIStore Overview (where you can also find six alternative ways to work with existing datasets). Videos and animated presentations can be found at videos.

Finally, getting started with AIS takes only a few minutes.

Deployment options

There is a vast spectrum of possible deployments, primarily because the essential prerequisites boil down to having Linux with a disk. The options are practically unlimited: from an all-in-one (AIS gateway + AIS target) docker container to a petascale bare-metal cluster of any size, and from a single development VM or workstation to multiple racks of high-end servers.

The table below contains a few concrete examples:

Deployment option                    Targeted audience and objective
Local playground                     AIS developers and development; Linux or macOS
Minimal production-ready deployment  Uses a preinstalled docker image; targets first-time users and researchers who can immediately start training models on smaller datasets
Easy automated GCP/GKE deployment    Developers, first-time users, AI researchers
Large-scale production deployment    Requires Kubernetes; provided (documented, automated) via a separate repository: ais-k8s

Further, there's the capability referred to as global namespace. Simply put, as long as there’s HTTP connectivity, AIS clusters can be easily interconnected to “see” - i.e., list, read, write, cache, evict - each other's datasets. This ad-hoc capability, in turn, makes it possible to start small and gradually/incrementally build high-performance shared storage comprising petabytes.

For detailed discussion on supported deployments, please refer to Getting Started.

License

MIT

Author

Alex Aizman (NVIDIA)