Armada is a multi-Kubernetes-cluster batch job scheduler.
Users submit jobs, each expressed as a Kubernetes pod spec plus Armada-specific metadata, to a central Armada server. Armada stores jobs in user- or project-specific queues that are backed by a specialized high-throughput storage layer. Armada manages several Kubernetes worker clusters, to which queued jobs are dispatched.
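The flow described above (submit a job, hold it in a per-user or per-project queue, dispatch it to a worker cluster) can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions and does not reflect Armada's actual API, storage layer, or scheduling algorithm: the names `Job`, `Cluster`, `ToyScheduler`, and `cpu_request` are hypothetical, the `priority` field is an invented example of metadata, CPU requests are assumed to be whole cores, and the round-robin placement policy is chosen purely for brevity.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    """A job: a Kubernetes pod spec plus scheduler-specific metadata (hypothetical shape)."""
    queue: str         # name of the user- or project-specific queue
    pod_spec: dict     # Kubernetes pod spec, kept as a plain dict in this sketch
    priority: int = 0  # illustrative metadata field (an assumption, unused below)


@dataclass
class Cluster:
    """A worker cluster, reduced to a name and a free-CPU counter."""
    name: str
    free_cpu: int
    running: list = field(default_factory=list)


def cpu_request(pod_spec):
    """Sum the CPU requests of all containers (assumes whole cores, e.g. "2")."""
    return sum(int(c["resources"]["requests"]["cpu"]) for c in pod_spec["containers"])


class ToyScheduler:
    """Central server stand-in: holds per-queue FIFO queues and dispatches to clusters."""

    def __init__(self, clusters):
        self.clusters = clusters
        self.queues = {}  # queue name -> deque of Job

    def submit(self, job):
        """Append the job to its queue, creating the queue on first use."""
        self.queues.setdefault(job.queue, deque()).append(job)

    def dispatch(self):
        """Visit queues round-robin; place each head job on the first cluster with room.

        Returns a list of (queue name, cluster name) pairs in placement order.
        Stops once a full pass over the queues places nothing.
        """
        placed = []
        progress = True
        while progress:
            progress = False
            for name, q in self.queues.items():
                if not q:
                    continue
                need = cpu_request(q[0].pod_spec)
                for cluster in self.clusters:
                    if cluster.free_cpu >= need:
                        job = q.popleft()
                        cluster.free_cpu -= need
                        cluster.running.append(job)
                        placed.append((name, cluster.name))
                        progress = True
                        break
        return placed
```

In this model, a caller would build a pod spec dict with a `containers` list, wrap it in a `Job` together with a queue name, call `submit`, and then call `dispatch` to see which cluster each job lands on. Jobs that fit on no cluster simply stay queued.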
Armada is designed to operate at scale and to address the following issues:
- A single Kubernetes cluster cannot be scaled indefinitely, and managing very large Kubernetes clusters is challenging. Hence, Armada is a multi-cluster scheduler built on top of several single-cluster schedulers, e.g., the vanilla Kubernetes scheduler or Volcano.
- Achieving very high throughput using the in-cluster storage backend, etcd, is challenging. Hence, queueing and scheduling are performed partly out-of-cluster using a specialized storage layer (i.e., Armada does not primarily rely on etcd).
Further, Armada is designed primarily for machine learning, AI, and data analytics workloads, and to:
- Manage compute clusters composed of tens of thousands of nodes in total.
- Schedule a thousand or more pods per second, on average.
- Enqueue tens of thousands of jobs over a few seconds.
- Divide resources fairly between users.
- Provide visibility for users and admins.
- Ensure near-constant uptime.
Armada is a CNCF Sandbox project that runs in production at G-Research and is actively developed.
For an overview of Armada, see this video.
For an overview of the architecture and design of Armada, and instructions for submitting jobs, see:
For instructions on how to set up and develop Armada, see:
For the API reference, see:
We expect readers of the documentation to have a basic understanding of Docker and Kubernetes; see, e.g., the following links: