/bagua

Bagua is a flexible and performant distributed training algorithm development framework.

Primary LanguagePythonMIT LicenseMIT

Bagua

Generic badge Documentation Status PyPI version Docker GitHub license

Bagua is a distributed training utility developed by Kuaishou Technology and DS3 Lab@ETH. Users can extend the training on a single GPU to multi-GPUs (may across multiple machines), with excellent speedup guarantee, by simply adding a few lines of code. Bagua also provides a flexible system abstraction that supports state-of-the-art system relaxation techniques of distributed training. Powered by the new system design, Bagua has a great ability to implement and extend various state-of-the-art distributed learning algorithms. Researchers can easily develop new distributed training algorithms based on bagua, without sacrificing system performance.

So far, Bagua has integrated primitives including

  • Centralized Synchronous Communication (AllReduce)
  • Decentralized Synchronous Communication
  • Low Precision Communication

Its effectiveness has been verified in various scenarios, including VGG and ResNet on ImageNet, Bert Large and many industrial applications at Kuaishou.

The underlying communication execution engine is in bagua-core, a library written in Rust.

Performance

The scalability of different systems on VGG16 with up to 128 GPUs.



Epoch time of BERT-Large Finetune under different network conditions for different systems.

For more comprehensive and up to date results, refer to Bagua benchmark page.

Installation

Develop version:

pip install git+https://github.com/BaguaSys/bagua.git

Release version:

pip install bagua

Build API documentation locally

pip install -r docs/doc-requirements.txt
make html

Cite Bagua

@misc{gan2021bagua,
  title={BAGUA: Scaling up Distributed Learning with System Relaxations}, 
  author={Shaoduo Gan and Xiangru Lian and Rui Wang and Jianbin Chang and Chengjun Liu and Hongmei Shi and Shengzhuo Zhang and Xianghong Li and Tengxu Sun and Jiawei Jiang and Binhang Yuan and Sen Yang and Ji Liu and Ce Zhang},
  year={2021},
  eprint={2107.01499},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}

@book{liu2020distributed,
  title={Distributed Learning Systems with First-Order Methods: An Instruction},
  author={Liu, J. and Zhang, C.},
  isbn={9781680837018},
  series={Foundations and trends in databases},
  url={https://books.google.com/books?id=vzQmzgEACAAJ},
  year={2020},
  publisher={now publishers}
}

Links