Important

Our multi-node cluster training product is in early preview and not yet generally available. Please contact us for access.


Modal Multinode Training Guide

Well-documented examples of running distributed training jobs on Modal. Use this repository to learn how to build your own distributed training jobs on Modal.

Examples

  • resnet50/: trains a ResNet50 model on the ImageNet dataset.
  • nanoGPT/: trains Karpathy's nanoGPT reproduction of OpenAI's GPT-2.
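
Both examples are multi-node jobs, so each node needs to know its place in the cluster before training starts. The sketch below shows one common way to express that bookkeeping, assuming a torchrun-style launcher. This README does not say which launcher the examples use, and `ClusterLayout` and `torchrun_args` are hypothetical helpers for illustration, not part of Modal's API or of this repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterLayout:
    """Hypothetical description of a multi-node job (not a Modal type)."""

    n_nodes: int          # number of machines in the cluster
    gpus_per_node: int    # worker processes launched on each machine
    node_rank: int        # index of this machine, 0 .. n_nodes - 1
    master_addr: str      # address of the rank-0 node used for rendezvous
    master_port: int = 29500  # assumed port; any free port works

    @property
    def world_size(self) -> int:
        """Total number of worker processes across all nodes."""
        return self.n_nodes * self.gpus_per_node

    def global_rank(self, local_rank: int) -> int:
        """Map a per-node worker index to its cluster-wide rank."""
        if not 0 <= local_rank < self.gpus_per_node:
            raise ValueError(f"local_rank {local_rank} is out of range")
        return self.node_rank * self.gpus_per_node + local_rank


def torchrun_args(layout: ClusterLayout, script: str) -> list[str]:
    """Build the torchrun command line for one node of the cluster."""
    return [
        "torchrun",
        f"--nnodes={layout.n_nodes}",
        f"--nproc_per_node={layout.gpus_per_node}",
        f"--node_rank={layout.node_rank}",
        f"--master_addr={layout.master_addr}",
        f"--master_port={layout.master_port}",
        script,
    ]


if __name__ == "__main__":
    # Assumed values: 2 nodes with 8 GPUs each, viewed from the second node.
    layout = ClusterLayout(n_nodes=2, gpus_per_node=8, node_rank=1,
                           master_addr="10.0.0.1")
    print(layout.world_size)
    print(" ".join(torchrun_args(layout, "train.py")))
```

Every node runs the same command with a different `--node_rank`. How the node rank and the rank-0 address are obtained depends on the platform and is left out of this sketch.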

Documentation

The multi-node training guide is currently available on Notion: modal-com.notion.site/Multi-node-docs.

Other relevant documentation in our guide:

Demo

(Video: multi-node ResNet50 training demo.)

License

The MIT license.