distributed-training

There are 218 repositories under the distributed-training topic.

GokuMohandas/Made-With-ML
Learn how to design, develop, deploy and iterate on production-grade ML applications.
Language:Jupyter Notebook44.3k 1.3k 746.9k
huggingface/pytorch-image-models
The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more
Language:Python35.7k 315 1k5.1k
PaddlePaddle/Paddle
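PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (PaddlePaddle core framework: high-performance single-machine and distributed training, plus cross-platform deployment, for deep learning & machine learning)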
Language:C++23.4k 710 19.1k5.9k
PaddlePaddle/PaddleNLP
Easy-to-use and powerful LLM and SLM library with awesome model zoo.
Language:Python12.8k 96 3.8k3.1k
Netflix/metaflow
Build, Manage and Deploy AI/ML Systems
Language:Python9.6k 289 742928
skypilot-org/skypilot
Run, manage, and scale AI workloads on any AI infrastructure. Use one system to access & manage all AI compute (Kubernetes, 17+ clouds, or on-prem).
Language:Python8.9k 69 2.9k838
IDEA-CCNL/Fengshenbang-LM
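Fengshenbang-LM (封神榜大模型) is an open-source large-model ecosystem led by the Cognitive Computing and Natural Language Research Center of the IDEA Research Institute, intended to serve as infrastructure for Chinese AIGC and cognitive intelligence.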
Language:Python4.1k 56 301383
FedML-AI/FedML
FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.
Language:Python4k 91 330761
bytedance/byteps
A high performance and generic framework for distributed DNN training
Language:Python3.7k 81 267494
tensorflow/adanet
Fast and flexible AutoML with learning guarantees.
Language:Jupyter Notebook3.5k 166 114530
determined-ai/determined
Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.
Language:Go3.2k 78 392369
alpa-projects/alpa
Training and serving large-scale neural networks with auto parallelization.
Language:Python3.2k 45 297354
learning-at-home/hivemind
Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world.
Language:Python2.3k 54 175204
intelligent-machine-learning/dlrover
DLRover: An Automatic Distributed Deep Learning System
Language:Python1.6k 49 305198
pytorch/gloo
Collective communications library with various primitives for multi-machine training.
Language:C++1.4k 61 132338
tensorlayer/HyperPose
Library for Fast and Flexible Human Pose Estimation
Language:Python1.3k 55 184275
DeepRec-AI/DeepRec
DeepRec is a high-performance recommendation deep learning framework based on TensorFlow. It is hosted in incubation in LF AI & Data Foundation.
Language:C++1.1k 33 130357
mryab/efficient-dl-systems
Efficient Deep Learning Systems course materials (HSE, YSDA)
Language:Jupyter Notebook921 13 4139
alibaba/Megatron-LLaMA
Best practice for training LLaMA models in Megatron-LM
Language:Python659 6 6456
sail-sg/oat
🌾 OAT: A research-friendly framework for LLM online alignment, including reinforcement learning, preference learning, etc.
Language:Python562 8 1447
LambdaLabsML/distributed-training-guide
Best practices & guides on how to write distributed PyTorch training code
Language:Python533 8 4653
Guitaricet/relora
Official code for ReLoRA from the paper Stack More Layers Differently: High-Rank Training Through Low-Rank Updates
Language:Jupyter Notebook467 8 1741
petuum/adaptdl
Resource-adaptive cluster scheduler for deep learning training.
Language:Python449 9 5879
Oneflow-Inc/libai
LiBai(李白): A Toolbox for Large-Scale Distributed Parallel Training
Language:Python407 40 7957
meta-pytorch/torchx
TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.
Language:Python399 17 202147
aws-samples/awsome-distributed-training
Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.
Language:Shell366 13 236150
DataCanvasIO/HyperGBM
A full pipeline AutoML tool for tabular data
Language:Python359 12 5547
PKU-DAIR/Hetu
A high-performance distributed deep learning system targeting large-scale and automated distributed training.
Language:Python326 8 239
maudzung/YOLO3D-YOLOv4-PyTorch
YOLO3D: End-to-end real-time 3D Oriented Object Bounding Box Detection from LiDAR Point Cloud (ECCV 2018)
Language:Python310 10 646
DeNA/HandyRL
HandyRL is a handy and simple framework based on Python and PyTorch for distributed reinforcement learning that is applicable to your own environments.
Language:Python300 12 2543
HenryNdubuaku/nanodl
A Jax-based library for building transformers, includes implementations of GPT, Gemma, LlaMa, Mixtral, Whisper, SWin, ViT and more.
Language:Python29712
lsds/KungFu
Fast and Adaptive Distributed Machine Learning for TensorFlow, PyTorch and MindSpore.
Language:Go296 22 4459
foundation-model-stack/fms-fsdp
🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash attention v2.
Language:Python271 11 4445
alibaba/EasyParallelLibrary
Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training.
Language:Python270 11 1049
awslabs/deeplearning-cfn
Distributed Deep Learning on AWS Using CloudFormation (CFN), MXNet and TensorFlow
Language:Python252 36 12103
dougsouza/pytorch-sync-batchnorm-example
How to use Cross Replica / Synchronized Batchnorm in PyTorch
247 5 325
