distributed-training
There are 169 repositories under the distributed-training topic.
GokuMohandas/Made-With-ML
Learn how to design, develop, deploy and iterate on production-grade ML applications.
huggingface/pytorch-image-models
The largest collection of PyTorch image encoders / backbones, including training, evaluation, inference, and export scripts, and pretrained weights -- ResNet, ResNeXt, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more
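For orientation, the library's entry point is timm.create_model (a minimal sketch; the model name and input shape are illustrative, and any architecture listed above can be substituted):

```python
import timm
import torch

# Build a pretrained backbone by name from the timm collection.
model = timm.create_model('resnet50', pretrained=True)
model.eval()

# Run inference on a dummy batch shaped for the model's default config.
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    logits = model(x)
print(logits.shape)  # torch.Size([1, 1000]) for an ImageNet-1k head
```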
PaddlePaddle/Paddle
PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the 『飞桨』 core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning & machine learning)
PaddlePaddle/PaddleNLP
👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting a wide range of NLP tasks from research to industrial applications, including 🗂 Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis, etc.
skypilot-org/skypilot
SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
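A minimal launch through SkyPilot's Python API looks roughly like this (a hedged sketch based on the project's quickstart; the accelerator spec, setup command, and script name are illustrative assumptions):

```python
import sky

# Define a task: `setup` runs once when the cluster is provisioned,
# `run` is the actual job command.
task = sky.Task(
    setup='pip install torch',   # illustrative setup step
    run='python train.py',       # hypothetical training script
)

# Request accelerators; SkyPilot picks a cloud/region with availability.
task.set_resources(sky.Resources(accelerators='A100:8'))

# Provision a cluster (or reuse one with this name) and run the task on it.
sky.launch(task, cluster_name='train-cluster')
```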
FedML-AI/FedML
FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI job on any GPU cloud or on-premises cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.
IDEA-CCNL/Fengshenbang-LM
Fengshenbang-LM (封神榜大模型) is an open-source large-model ecosystem led by the Cognitive Computing and Natural Language Research Center of the IDEA Institute, serving as infrastructure for Chinese AIGC and cognitive intelligence.
bytedance/byteps
A high-performance, generic framework for distributed DNN training
tensorflow/adanet
Fast and flexible AutoML with learning guarantees.
alpa-projects/alpa
Training and serving large-scale neural networks with auto parallelization.
determined-ai/determined
Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.
learning-at-home/hivemind
Decentralized deep learning in PyTorch. Built to train models across thousands of volunteer machines around the world.
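The quickstart pattern from hivemind's README looks roughly like this (a sketch; the run_id and batch sizes are illustrative):

```python
import torch
import hivemind

# A DHT node lets peers anywhere on the internet discover each other.
dht = hivemind.DHT(start=True)

model = torch.nn.Linear(10, 2)
base_opt = torch.optim.SGD(model.parameters(), lr=0.1)

# hivemind.Optimizer averages updates across all volunteers that join
# the same run_id, accumulating gradients toward a shared global batch.
opt = hivemind.Optimizer(
    dht=dht,
    run_id='my_run',          # peers with the same id train together
    optimizer=base_opt,
    batch_size_per_step=32,   # local samples processed per step
    target_batch_size=10000,  # global batch before an averaging round
)
```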
intelligent-machine-learning/dlrover
DLRover: An Automatic Distributed Deep Learning System
tensorlayer/HyperPose
Library for Fast and Flexible Human Pose Estimation
DeepRec-AI/DeepRec
DeepRec is a high-performance recommendation deep learning framework based on TensorFlow. It is hosted in incubation in LF AI & Data Foundation.
mryab/efficient-dl-systems
Efficient Deep Learning Systems course materials (HSE, YSDA)
alibaba/Megatron-LLaMA
Best practice for training LLaMA models in Megatron-LM
Guitaricet/relora
Official code for ReLoRA from the paper Stack More Layers Differently: High-Rank Training Through Low-Rank Updates
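The core ReLoRA idea, a frozen full-rank weight plus a trainable low-rank delta that is periodically merged back in and reinitialized, can be sketched in a few lines (a simplified illustration, not the official code; the class name, rank, and init scales are assumptions):

```python
import torch
import torch.nn as nn

class ReLoRALinear(nn.Module):
    """Hypothetical sketch of a ReLoRA-style linear layer."""
    def __init__(self, in_f: int, out_f: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(in_f, out_f, bias=False)
        self.base.weight.requires_grad_(False)           # frozen full-rank weight
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, rank))  # delta starts at zero

    def forward(self, x):
        return self.base(x) + x @ self.A.t() @ self.B.t()

    @torch.no_grad()
    def merge_and_reset(self):
        # Fold the low-rank update into the base weight, then restart the
        # factors so the next phase learns a fresh low-rank direction;
        # repeating this accumulates a high-rank total update.
        self.base.weight += self.B @ self.A
        self.A.normal_(std=0.01)
        self.B.zero_()
```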
petuum/adaptdl
Resource-adaptive cluster scheduler for deep learning training.
Oneflow-Inc/libai
LiBai (李白): A Toolbox for Large-Scale Distributed Parallel Training
DataCanvasIO/HyperGBM
A full pipeline AutoML tool for tabular data
pytorch/torchx
TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.
LambdaLabsML/distributed-training-guide
Best practices & guides on how to write distributed pytorch training code
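To anchor the terminology such guides use, here is a minimal single-node DDP skeleton (a sketch, assuming launch via torchrun so RANK/LOCAL_RANK are set in the environment; the model and loss are placeholders):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE per worker process.
    dist.init_process_group(backend='nccl')
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(32, 4).cuda()
    model = DDP(model, device_ids=[local_rank])  # installs all-reduce hooks
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(100):
        x = torch.randn(64, 32, device='cuda')
        loss = model(x).square().mean()
        opt.zero_grad()
        loss.backward()   # gradients are averaged across ranks here
        opt.step()

    dist.destroy_process_group()

if __name__ == '__main__':
    main()  # launch with: torchrun --nproc_per_node=8 this_script.py
```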
lsds/KungFu
Fast and Adaptive Distributed Machine Learning for TensorFlow, PyTorch and MindSpore.
maudzung/YOLO3D-YOLOv4-PyTorch
YOLO3D: End-to-end real-time 3D Oriented Object Bounding Box Detection from LiDAR Point Cloud (ECCV 2018)
DeNA/HandyRL
HandyRL is a handy and simple framework based on Python and PyTorch for distributed reinforcement learning that is applicable to your own environments.
HMUNACHI/nanodl
A JAX-based library for designing and training transformer models from scratch.
PKU-DAIR/Hetu
A high-performance distributed deep learning system targeting large-scale and automated distributed training.
alibaba/EasyParallelLibrary
Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training.
awslabs/deeplearning-cfn
Distributed Deep Learning on AWS Using CloudFormation (CFN), MXNet and TensorFlow
dougsouza/pytorch-sync-batchnorm-example
How to use cross-replica / synchronized BatchNorm in PyTorch
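PyTorch now ships a built-in helper for this; a minimal sketch, assuming the process group is already initialized (e.g. under torchrun):

```python
import os
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets LOCAL_RANK for each worker process.
local_rank = int(os.environ.get('LOCAL_RANK', 0))

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3),
    torch.nn.BatchNorm2d(16),   # per-GPU statistics by default
    torch.nn.ReLU(),
)

# Replace every BatchNorm with SyncBatchNorm so mean/variance are computed
# across all replicas; do this before wrapping the model in DDP.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])
```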
synxlin/deep-gradient-compression
[ICLR 2018] Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training
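The sparsification step at the heart of DGC, sending only the largest gradient entries while accumulating the rest locally, can be sketched as follows (a simplified illustration that omits the paper's momentum correction and warm-up details):

```python
import torch

def topk_compress(grad: torch.Tensor, residual: torch.Tensor, ratio: float = 0.01):
    """Keep the top `ratio` fraction of gradient entries by magnitude;
    everything else is held back in `residual` for later steps."""
    acc = grad + residual                  # add back previously unsent mass
    k = max(1, int(acc.numel() * ratio))
    _, idx = acc.abs().flatten().topk(k)   # indices of the largest entries
    values = acc.flatten()[idx]

    sparse = torch.zeros_like(acc).flatten()
    sparse[idx] = values                   # what actually gets communicated
    residual.copy_((acc.flatten() - sparse).view_as(acc))  # local carry-over
    return sparse.view_as(acc)
```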
foundation-model-stack/fms-fsdp
🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and the SDPA implementation of FlashAttention v2.
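The two PyTorch-native pieces named there look roughly like this in isolation (a sketch; the repo's actual training stack adds sharding policies, activation checkpointing, and more, and FSDP assumes torch.distributed is already initialized):

```python
import torch
import torch.nn.functional as F
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# SDPA dispatches to a FlashAttention-2 kernel on supported GPUs.
q = k = v = torch.randn(2, 8, 128, 64, device='cuda', dtype=torch.bfloat16)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# FSDP shards parameters, gradients, and optimizer state across ranks.
model = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8).cuda()
model = FSDP(model)
```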
wenwei202/terngrad
Ternary Gradients to Reduce Communication in Distributed Deep Learning (TensorFlow)
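TernGrad's quantizer maps each gradient to {-1, 0, +1} scaled by its maximum magnitude, using stochastic rounding so the estimate stays unbiased (a simplified PyTorch sketch of the paper's rule, though the official code is TensorFlow):

```python
import torch

def ternarize(grad: torch.Tensor) -> torch.Tensor:
    """Stochastically quantize a gradient to s * {-1, 0, +1} with
    s = max(|grad|); E[ternarize(g)] == g, so the estimator is unbiased."""
    s = grad.abs().max()
    if s == 0:
        return torch.zeros_like(grad)
    p = grad.abs() / s            # probability of sending a nonzero value
    mask = torch.bernoulli(p)     # stochastic rounding
    return s * grad.sign() * mask
```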
chairc/Integrated-Design-Diffusion-Model
IDDM (industrial, landscape, animated, spectrogram, ...), supporting DDPM, DDIM, PLMS, a web UI, and distributed training. A PyTorch implementation of diffusion models and generative models with distributed training.
ZJU-OpenKS/OpenKS
OpenKS - a domain-generalizable knowledge learning and computation platform