distributed-training

There are 202 repositories under the distributed-training topic.

  • GokuMohandas/Made-With-ML

    Learn how to design, develop, deploy, and iterate on production-grade ML applications.

    Language: Jupyter Notebook
  • huggingface/pytorch-image-models

    The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more

    Language: Python
  • PaddlePaddle/Paddle

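    PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the PaddlePaddle (飞桨) core framework: high-performance single-machine and distributed training, and cross-platform deployment, for deep learning and machine learning)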

    Language: C++
  • PaddlePaddle/PaddleNLP

    Easy-to-use and powerful LLM and SLM library with awesome model zoo.

    Language: Python
  • skypilot-org/skypilot

    Run, manage, and scale AI workloads on any AI infrastructure. Use one system to access & manage all AI compute (Kubernetes, 17+ clouds, or on-prem).

    Language: Python
  • IDEA-CCNL/Fengshenbang-LM

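    Fengshenbang-LM (封神榜大模型) is an open-source large-model ecosystem led by the Cognitive Computing and Natural Language research center of the IDEA Research Institute, intended as infrastructure for Chinese AIGC and cognitive intelligence.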

    Language: Python
  • FedML-AI/FedML

    FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.

    Language: Python
  • bytedance/byteps

    A high-performance, generic framework for distributed DNN training

    Language: Python
  • tensorflow/adanet

    Fast and flexible AutoML with learning guarantees.

    Language: Jupyter Notebook
  • determined-ai/determined

    Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.

    Language: Go
  • alpa-projects/alpa

    Training and serving large-scale neural networks with auto parallelization.

    Language: Python
  • learning-at-home/hivemind

    Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world.

    Language: Python
  • intelligent-machine-learning/dlrover

    DLRover: An Automatic Distributed Deep Learning System

    Language: Python
  • pytorch/gloo

    Collective communications library with various primitives for multi-machine training.

    Language: C++
  • tensorlayer/HyperPose

    Library for Fast and Flexible Human Pose Estimation

    Language: Python
  • DeepRec-AI/DeepRec

    DeepRec is a high-performance deep learning framework for recommendation, based on TensorFlow. It is hosted as an incubation project at the LF AI & Data Foundation.

    Language: C++
  • mryab/efficient-dl-systems

    Efficient Deep Learning Systems course materials (HSE, YSDA)

    Language: Jupyter Notebook
  • alibaba/Megatron-LLaMA

    Best practices for training LLaMA models in Megatron-LM

    Language: Python
  • LambdaLabsML/distributed-training-guide

    Best practices & guides on how to write distributed PyTorch training code

    Language: Python
  • Guitaricet/relora

    Official code for ReLoRA from the paper "Stack More Layers Differently: High-Rank Training Through Low-Rank Updates"

    Language: Jupyter Notebook
  • sail-sg/oat

    🌾 OAT: A research-friendly framework for LLM online alignment, including reinforcement learning, preference learning, etc.

    Language: Python
  • petuum/adaptdl

    Resource-adaptive cluster scheduler for deep learning training.

    Language: Python
  • Oneflow-Inc/libai

    LiBai (李白): A Toolbox for Large-Scale Distributed Parallel Training

    Language: Python
  • pytorch/torchx

    TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.

    Language: Python
  • DataCanvasIO/HyperGBM

    A full-pipeline AutoML tool for tabular data

    Language: Python
  • aws-samples/awsome-distributed-training

    Collection of best practices, reference architectures, model training examples, and utilities for training large models on AWS.

    Language: Shell
  • PKU-DAIR/Hetu

    A high-performance distributed deep learning system targeting large-scale and automated distributed training.

    Language: Python
  • maudzung/YOLO3D-YOLOv4-PyTorch

    YOLO3D: End-to-end real-time 3D Oriented Object Bounding Box Detection from LiDAR Point Cloud (ECCV 2018)

    Language: Python
  • DeNA/HandyRL

    HandyRL is a handy and simple framework based on Python and PyTorch for distributed reinforcement learning that is applicable to your own environments.

    Language: Python
  • lsds/KungFu

    Fast and Adaptive Distributed Machine Learning for TensorFlow, PyTorch and MindSpore.

    Language: Go
  • HenryNdubuaku/nanodl

    A JAX-based library for building transformers; includes implementations of GPT, Gemma, LLaMA, Mixtral, Whisper, Swin, ViT, and more.

    Language: Python
  • alibaba/EasyParallelLibrary

    Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training.

    Language: Python
  • foundation-model-stack/fms-fsdp

    🚀 Efficient (pre)training of foundation models with native PyTorch features, including FSDP for training and the SDPA implementation of FlashAttention v2.

    Language: Python
  • awslabs/deeplearning-cfn

    Distributed Deep Learning on AWS Using CloudFormation (CFN), MXNet and TensorFlow

    Language: Python
  • dougsouza/pytorch-sync-batchnorm-example

    How to use Cross-Replica / Synchronized BatchNorm in PyTorch

  • chairc/Integrated-Design-Diffusion-Model

    IDDM (Industrial, landscape, animate, latent diffusion); supports LDM, DDPM, DDIM, PLMS, a web UI, and distributed training. A PyTorch implementation of diffusion models, generative models, and distributed training.

    Language: Python