distributed-training

There are 169 repositories under the distributed-training topic.

  • GokuMohandas/Made-With-ML

    Learn how to design, develop, deploy and iterate on production-grade ML applications.

    Language: Jupyter Notebook
  • huggingface/pytorch-image-models

    The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXt, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more

    Language: Python
  • PaddlePaddle/Paddle

    PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the core PaddlePaddle framework: high-performance single-machine and distributed training for deep learning and machine learning, with cross-platform deployment)

    Language: C++
  • PaddlePaddle/PaddleNLP

    👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.

    Language: Python
  • skypilot-org/skypilot

    SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.

    Language: Python
  • FedML-AI/FedML

    FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.

    Language: Python
  • IDEA-CCNL/Fengshenbang-LM

    Fengshenbang-LM (封神榜) is an open-source family of large models led by the Cognitive Computing and Natural Language Research Center of IDEA Research Institute, serving as infrastructure for Chinese AIGC and cognitive intelligence.

    Language: Python
  • bytedance/byteps

    A high-performance, generic framework for distributed DNN training

    Language: Python
  • tensorflow/adanet

    Fast and flexible AutoML with learning guarantees.

    Language: Jupyter Notebook
  • alpa-projects/alpa

    Training and serving large-scale neural networks with auto parallelization.

    Language: Python
  • determined-ai/determined

    Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.

    Language: Go
  • learning-at-home/hivemind

    Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world.

    Language: Python
  • intelligent-machine-learning/dlrover

    DLRover: An Automatic Distributed Deep Learning System

    Language: Python
  • tensorlayer/HyperPose

    Library for Fast and Flexible Human Pose Estimation

    Language: Python
  • DeepRec-AI/DeepRec

    DeepRec is a high-performance recommendation deep learning framework based on TensorFlow. It is hosted as an incubation project in the LF AI & Data Foundation.

    Language: C++
  • mryab/efficient-dl-systems

    Efficient Deep Learning Systems course materials (HSE, YSDA)

    Language: Jupyter Notebook
  • alibaba/Megatron-LLaMA

    Best practice for training LLaMA models in Megatron-LM

    Language: Python
  • Guitaricet/relora

    Official code for ReLoRA from the paper Stack More Layers Differently: High-Rank Training Through Low-Rank Updates

    Language: Jupyter Notebook
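    The idea in the paper's title can be sketched in a few lines of NumPy: each low-rank delta B @ A has rank at most r, but merging it into the frozen weight and restarting the factors lets the accumulated update grow in rank. Shapes, restart count, and the stand-in "training" step below are illustrative assumptions, not the repository's actual code.

    ```python
    import numpy as np

    d, r, restarts = 8, 2, 3
    rng = np.random.default_rng(0)
    W = rng.standard_normal((d, d))   # frozen full-rank weight
    W0 = W.copy()

    for _ in range(restarts):
        A = rng.standard_normal((r, d)) * 0.01
        B = np.zeros((d, r))          # delta B @ A starts at exactly zero
        # ... optimizer steps would train A and B here; this stands in for them:
        B += rng.standard_normal((d, r)) * 0.01
        W = W + B @ A                 # merge the rank-<=r delta, then restart

    # Each merged delta has rank <= r, so after 3 restarts the accumulated
    # update can reach rank r * restarts = 6 -- higher than any single delta.
    total_rank = np.linalg.matrix_rank(W - W0)
    ```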
  • petuum/adaptdl

    Resource-adaptive cluster scheduler for deep learning training.

    Language: Python
  • Oneflow-Inc/libai

    LiBai(李白): A Toolbox for Large-Scale Distributed Parallel Training

    Language: Python
  • DataCanvasIO/HyperGBM

    A full pipeline AutoML tool for tabular data

    Language: Python
  • pytorch/torchx

    TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.

    Language: Python
  • LambdaLabsML/distributed-training-guide

    Best practices and guides on how to write distributed PyTorch training code

    Language: Python
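    As a taste of what such guides cover, here is a minimal DistributedDataParallel setup, run as a single process on CPU with the gloo backend so it works without GPUs (the toy model, port, and hyperparameters are arbitrary assumptions):

    ```python
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Rendezvous info normally set by a launcher such as torchrun.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = torch.nn.Linear(10, 1)
    ddp_model = DDP(model)            # wraps the model; syncs grads across ranks
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    x = torch.randn(4, 10)
    loss = ddp_model(x).pow(2).mean()
    loss.backward()                   # DDP all-reduces gradients here
    opt.step()
    dist.destroy_process_group()
    ```

    A real launcher (e.g. torchrun) would supply the per-process rank and world size instead of the hard-coded values above.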
  • lsds/KungFu

    Fast and Adaptive Distributed Machine Learning for TensorFlow, PyTorch and MindSpore.

    Language: Go
  • maudzung/YOLO3D-YOLOv4-PyTorch

    YOLO3D: End-to-end real-time 3D Oriented Object Bounding Box Detection from LiDAR Point Cloud (ECCV 2018)

    Language: Python
  • DeNA/HandyRL

    HandyRL is a handy and simple framework based on Python and PyTorch for distributed reinforcement learning that is applicable to your own environments.

    Language: Python
  • HMUNACHI/nanodl

    A Jax-based library for designing and training transformer models from scratch.

    Language: Python
  • PKU-DAIR/Hetu

    A high-performance distributed deep learning system targeting large-scale and automated distributed training.

    Language: Python
  • alibaba/EasyParallelLibrary

    Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training.

    Language: Python
  • awslabs/deeplearning-cfn

    Distributed Deep Learning on AWS Using CloudFormation (CFN), MXNet and TensorFlow

    Language: Python
  • dougsouza/pytorch-sync-batchnorm-example

    How to use Cross-Replica / Synchronized BatchNorm in PyTorch
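    For reference, in current PyTorch that conversion is a single call; the toy model below is an arbitrary example, not code from the repository:

    ```python
    import torch

    model = torch.nn.Sequential(
        torch.nn.Conv2d(3, 8, kernel_size=3),
        torch.nn.BatchNorm2d(8),
        torch.nn.ReLU(),
    )
    # Recursively swaps every BatchNorm* layer for SyncBatchNorm, which reduces
    # batch statistics across all processes in the group during training.
    sync_model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    ```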

  • synxlin/deep-gradient-compression

    [ICLR 2018] Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training

    Language: Python
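    The core of the technique is easy to sketch: communicate only the top-k gradient entries by magnitude and accumulate the rest locally as a residual for later steps. This NumPy sketch (the function name and ratio are my own) shows the compression step only, without the momentum correction and other tricks from the paper:

    ```python
    import numpy as np

    def topk_sparsify(grad, ratio=0.01):
        """Keep only the largest-magnitude `ratio` fraction of gradient entries.

        Returns (indices, values) -- the payload actually communicated -- plus
        the residual that is accumulated locally and added to the next step.
        """
        flat = grad.ravel()
        k = max(1, int(flat.size * ratio))
        idx = np.argpartition(np.abs(flat), -k)[-k:]
        values = flat[idx]
        residual = flat.copy()
        residual[idx] = 0.0  # unsent entries stay local instead of being dropped
        return idx, values, residual.reshape(grad.shape)

    g = np.array([[0.1, -2.0], [0.05, 3.0]])
    idx, vals, res = topk_sparsify(g, ratio=0.5)  # keeps 2 of 4 entries
    ```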
  • foundation-model-stack/fms-fsdp

    🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash attention v2.

    Language: Python
  • wenwei202/terngrad

    Ternary Gradients to Reduce Communication in Distributed Deep Learning (TensorFlow)

    Language: Python
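    The ternarization itself is compact enough to sketch in NumPy (the function name and fixed seed are assumptions for reproducibility): each gradient entry is stochastically mapped to {-s, 0, +s} with s = max|g|, with probabilities chosen so the quantized gradient is unbiased.

    ```python
    import numpy as np

    def ternarize(grad, rng=None):
        """TernGrad-style stochastic ternarization: each entry becomes
        s * sign(g) with probability |g| / s, else 0, where s = max|g|.
        Unbiased: E[ternarize(g)] = s * sign(g) * |g| / s = g."""
        rng = rng or np.random.default_rng(0)
        s = np.abs(grad).max()
        if s == 0:
            return grad.copy()
        keep = rng.random(grad.shape) < np.abs(grad) / s
        return s * np.sign(grad) * keep

    g = np.array([0.5, -1.0, 0.25, 1.0])
    t = ternarize(g)  # every entry is now -1.0, 0.0, or 1.0
    ```

    Only s (one scalar) and two bits per entry need to be communicated, which is where the bandwidth reduction comes from.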
  • chairc/Integrated-Design-Diffusion-Model

    IDDM (industrial, landscape, animated, spectrogram...): a PyTorch implementation of diffusion and generative models, supporting DDPM, DDIM, PLMS, a web UI, and distributed training.

    Language: Python
  • ZJU-OpenKS/OpenKS

    OpenKS - a knowledge learning and computing platform generalizable across domains

    Language: Python