distributed-training

There are 145 repositories under the distributed-training topic.

  • GokuMohandas/Made-With-ML

    Learn how to design, develop, deploy and iterate on production-grade ML applications.

    Language: Jupyter Notebook 36.1k1.2k675.8k
  • huggingface/pytorch-image-models

    The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNet-V3/V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more

    Language: Python 30.2k3098734.6k
  • PaddlePaddle/Paddle

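    PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of PaddlePaddle, 飞桨: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)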

    Language: C++ 21.7k72118k5.5k
  • PaddlePaddle/PaddleNLP

    👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting a wide range of NLP tasks from research to industrial applications, including 🗂 Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis, etc.

    Language: Python 11.6k1033.4k2.8k
  • skypilot-org/skypilot

    SkyPilot: Run LLMs, AI, and Batch jobs on any cloud. Get maximum savings, highest GPU availability, and managed execution—all with a simple interface.

    Language: Python 5.8k661.6k403
  • FedML-AI/FedML

    FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.

    Language: Python 4.1k114323768
  • IDEA-CCNL/Fengshenbang-LM

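    Fengshenbang-LM (封神榜大模型) is an open-source large-model ecosystem led by the Cognitive Computing and Natural Language Research Center of the IDEA Research Institute, intended as infrastructure for Chinese AIGC and cognitive intelligence.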

    Language: Python 3.9k56289365
  • bytedance/byteps

    A high performance and generic framework for distributed DNN training

    Language: Python 3.6k85267483
  • tensorflow/adanet

    Fast and flexible AutoML with learning guarantees.

    Language: Jupyter Notebook 3.5k173114531
  • alpa-projects/alpa

    Training and serving large-scale neural networks with auto parallelization.

    Language: Python 3k45295344
  • determined-ai/determined

    Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.

    Language: Go 2.9k81368346
  • learning-at-home/hivemind

    Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world.

    Language: Python 1.8k57160138
  • tensorlayer/HyperPose

    Library for Fast and Flexible Human Pose Estimation

    Language: Python 1.2k58184275
  • intelligent-machine-learning/dlrover

    DLRover: An Automatic Distributed Deep Learning System

    Language: Python 1k49211130
  • DeepRec-AI/DeepRec

    DeepRec is a high-performance recommendation deep learning framework based on TensorFlow. It is hosted in incubation at the LF AI & Data Foundation.

    Language: C++ 98033115340
  • mryab/efficient-dl-systems

    Efficient Deep Learning Systems course materials (HSE, YSDA)

    Language: Jupyter Notebook 59213495
  • alibaba/Megatron-LLaMA

    Best practice for training LLaMA models in Megatron-LM

    Language: Python 55265550
  • Guitaricet/relora

    Official code for ReLoRA from the paper Stack More Layers Differently: High-Rank Training Through Low-Rank Updates

    Language: Jupyter Notebook 40981534
  • petuum/adaptdl

    Resource-adaptive cluster scheduler for deep learning training.

    Language: Python 407115774
  • Oneflow-Inc/libai

    LiBai(李白): A Toolbox for Large-Scale Distributed Parallel Training

    Language: Python 377437955
  • DataCanvasIO/HyperGBM

    A full pipeline AutoML tool for tabular data

    Language: Python 324155445
  • pytorch/torchx

    TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.

    Language: Python 3041917997
  • lsds/KungFu

    Fast and Adaptive Distributed Machine Learning for TensorFlow, PyTorch and MindSpore.

    Language: Go 289234458
  • maudzung/YOLO3D-YOLOv4-PyTorch

    YOLO3D: End-to-end real-time 3D Oriented Object Bounding Box Detection from LiDAR Point Cloud (ECCV 2018)

    Language: Python 28611544
  • DeNA/HandyRL

    HandyRL is a handy and simple framework based on Python and PyTorch for distributed reinforcement learning that is applicable to your own environments.

    Language: Python 282132341
  • HMUNACHI/nanodl

    A Jax-based library for designing and training transformer models from scratch.

    Language: Python 2638911
  • awslabs/deeplearning-cfn

    Distributed Deep Learning on AWS Using CloudFormation (CFN), MXNet and TensorFlow

    Language: Python 2563912115
  • alibaba/EasyParallelLibrary

    Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training.

    Language: Python 25313949
  • dougsouza/pytorch-sync-batchnorm-example

    How to use Cross Replica / Synchronized Batchnorm in PyTorch

  • PKU-DAIR/Hetu

    A high-performance distributed deep learning system targeting large-scale and automated distributed training.

    Language: Python 2357027
  • synxlin/deep-gradient-compression

    [ICLR 2018] Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training

    Language: Python 2067443
  • deepglint/unicom

    universal visual model trained on LAION-400M

    Language: Python 20092115
  • wenwei202/terngrad

    Ternary Gradients to Reduce Communication in Distributed Deep Learning (TensorFlow)

    Language: Python 180111448
  • ZJU-OpenKS/OpenKS

    OpenKS - a domain-generalizable knowledge learning and computing engine

    Language: Python 1554967
  • PaddlePaddle/PLSC

    Paddle Large Scale Classification Tools. Supports ArcFace, CosFace, PartialFC, and Data Parallel + Model Parallel. Models include ResNet, ViT, Swin, DeiT, CaiT, FaceViT, MoCo, MAE, ConvMAE, and CAE.

    Language: Python 144214333
  • huggingface/chug

    Minimal sharded dataset loaders, decoders, and utils for multi-modal document, image, and text datasets.

    Language: Python 1341039