limin2021's Stars
opencv/opencv
Open Source Computer Vision Library
imarvinle/awesome-cs-books
🔥 A comprehensive collection of classic programming books, covering: computer systems and networking, system architecture, algorithms and data structures, front-end development, back-end development, mobile development, databases, testing, projects and teams, professional growth for programmers, job hunting and interviews, and more
alibaba/MNN
MNN is a blazing fast, lightweight deep learning framework, battle-tested by business-critical use cases in Alibaba
huggingface/accelerate
🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
CVCUDA/CV-CUDA
CV-CUDA™ is an open-source, GPU-accelerated library for cloud-scale image processing and computer vision.
laekov/fastmoe
A fast MoE (Mixture of Experts) implementation for PyTorch
intelligent-machine-learning/dlrover
DLRover: An Automatic Distributed Deep Learning System
DefTruth/CUDA-Learn-Notes
🎉 CUDA notes / hand-written CUDA kernels for large models / C++ notes, updated occasionally: flash_attn, sgemm, sgemv, warp reduce, block reduce, dot product, elementwise, softmax, layernorm, rmsnorm, hist, etc.
NVIDIA/nccl-tests
NCCL Tests
andravin/wincnn
Winograd minimal convolution algorithm generator for convolutional neural networks.
lhao499/ringattention
Transformers with Arbitrarily Large Context
volcengine/veScale
A PyTorch Native LLM Training Framework
NVIDIA/AMGX
Distributed multigrid linear solver library on GPU
bytedance/ByteTransformer
Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052
zhuzilin/ring-flash-attention
Ring attention implementation with flash attention
codeplaysoftware/portBLAS
An implementation of BLAS using the SYCL open standard.
InternLM/InternEvo
sail-sg/zero-bubble-pipeline-parallelism
Zero Bubble Pipeline Parallelism
feifeibear/long-context-attention
Sequence Parallel Attention for Long Context LLM Model Training and Inference
mit-han-lab/inter-operator-scheduler
[MLSys 2021] IOS: Inter-Operator Scheduler for CNN Acceleration
rsnemmen/OpenCL-examples
Simple OpenCL examples for exploiting GPU computing
RulinShao/LightSeq
Official repository for LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers
FlagOpen/FlagGems
FlagGems is an operator library for large language models implemented in Triton Language.
codeplaysoftware/portDNN
portDNN is a library implementing neural network algorithms, written using SYCL
anyscale/llm-continuous-batching-benchmarks
icl-utk-edu/blaspp
BLAS++ is a C++ wrapper around CPU and GPU BLAS (Basic Linear Algebra Subprograms), developed as part of the SLATE project.
exists-forall/striped_attention
UDC-GAC/venom
A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores
lzhangbv/dear_pytorch
[ICDCS 2023] DeAR: Accelerating Distributed Deep Learning with Fine-Grained All-Reduce Pipelining
kjbartel/clmagma
OpenCL version of Matrix Algebra on GPU and Multicore Architectures (MAGMA) source releases from http://icl.cs.utk.edu/magma/index.html