zhurou603's Stars
pytorch/pytorch
Tensors and Dynamic neural networks in Python with strong GPU acceleration
NVIDIA/NeMo
A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal AI, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
NVIDIA/Megatron-LM
Ongoing research training transformer models at scale
AccumulateMore/CV
✔ (Completed) The most comprehensive deep learning notes, covering Tudui's PyTorch tutorials, Mu Li's "Dive into Deep Learning", and Andrew Ng's deep learning course
pytorch-labs/gpt-fast
Simple and efficient PyTorch-native transformer text generation in <1000 lines of Python.
daquexian/onnx-simplifier
Simplify your ONNX model
DefTruth/Awesome-LLM-Inference
📖 A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, continuous batching, FlashAttention, PagedAttention, etc.
AlexanderZhou01/China-software-copyright
Template documents for Chinese software copyright applications
NVIDIA/TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
microsoft/Megatron-DeepSpeed
Ongoing research training transformer language models at scale, including: BERT & GPT-2
intelligent-machine-learning/dlrover
DLRover: An Automatic Distributed Deep Learning System
NVIDIA/cccl
CUDA Core Compute Libraries
huggingface/nanotron
Minimalistic large language model 3D-parallelism training
ECNU-ICALK/EduChat
An open-source educational chat model from ICALK, East China Normal University. An open-source Chinese-English educational dialogue LLM (general-purpose base model, GPU deployment, data cleaning). With thanks to: LLaMA, MOSS, BELLE, Ziya, vLLM
volcengine/veScale
A PyTorch Native LLM Training Framework
tspeterkim/flash-attention-minimal
Flash Attention in ~100 lines of CUDA (forward pass only)
DefTruth/CUDA-Learn-Note
🎉 CUDA notes / hand-written CUDA kernels for LLMs / C++ notes, updated irregularly: flash_attn, sgemm, sgemv, warp reduce, block reduce, dot product, elementwise, softmax, layernorm, rmsnorm, hist, etc.
feifeibear/long-context-attention
USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long-Context Transformer Model Training and Inference
BBuf/how-to-learn-deep-learning-framework
How to learn PyTorch and OneFlow
zhangyachen/ComputerArchitectureAndCppBooks
📚 A collection of computer architecture and C++ books (continuously updated)
hahnyuan/LLM-Viewer
Analyze the inference of Large Language Models (LLMs): computation, storage, transmission, and the hardware roofline model, in a user-friendly interface.
nihui/ruapu
Detect CPU features with a single file
Yinghan-Li/YHs_Sample
Yinghan's Code Sample
RussWong/CUDATutorial
A CUDA tutorial that teaches CUDA programming from scratch
CalvinXKY/BasicCUDA
A tutorial for CUDA & PyTorch
njuhope/cuda_sgemm
feifeibear/LLMRoofline
Compare different hardware platforms via the Roofline Model for LLM inference tasks.
elithnever/distributedtechshare
Tracking distributed systems technology
hzwer/brief_paper_reading
My paper-reading notes and insights
BBuf/How_to_optimize_in_GPU
A series of GPU optimization topics explaining in detail how to optimize CUDA kernels, covering several basic kernel optimizations, including elementwise, reduce, sgemv, sgemm, etc. The performance of these kernels is at or near the theoretical limit.