byungsoo-oh's Stars
Dao-AILab/flash-attention
Fast and memory-efficient exact attention
microsoft/LoRA
Code for loralib, an implementation of "LoRA: Low-Rank Adaptation of Large Language Models"
vosen/ZLUDA
CUDA on non-NVIDIA GPUs
SJTU-IPADS/PowerInfer
High-speed Large Language Model Serving on PCs with Consumer-grade GPUs
sgl-project/sglang
SGLang is a fast serving framework for large language models and vision language models.
NVIDIA/cutlass
CUDA Templates for Linear Algebra Subroutines
mosaicml/composer
Supercharge Your Model Training
mosaicml/llm-foundry
LLM training code for Databricks foundation models
cybertronai/gradient-checkpointing
Make huge neural nets fit in memory
google-deepmind/gemma
Open weights LLM from Google DeepMind.
databricks/dbrx
Code examples and resources for DBRX, a large language model developed by Databricks
HazyResearch/ThunderKittens
Tile primitives for speedy kernels
NUS-HPC-AI-Lab/OpenDiT
OpenDiT: An Easy, Fast and Memory-Efficient System for DiT Training and Inference
RahulSChand/gpu_poor
Calculate token/s & GPU memory requirement for any LLM. Supports llama.cpp/ggml/bnb/QLoRA quantization
mosaicml/streaming
A Data Streaming Library for Efficient Neural Network Training
myshell-ai/JetMoE
Reaching LLaMA2 Performance with 0.1M Dollars
pjlab-sys4nlp/llama-moe
⛷️ LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training (EMNLP 2024)
volcengine/veScale
A PyTorch Native LLM Training Framework
alibaba/Megatron-LLaMA
Best practice for training LLaMA models in Megatron-LM
forhaoliu/ringattention
Transformers with Arbitrarily Large Context
LLMServe/DistServe
Disaggregated serving system for Large Language Models (LLMs).
sail-sg/zero-bubble-pipeline-parallelism
Zero Bubble Pipeline Parallelism
microsoft/mscclpp
MSCCL++: A GPU-driven communication stack for scalable AI applications
UpstageAI/evalverse
The Universe of Evaluation. All about evaluation for LLMs.
efeslab/fiddler
Fast Inference of MoE Models with CPU-GPU Orchestration
microsoft/ParrotServe
[OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable
TorchMoE/MoE-Infinity
PyTorch library for cost-effective, fast and easy serving of MoE models.
parasailteam/coconet
mental2008/awesome-papers
Here are my personal paper reading notes (covering cloud computing, resource management, systems, machine learning, deep learning, and other interesting topics).
raymin0223/fast_robust_early_exit
Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding (EMNLP 2023 Long)