sjjeong94's Stars
cloneofsimo/minRF
Minimal implementation of scalable rectified flow transformers, based on SD3's approach
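For context, the heart of a rectified-flow trainer like this is a one-line objective: interpolate between data and noise along a straight line and regress the model onto the constant velocity. A minimal sketch, assuming a `model(xt, t)` signature that is illustrative rather than minRF's actual interface:

```python
import torch

def rectified_flow_loss(model, x0):
    # x0: a batch of clean data; x1: pure Gaussian noise.
    x1 = torch.randn_like(x0)
    # One timestep per sample, broadcast over the remaining dims.
    t = torch.rand(x0.shape[0], device=x0.device)
    t_b = t.view(-1, *([1] * (x0.dim() - 1)))
    # Straight-line interpolation between data and noise.
    xt = (1.0 - t_b) * x0 + t_b * x1
    # Along a straight path the target velocity is constant: x1 - x0.
    v_pred = model(xt, t)
    return torch.mean((v_pred - (x1 - x0)) ** 2)
```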
NVIDIA/cutlass
CUDA Templates for Linear Algebra Subroutines
facebookresearch/DiT
Official PyTorch Implementation of "Scalable Diffusion Models with Transformers"
NVIDIA/nccl
Optimized primitives for collective multi-GPU communication
tspeterkim/paged-attention-minimal
A minimal cache manager for PagedAttention, on top of Llama 3.
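The idea the repo demonstrates, sketched here in plain Python (block size, pool layout, and helper names are illustrative assumptions, not the repo's code): the KV cache is carved into fixed-size physical blocks, and each sequence keeps a block table mapping logical positions to blocks, so memory is allocated on demand.

```python
import numpy as np

BLOCK_SIZE, NUM_BLOCKS, HEAD_DIM = 16, 64, 8
kv_pool = np.zeros((NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM), dtype=np.float32)
free_blocks = list(range(NUM_BLOCKS))   # physical blocks not yet in use
block_tables = {}                       # seq_id -> list of physical block ids
seq_lens = {}                           # seq_id -> tokens cached so far

def append_kv(seq_id, kv_vec):
    """Append one token's KV vector, allocating a physical block on demand."""
    table = block_tables.setdefault(seq_id, [])
    pos = seq_lens.get(seq_id, 0)
    if pos % BLOCK_SIZE == 0:           # previous block full (or first token)
        table.append(free_blocks.pop())
    kv_pool[table[pos // BLOCK_SIZE], pos % BLOCK_SIZE] = kv_vec
    seq_lens[seq_id] = pos + 1
```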
karpathy/LLM101n
LLM101n: Let's build a Storyteller
karpathy/build-nanogpt
Video+code lecture on building nanoGPT from scratch
naklecha/llama3-from-scratch
llama3 implementation one matrix multiplication at a time
tspeterkim/mixed-precision-from-scratch
Mixed precision training from scratch with Tensors and CUDA
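The core trick such a from-scratch build covers is loss scaling: run forward/backward in FP16, scale the loss so tiny gradients don't underflow, then unscale and update FP32 master weights. A hedged PyTorch sketch of that loop, not the repo's code:

```python
import torch
import torch.nn.functional as F

SCALE = 2.0 ** 16   # loss scale keeps small FP16 gradients above underflow

def train_step(model_fp16, master_params_fp32, x, y, lr=1e-3):
    loss = F.mse_loss(model_fp16(x.half()), y.half())
    (loss * SCALE).backward()                     # scaled backward pass in FP16
    with torch.no_grad():
        for p16, p32 in zip(model_fp16.parameters(), master_params_fp32):
            grad = p16.grad.float() / SCALE       # unscale in FP32
            p32 -= lr * grad                      # FP32 master-weight update
            p16.copy_(p32.half())                 # cast back down to FP16
            p16.grad = None
    return loss.item()
```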
microsoft/autogen
A programming framework for agentic AI. Discord: https://aka.ms/autogen-dc. Roadmap: https://aka.ms/autogen-roadmap
tspeterkim/flash-attention-minimal
Flash Attention in ~100 lines of CUDA (forward pass only)
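What those ~100 lines implement is tiled attention with an online softmax: keep a running max and running denominator so the full score row is never materialized. A NumPy sketch of the forward pass for a single query (illustrative, not the repo's CUDA):

```python
import numpy as np

def flash_attention_forward(q, k, v, tile=64):
    # q: (d,); k, v: (n, d). Single query row for clarity.
    m, l = -np.inf, 0.0                  # running max and softmax denominator
    acc = np.zeros(v.shape[1])
    for start in range(0, k.shape[0], tile):
        s = k[start:start + tile] @ q    # one tile of attention scores
        m_new = max(m, s.max())
        corr = np.exp(m - m_new)         # rescale previously accumulated state
        p = np.exp(s - m_new)
        l = l * corr + p.sum()
        acc = acc * corr + p @ v[start:start + tile]
        m = m_new
    return acc / l
```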
microsoft/DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
meta-llama/llama3
The official Meta Llama 3 GitHub site
bitsandbytes-foundation/bitsandbytes
Accessible large language models via k-bit quantization for PyTorch.
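As a toy picture of what k-bit quantization means here, an absmax INT8 round trip (a deliberately simplified scheme, not bitsandbytes' actual kernels):

```python
import numpy as np

def absmax_quantize_int8(w):
    scale = np.abs(w).max() / 127.0      # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = absmax_quantize_int8(w)
print(np.abs(w - dequantize(q, s)).max())   # small reconstruction error
```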
siboehm/SGEMM_CUDA
Fast CUDA matrix multiplication from scratch
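The writeup's central optimization is tiling: compute C in blocks so each tile of A and B is reused from fast memory. The same blocking idea, shown in NumPy rather than CUDA for brevity:

```python
import numpy as np

def blocked_matmul(a, b, tile=32):
    # Same cache-blocking idea a fast SGEMM kernel uses, in NumPy for clarity.
    m, k = a.shape
    _, n = b.shape
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):   # accumulate over the K dimension
                c[i:i + tile, j:j + tile] += (
                    a[i:i + tile, p:p + tile] @ b[p:p + tile, j:j + tile]
                )
    return c
```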
apache/tvm
Open deep learning compiler stack for CPU, GPU, and specialized accelerators
mlc-ai/mlc-llm
Universal LLM Deployment Engine with ML Compilation
karpathy/llm.c
LLM training in simple, raw C/CUDA
bayesian-optimization/BayesianOptimization
A Python implementation of global optimization with Gaussian processes.
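A short usage sketch against the package's documented pbounds/maximize interface (the objective function here is made up):

```python
from bayes_opt import BayesianOptimization

def objective(x, y):
    # Toy objective with its maximum at (2, 1).
    return -((x - 2) ** 2) - ((y - 1) ** 2)

optimizer = BayesianOptimization(
    f=objective,
    pbounds={"x": (-5, 5), "y": (-5, 5)},
    random_state=1,
)
optimizer.maximize(init_points=5, n_iter=25)
print(optimizer.max)   # best parameters and target found so far
```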
NVIDIA/TensorRT-LLM
TensorRT-LLM provides an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines containing state-of-the-art optimizations for efficient inference on NVIDIA GPUs. It also includes components to create Python and C++ runtimes that execute those engines.
facebookresearch/xformers
Hackable and optimized Transformers building blocks, supporting a composable construction.
nicksypark/rope-triton
Rotary Position Embedding (RoPE) implemented in Triton
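For reference, the transform such a kernel computes, in NumPy (rotate-half layout assumed; a Triton version would fuse this per block):

```python
import numpy as np

def rope(x, base=10000.0):
    # x: (seq_len, dim), dim even; pairs (x_i, x_{i + dim/2}) are rotated.
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)      # per-pair rotation frequencies
    angles = np.outer(np.arange(seq_len), freqs)   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```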
AGI-Edgerunners/LLM-Agents-Papers
A repo that lists papers related to LLM-based agents
pytorch-labs/gpt-fast
Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.
Dao-AILab/flash-attention
Fast and memory-efficient exact attention
forhaoliu/ringattention
Transformers with Arbitrarily Large Context
NVIDIA/TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including 8-bit floating point (FP8) precision on Hopper and Ada GPUs, delivering better performance with lower memory utilization in both training and inference.
unslothai/unsloth
Finetune Llama 3.1, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
jcpeterson/openwebtext
Open clone of OpenAI's unreleased WebText dataset scraper. This version uses pushshift.io files instead of the API for speed.