cli99
@black-forest-labs. ex-Databricks Mosaic AI. ex-Microsoft DeepSpeed. UIUC PhD. I build efficient AI training and inference systems with GPUs.
Black Forest Labs
Pinned Repositories
academic-kickstart
Cheng's Website
accelerate
🚀 A simple way to train and use PyTorch models with multi-GPU, TPU, and mixed-precision support
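A minimal sketch of the standard Accelerator training-loop pattern this library provides; the toy model, optimizer, and data below are placeholders for illustration, not anything from this profile.

import torch
from accelerate import Accelerator

accelerator = Accelerator()  # picks up multi-GPU / TPU / mixed-precision config from the launch environment
model = torch.nn.Linear(128, 10)                      # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataset = torch.utils.data.TensorDataset(torch.randn(256, 128), torch.randint(0, 10, (256,)))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)

# prepare() wraps model/optimizer/dataloader for the current distributed setup
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    accelerator.backward(loss)                        # replaces loss.backward()
    optimizer.step()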
acm-ccs
ACM Computing Classification System
bleve-search-react
cermine-docker
composer
Supercharge Your Model Training
flops-profiler
pytorch-profiler
hugo-cli99
llm-analysis
Latency and Memory Analysis of Transformer Models for Training and Inference
cli99's Repositories
cli99/llm-analysis
Latency and Memory Analysis of Transformer Models for Training and Inference
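The kind of back-of-envelope estimate this tool automates can be sketched in a few lines; the 6·N·D training-FLOPs rule, 2 bytes per fp16 parameter, and the KV-cache formula below are standard approximations, not numbers or code taken from the repo.

def train_flops(num_params: float, num_tokens: float) -> float:
    """Approximate training compute: ~6 FLOPs per parameter per token."""
    return 6.0 * num_params * num_tokens

def fp16_weight_memory_gb(num_params: float) -> float:
    """Weights in fp16/bf16 take ~2 bytes per parameter."""
    return 2.0 * num_params / 1e9

def kv_cache_memory_gb(batch, seq_len, num_layers, hidden_size, bytes_per_elem=2):
    """KV cache: K and V tensors per layer, each of shape [batch, seq_len, hidden_size]."""
    return 2 * batch * seq_len * num_layers * hidden_size * bytes_per_elem / 1e9

# Illustrative shapes: a 7B-parameter model, 2k context, batch size 8
print(fp16_weight_memory_gb(7e9))             # ~14 GB of weights
print(kv_cache_memory_gb(8, 2048, 32, 4096))  # ~8.6 GB of KV cache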
cli99/composer
Supercharge Your Model Training
cli99/attorch
A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.
cli99/AutoFP8
cli99/AutoGPTQ
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
cli99/cutlass
CUDA Templates for Linear Algebra Subroutines
cli99/DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.
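A hedged sketch of the deepspeed.initialize() entry point with a ZeRO stage-2 config; the model and config values are illustrative, and a real run would typically be launched with the deepspeed launcher.

import torch
import deepspeed

model = torch.nn.Linear(128, 10)                      # placeholder model
ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},                # shard optimizer states and gradients
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-3}},
}

# initialize() returns an engine that owns the optimizer, loss scaling, and communication
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

inputs = torch.randn(32, 128, device=model_engine.device).half()
targets = torch.randint(0, 10, (32,), device=model_engine.device)

loss = torch.nn.functional.cross_entropy(model_engine(inputs), targets)
model_engine.backward(loss)                           # engine-managed backward pass
model_engine.step()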
cli99/Diffusion-Models-pytorch
PyTorch implementation of Diffusion Models (https://arxiv.org/pdf/2006.11239.pdf)
cli99/KVQuant
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
cli99/llm-awq
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
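For context, a generic group-wise 4-bit weight quantizer is sketched below; this is not the AWQ algorithm itself, which additionally rescales salient weight channels using activation statistics before quantizing.

import torch

def quantize_int4_groupwise(w: torch.Tensor, group_size: int = 128):
    """Symmetric per-group int4 quantization of a [out_features, in_features] weight."""
    out_f, in_f = w.shape
    w_groups = w.reshape(out_f, in_f // group_size, group_size)
    scales = w_groups.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / 7.0  # int4 range [-8, 7]
    q = torch.clamp(torch.round(w_groups / scales), -8, 7)
    return q.to(torch.int8), scales  # real kernels pack two int4 values per byte

def dequantize(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    return (q.float() * scales).reshape(q.shape[0], -1)

w = torch.randn(4096, 4096)
q, s = quantize_int4_groupwise(w)
print((dequantize(q, s) - w).abs().mean())  # mean quantization error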
cli99/llm-compressor
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
cli99/llm-foundry
LLM training code for MosaicML foundation models
cli99/llm-scripts
cli99/marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups at medium batch sizes of up to 16-32 tokens.
cli99/megablocks
cli99/minRF
Minimal implementation of scalable rectified flow transformers, based on SD3's approach
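A minimal sketch of the rectified-flow training objective that this kind of repo implements; the model argument is a placeholder callable, not the repo's architecture.

import torch

def rectified_flow_loss(model, x0: torch.Tensor) -> torch.Tensor:
    """Regress the predicted velocity toward (x1 - x0) along straight noise-data paths."""
    x1 = torch.randn_like(x0)                               # noise endpoint
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)), device=x0.device)
    x_t = (1 - t) * x0 + t * x1                             # linear interpolation at time t
    v_target = x1 - x0                                      # constant velocity along the path
    return torch.nn.functional.mse_loss(model(x_t, t), v_target)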
cli99/nccl-tests
NCCL Tests
cli99/quant-matmul
cli99/sampleproject
A sample project that exists for PyPUG's "Tutorial on Packaging and Distributing Projects"
cli99/starter-academic
cli99/stk
cli99/superbenchmark
A validation and profiling tool for AI infrastructure
cli99/TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
cli99/theme-academic-cv
🎓 Easily create a beautiful academic résumé or educational website using Hugo and GitHub. No code.
cli99/torchtitan
A native PyTorch Library for large model training
cli99/transformer_framework
A framework for plug-and-play use of various transformers (vision and NLP) with FSDP
cli99/TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
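A sketch of the FP8 autocast pattern from TransformerEngine's documented PyTorch API; the layer size and recipe values are illustrative, and exact arguments may differ across versions.

import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

model = te.Linear(768, 768, bias=True).cuda()        # drop-in replacement for nn.Linear
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

inp = torch.randn(16, 768, device="cuda")
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(inp)                                 # GEMMs execute in FP8 with delayed scaling
out.sum().backward()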
cli99/transformers
🤗 Transformers: State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0.
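A minimal sketch of the AutoTokenizer/AutoModel loading and generation pattern; "gpt2" is just a small public checkpoint chosen for illustration.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Efficient transformer inference", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))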
cli99/VILA
VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops)
cli99/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
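A short sketch of vLLM's offline generation API; the model name and sampling parameters are illustrative.

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")                 # engine with paged KV-cache management
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["Explain KV-cache paging in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)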