Pinned Repositories
AutoGPTQ
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
bitsandbytes
Accessible large language models via k-bit quantization for PyTorch (a minimal loading sketch follows the pinned list).
bjmsong.github.io
caffe
Caffe: a fast open framework for deep learning.
concurrency-in-python-with-asyncio
Code for the Manning book Concurrency in Python with Asyncio
cuda_programming
Code from the "CUDA Crash Course" YouTube series by CoffeeBeforeArch
fastllm
A pure-C++ cross-platform LLM acceleration library with Python bindings; ChatGLM-6B-class models reach 10,000+ tokens/s on a single GPU; supports GLM, LLaMA, and MOSS base models and runs smoothly on mobile devices.
hands-on-kernels
High Performance Kernels on GPU/CPU
mlperf_inference_vllm
xLLM
A lightweight Llama 2 inference framework. It can run llama2-7b inference at 166+ tokens/s on a single 4090.
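Several of the pinned repositories (AutoGPTQ, bitsandbytes) center on weight quantization for LLM inference. As a rough illustration of the k-bit loading path that bitsandbytes enables, here is a minimal sketch using its Hugging Face transformers integration; the model id is a placeholder, not something these repositories prescribe.

```python
# Minimal sketch: 8-bit quantized model loading via bitsandbytes'
# transformers integration. The model id below is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/opt-1.3b"  # placeholder; any supported causal LM works

# Quantize linear-layer weights to 8 bits at load time (LLM.int8()).
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place layers on the available GPU(s)
)

inputs = tokenizer("Quantization reduces memory by", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Under LLM.int8(), linear-layer weights are stored in 8 bits while outlier activations are handled in higher precision, which roughly halves weight memory relative to fp16 at a small accuracy cost.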
bjmsong's Repositories
bjmsong/hands-on-kernels
High Performance Kernels on GPU/CPU
bjmsong/xLLM
A lightweight Llama 2 inference framework. It can run llama2-7b inference at 166+ tokens/s on a single 4090.
bjmsong/fastllm
A pure-C++ cross-platform LLM acceleration library with Python bindings; ChatGLM-6B-class models reach 10,000+ tokens/s on a single GPU; supports GLM, LLaMA, and MOSS base models and runs smoothly on mobile devices.
bjmsong/mlperf_inference_vllm
bjmsong/AutoGPTQ
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
bjmsong/bitsandbytes
Accessible large language models via k-bit quantization for PyTorch.
bjmsong/bjmsong.github.io
bjmsong/caffe
Caffe: a fast open framework for deep learning.
bjmsong/concurrency-in-python-with-asyncio
Code for the Manning book Concurrency in Python with Asyncio
bjmsong/cuda_programming
Code from the "CUDA Crash Course" YouTube series by CoffeeBeforeArch
bjmsong/CUDALibrarySamples
CUDA Library Samples
bjmsong/cutlass
CUDA Templates for Linear Algebra Subroutines
bjmsong/DeepLearningFromScratch
E-book and companion code for "Deep Learning from Scratch: Theory and Implementation with Python".
bjmsong/DeepSpeedExamples
Example models using DeepSpeed
bjmsong/flash-attention
Fast and memory-efficient exact attention
bjmsong/flash_attention_inference
Performance benchmarks of the C++ interfaces of Flash Attention, Flash Attention v2, and self-quantized decoding attention in large language model (LLM) inference scenarios.
bjmsong/llama2.c
Inference Llama 2 in one file of pure C
bjmsong/flash-attention-minimal
Flash Attention in ~100 lines of CUDA (forward pass only)
bjmsong/foundation-model-stack
🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components.
bjmsong/GPTQ-triton
GPTQ inference Triton kernel
bjmsong/inference_results_v3.1
This repository contains the results and code for the MLPerf™ Inference v3.1 benchmark.
bjmsong/lectures
Material for cuda-mode lectures
bjmsong/libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
bjmsong/Liger-Kernel
Efficient Triton Kernels for LLM Training
bjmsong/llm.c
LLM training in simple, raw C/CUDA
bjmsong/micrograd
A tiny scalar-valued autograd engine and a neural net library on top of it with PyTorch-like API
bjmsong/nanoGPT
The simplest, fastest repository for training/finetuning medium-sized GPTs.
bjmsong/sglang
SGLang is a fast serving framework for large language models and vision language models.
bjmsong/triton
Development repository for the Triton language and compiler
bjmsong/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs (a minimal usage sketch follows below).
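Many of the repositories above (vllm, sglang, fastllm, xLLM) are LLM inference or serving engines. As a point of reference for that theme, here is a minimal offline-inference sketch against vLLM's Python API; the model id and prompts are placeholders.

```python
# Minimal sketch of offline batched inference with vLLM's Python API.
# The model id is a placeholder; vLLM fetches it from the Hugging Face Hub.
from vllm import LLM, SamplingParams

prompts = [
    "The key idea behind paged attention is",
    "Continuous batching improves throughput because",
]
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # placeholder model id
for output in llm.generate(prompts, sampling):
    print(output.prompt, "->", output.outputs[0].text)
```

vLLM batches these requests internally (continuous batching over a paged KV cache), which is what gives it high throughput relative to naive one-request-at-a-time generation.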