Pinned Repositories
AutoGPTQ
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
bitsandbytes
Accessible large language models via k-bit quantization for PyTorch (a minimal loading sketch follows the pinned list).
bjmsong.github.io
caffe
Caffe: a fast open framework for deep learning.
concurrency-in-python-with-asyncio
Code for the Manning book Concurrency in Python with Asyncio
cuda_programming
Code from the "CUDA Crash Course" YouTube series by CoffeeBeforeArch
fastllm
A pure-C++ cross-platform LLM acceleration library with Python bindings; ChatGLM-6B-class models reach 10,000+ tokens/s on a single GPU; supports GLM, LLaMA, and MOSS base models and runs smoothly on mobile devices.
hands-on-kernels
High Performance Kernels on GPU/CPU
mlperf_inference_vllm
xLLM
A lightweight Llama 2 inference framework. It can run llama2-7b inference at 166+ tokens/s on a single 4090.
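Several of the pinned repositories (AutoGPTQ, bitsandbytes) center on weight quantization for LLM inference. As a rough illustration of the k-bit loading path that bitsandbytes enables, here is a minimal sketch using its Hugging Face transformers integration; the model id is a placeholder, not something these repositories prescribe.

```python
# Minimal sketch: 8-bit quantized model loading via bitsandbytes'
# transformers integration. The model id below is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/opt-1.3b"  # placeholder; any supported causal LM works

# Quantize linear-layer weights to 8 bits at load time (LLM.int8()).
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place layers on the available GPU(s)
)

inputs = tokenizer("Quantization reduces memory by", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Under LLM.int8(), linear-layer weights are stored in 8 bits while outlier activations are handled in higher precision, which roughly halves weight memory relative to fp16 at a small accuracy cost.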
bjmsong's Repositories
bjmsong/hands-on-kernels
High Performance Kernels on GPU/CPU
bjmsong/xLLM
A lightweight Llama 2 inference framework. It can run llama2-7b inference at 166+ tokens/s on a single 4090.
bjmsong/fastllm
A pure-C++ cross-platform LLM acceleration library with Python bindings; ChatGLM-6B-class models reach 10,000+ tokens/s on a single GPU; supports GLM, LLaMA, and MOSS base models and runs smoothly on mobile devices.
bjmsong/mlperf_inference_vllm
bjmsong/AutoGPTQ
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
bjmsong/bitsandbytes
Accessible large language models via k-bit quantization for PyTorch.
bjmsong/bjmsong.github.io
bjmsong/caffe
Caffe: a fast open framework for deep learning.
bjmsong/concurrency-in-python-with-asyncio
Code for the Manning book Concurrency in Python with Asyncio
bjmsong/cuda_programming
Code from the "CUDA Crash Course" YouTube series by CoffeeBeforeArch
bjmsong/CUDALibrarySamples
CUDA Library Samples
bjmsong/cutlass
CUDA Templates for Linear Algebra Subroutines
bjmsong/DeepLearningFromScratch
E-book and companion code for "Deep Learning from Scratch: Theory and Implementation with Python".
bjmsong/DeepSpeedExamples
Example models using DeepSpeed
bjmsong/flash-attention
Fast and memory-efficient exact attention
bjmsong/flash_attention_inference
Performance benchmarks of the C++ interfaces of Flash Attention, Flash Attention v2, and self-quantized decoding attention in large language model (LLM) inference scenarios.
bjmsong/llama2.c
Inference Llama 2 in one file of pure C
bjmsong/flash-attention-minimal
Flash Attention in ~100 lines of CUDA (forward pass only)
bjmsong/foundation-model-stack
🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components.
bjmsong/GPTQ-triton
GPTQ inference Triton kernel
bjmsong/inference_results_v3.1
This repository contains the results and code for the MLPerf™ Inference v3.1 benchmark.
bjmsong/lectures
Material for cuda-mode lectures
bjmsong/libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
bjmsong/Liger-Kernel
Efficient Triton Kernels for LLM Training
bjmsong/llm.c
LLM training in simple, raw C/CUDA
bjmsong/micrograd
A tiny scalar-valued autograd engine and a neural net library on top of it with PyTorch-like API
bjmsong/nanoGPT
The simplest, fastest repository for training/finetuning medium-sized GPTs.
bjmsong/sglang
SGLang is a fast serving framework for large language models and vision language models.
bjmsong/triton
Development repository for the Triton language and compiler
bjmsong/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs (a minimal usage sketch follows below).
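Many of the repositories above (vllm, sglang, fastllm, xLLM) are LLM inference or serving engines. As a point of reference for that theme, here is a minimal offline-inference sketch against vLLM's Python API; the model id and prompts are placeholders.

```python
# Minimal sketch of offline batched inference with vLLM's Python API.
# The model id is a placeholder; vLLM fetches it from the Hugging Face Hub.
from vllm import LLM, SamplingParams

prompts = [
    "The key idea behind paged attention is",
    "Continuous batching improves throughput because",
]
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # placeholder model id
for output in llm.generate(prompts, sampling):
    print(output.prompt, "->", output.outputs[0].text)
```

vLLM batches these requests internally (continuous batching over a paged KV cache), which is what gives it high throughput relative to naive one-request-at-a-time generation.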