Hongbosherlock's Stars
rasbt/LLMs-from-scratch
Implementing a ChatGPT-like LLM in PyTorch from scratch, step by step
wdndev/llm_interview_note
Notes on knowledge and interview questions for large language model (LLM) algorithm/application engineers
mlabonne/llm-course
Course to get into Large Language Models (LLMs) with roadmaps and Colab notebooks.
HuaizhengZhang/AI-System-School
🚀 AI System Papers and Industry Practice. ⚡️ System for Machine Learning, LLM (Large Language Model), GenAI (Generative AI). 🍻 OSDI, NSDI, SIGCOMM, SoCC, MLSys, etc. 🗃️ Llama3, Mistral, etc. 🧑‍💻 Video Tutorials.
FMInference/H2O
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.
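H2O's core idea is that a small set of "heavy hitter" tokens accumulates most of the attention mass, so the KV cache can be shrunk by keeping only those tokens plus a recent window. A minimal NumPy sketch of that selection policy (the function name and the simple top-k rule are illustrative assumptions, not the paper's exact algorithm):

```python
import numpy as np

def h2o_keep_indices(attn_weights: np.ndarray, budget: int, recent: int) -> np.ndarray:
    """Choose which KV-cache positions to keep, H2O-style:
    the `recent` most recent tokens plus the highest
    accumulated-attention ("heavy hitter") keys, up to `budget` total.
    attn_weights: (num_queries, seq_len) attention matrix."""
    seq_len = attn_weights.shape[1]
    if budget >= seq_len:
        return np.arange(seq_len)
    scores = attn_weights.sum(axis=0)            # accumulated attention per key
    scores[seq_len - recent:] = np.inf           # always keep the recent window
    keep = np.argsort(scores)[-budget:]          # top-`budget` keys overall
    return np.sort(keep)
```

Everything outside the kept indices is evicted from the cache, bounding memory regardless of context length.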
miss-mumu/developer2gwy
A civil service exam guide, from beginner to passing: a practical tutorial for programmers taking the exam
vllm-project/llm-compressor
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
pytorch/FBGEMM
FB (Facebook) + GEMM (General Matrix-Matrix Multiplication) - https://code.fb.com/ml-applications/fbgemm/
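The pattern FBGEMM accelerates is low-precision GEMM: multiply int8 operands, accumulate in int32, then dequantize with the operands' scales. A NumPy sketch of that pattern under simple per-tensor-scale assumptions (the function name and scale layout are illustrative, not FBGEMM's API):

```python
import numpy as np

def int8_gemm(A_q: np.ndarray, B_q: np.ndarray,
              A_scale: float, B_scale: float) -> np.ndarray:
    """int8 x int8 matrix multiply with int32 accumulation,
    dequantized back to float32 via per-tensor scales."""
    acc = A_q.astype(np.int32) @ B_q.astype(np.int32)   # int32 accumulation avoids overflow
    return acc.astype(np.float32) * (A_scale * B_scale)
```

Real kernels fuse the dequantization (and often per-channel scales and zero points) into the epilogue rather than materializing the int32 result.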
RussWong/CUDATutorial
A CUDA tutorial to help people learn CUDA programming from scratch
mit-han-lab/qserve
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
microsoft/BitBLAS
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
HazyResearch/ThunderKittens
Tile primitives for speedy kernels
bytedance/decoupleQ
A quantization algorithm for LLM
SqueezeAILab/KVQuant
[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
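The basic operation behind KV-cache quantization is mapping the cached float tensors to low-bit integers with per-channel scales, then dequantizing on read. A simplified symmetric-quantization sketch of that idea (KVQuant's actual method is more involved, e.g. per-channel keys, per-token values, and non-uniform levels; this is only the generic pattern):

```python
import numpy as np

def quantize_kv(cache: np.ndarray, n_bits: int = 4):
    """Per-channel symmetric quantization of a (seq_len, n_channels)
    KV-cache tensor to n_bits integers."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(cache).max(axis=0, keepdims=True) / qmax  # one scale per channel
    scale = np.where(scale == 0, 1.0, scale)                 # avoid divide-by-zero
    q = np.clip(np.round(cache / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize_kv(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale
```

At 4 bits this cuts KV-cache memory roughly 4x versus fp16, which is what makes multi-million-token contexts feasible.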
karpathy/llm.c
LLM training in simple, raw C/CUDA
ray-project/ray
Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
AIoT-MLSys-Lab/Efficient-LLMs-Survey
[TMLR 2024] Efficient Large Language Models: A Survey
huggingface/optimum-quanto
A pytorch quantization backend for optimum
ModelTC/llmc
[EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".
ollama/ollama
Get up and running with Llama 3.2, Mistral, Gemma 2, and other large language models.
keith2018/TinyGPT
Tiny C++11 GPT-2 inference implementation from scratch
DefTruth/Awesome-LLM-Inference
📖A curated list of Awesome LLM Inference Paper with codes, TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention etc.
AniZpZ/AutoSmoothQuant
An easy-to-use package for implementing SmoothQuant for LLMs
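SmoothQuant's trick is to divide each activation channel by a scale s and fold s into the corresponding weight rows, leaving X @ W mathematically unchanged while shrinking activation outliers so both sides quantize well. A NumPy sketch of the smoothing step (function name is illustrative; the scale formula follows the paper's s_j = max|X_j|^α / max|W_j|^(1-α)):

```python
import numpy as np

def smooth(X: np.ndarray, W: np.ndarray, alpha: float = 0.5):
    """Migrate quantization difficulty from activations to weights:
    return (X / s, s * W) so that (X / s) @ (s * W) == X @ W.
    X: (tokens, in_features), W: (in_features, out_features)."""
    act_max = np.abs(X).max(axis=0)                    # per input channel
    w_max = np.abs(W).max(axis=1)                      # per weight row
    s = act_max ** alpha / np.maximum(w_max, 1e-8) ** (1 - alpha)
    s = np.maximum(s, 1e-8)
    return X / s, W * s[:, None]
```

In practice act_max is calibrated offline on sample data, and the folded weights are then quantized once, so smoothing adds no inference-time cost.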
yihong0618/bilingual_book_maker
Make bilingual EPUB books using AI translation
microsoft/TransformerCompression
For releasing code related to compression methods for transformers, accompanying our publications
gpu-mode/resource-stream
GPU programming related news and material links
all-in-aigc/aicover
AI cover generator
MetaCubeX/mihomo
A rule-based network tunnel written in Go, formerly known as Clash Meta
ml-explore/mlx-examples
Examples in the MLX framework
dvmazur/mixtral-offloading
Run Mixtral-8x7B models in Colab or consumer desktops