pineleen's Stars
yunjey/pytorch-tutorial
PyTorch Tutorial for Deep Learning Researchers
Dao-AILab/flash-attention
Fast and memory-efficient exact attention
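FlashAttention produces the same output as standard scaled dot-product attention, just faster and with less memory. As a point of reference, a minimal pure-Python sketch of the exact attention it computes (the naive formulation, not the tiled FlashAttention algorithm; toy matrices are made up for illustration):

```python
import math

def softmax(row):
    # numerically stable softmax over one row of scores
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    # exact scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = len(Q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in Q:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights = softmax(scores)
        # each output row is a weight-averaged combination of the value rows
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# hypothetical 2-token, 2-dim example
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))
```

FlashAttention's contribution is computing this same result in tiles that fit in GPU SRAM, avoiding materializing the full attention matrix.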
sustcsonglin/flash-linear-attention
Efficient implementations of state-of-the-art linear attention models in PyTorch and Triton
xdit-project/xDiT
xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) on multi-GPU Clusters
apple/ml-stable-diffusion
Stable Diffusion with Core ML on Apple Silicon
bergkamp/video-comparison-player
🎦 Video comparison player for Mac and Windows, built using Electron
NVIDIA/nccl-tests
NCCL Tests
tinygrad/open-gpu-kernel-modules
NVIDIA Linux open GPU kernel modules with P2P support
kvcache-ai/Mooncake
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
dottxt-ai/outlines
Structured Text Generation
flashinfer-ai/flashinfer
FlashInfer: Kernel Library for LLM Serving
NVIDIA/cutlass
CUDA Templates for Linear Algebra Subroutines
huggingface/peft
🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
xorbitsai/inference
Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference lets you run inference with any open-source language, speech-recognition, or multimodal model, whether in the cloud, on-premises, or on your laptop.
bytedance/ByteMLPerf
AI Accelerator Benchmark focuses on evaluating AI Accelerators from a practical production perspective, including the ease of use and versatility of software and hardware.
mlcommons/inference
Reference implementations of MLPerf™ inference benchmarks
Lyken17/pytorch-OpCounter
Count the MACs / FLOPs of your PyTorch model.
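pytorch-OpCounter derives these counts automatically from a model graph; as a sanity check, the MAC count of a standard 2D convolution or linear layer can also be computed by hand (layer sizes below are a hypothetical example, not from the tool):

```python
def conv2d_macs(c_in, c_out, k_h, k_w, h_out, w_out):
    # each output element requires c_in * k_h * k_w multiply-accumulates
    return c_out * h_out * w_out * c_in * k_h * k_w

def linear_macs(in_features, out_features):
    # a fully connected layer performs one MAC per weight
    return in_features * out_features

# e.g. a ResNet-style stem conv: 3 -> 64 channels, 7x7 kernel, 112x112 output
macs = conv2d_macs(3, 64, 7, 7, 112, 112)
print(macs)      # MAC count
print(2 * macs)  # FLOPs, if multiply and add are counted separately
```

Note that "FLOPs" conventions differ: some tools report MACs, others report 2x MACs, so compare numbers from different counters carefully.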
flexflow/FlexFlow
FlexFlow Serve: Low-Latency, High-Performance LLM Serving
Liu-xiandong/How_to_optimize_in_GPU
A series of GPU optimization topics explaining in detail how to optimize CUDA kernels, covering several basic kernels including elementwise, reduce, SGEMV, and SGEMM. The performance of these kernels is at or near the theoretical limit.
langgenius/dify
Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
NVIDIA/cudnn-frontend
cudnn_frontend provides a C++ wrapper for the cuDNN backend API, along with samples showing how to use it
ollama/ollama
Get up and running with Llama 3.1, Mistral, Gemma 2, and other large language models.
openai/simple-evals
0voice/learning_mind_map
2021 mind-map collection covering C/C++, Golang, Linux, cloud native, databases, DPDK, audio/video development, TCP/IP, data structures, computer architecture, and more
Mozilla-Ocho/llamafile
Distribute and run LLMs with a single file.
karpathy/llm.c
LLM training in simple, raw C/CUDA
ztxz16/fastllm
A pure C++ cross-platform LLM acceleration library with Python bindings; ChatGLM-6B-class models reach 10000+ tokens/s on a single GPU. Supports GLM, LLaMA, and MOSS base models and runs smoothly on mobile devices.
mlc-ai/mlc-llm
Universal LLM Deployment Engine with ML Compilation
megvii-research/Sparsebit
A model compression and acceleration toolbox based on PyTorch.
DefTruth/Awesome-LLM-Inference
📖A curated list of Awesome LLM Inference Paper with codes, TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention etc.