Oliver-ss's Stars
mit-han-lab/qserve
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
DefTruth/Awesome-LLM-Inference
📖 A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, continuous batching, FlashAttention, PagedAttention, etc.
pytorch-labs/gpt-fast
Simple and efficient PyTorch-native transformer text generation in <1000 lines of Python.
Codium-ai/pr-agent
🚀CodiumAI PR-Agent: An AI-Powered 🤖 Tool for Automated Pull Request Analysis, Feedback, Suggestions and More! 💻🔍
NVIDIA/TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
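As a rough illustration of the high-level Python API mentioned above, a minimal sketch using TensorRT-LLM's LLM API as it appears in recent releases; the model id is an illustrative placeholder, and engine building happens under the hood on first use:

```python
# Hedged sketch of TensorRT-LLM's high-level LLM API (recent releases).
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder model id
params = SamplingParams(temperature=0.8, max_tokens=32)

# generate() builds a TensorRT engine on first use, then runs inference on it.
for output in llm.generate(["TensorRT engines are"], params):
    print(output.outputs[0].text)
```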
Dao-AILab/flash-attention
Fast and memory-efficient exact attention
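For reference, a minimal sketch of the library's functional entry point, flash_attn_func; tensors must be fp16/bf16 and resident on a CUDA device, and the shapes below are illustrative:

```python
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 1024, 8, 64
q = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

# Exact attention computed blockwise, never materializing the full
# (seqlen x seqlen) score matrix in GPU memory.
out = flash_attn_func(q, k, v, causal=True)  # -> (batch, seqlen, nheads, headdim)
```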
NVIDIA/cuda-samples
Samples for CUDA developers demonstrating features in the CUDA Toolkit
AutoGPTQ/AutoGPTQ
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
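A hedged sketch of loading a pre-quantized GPTQ checkpoint with those APIs; the model id is an illustrative placeholder for any GPTQ checkpoint on the Hub:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "TheBloke/Llama-2-7B-GPTQ"  # placeholder: any GPTQ checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoGPTQForCausalLM.from_quantized(model_id, device="cuda:0")

# The wrapper exposes the usual transformers generate() interface.
inputs = tokenizer("Quantization lets us", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```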
DIYgod/RSSHub
🧡 Everything is RSSible
feeddd/feeds
Free RSS feeds for WeChat official accounts; supports extending to any app
flexflow/FlexFlow
FlexFlow Serve: Low-Latency, High-Performance LLM Serving
ray-project/ray
Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
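The core of that distributed runtime is the task/actor API; a minimal sketch, assuming a local machine (ray.init() starts a single-node cluster if none is running):

```python
import ray

ray.init()  # starts a local cluster when no address is given

@ray.remote
def square(x):
    return x * x

# Tasks run in parallel across workers; ray.get blocks on the futures.
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]
```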
vzhd1701/evernote-backup
Backup & export all Evernote notes and notebooks
krahets/hello-algo
Hello Algo (《Hello 算法》): a data structures and algorithms tutorial with animated illustrations and one-click runnable code. Supports Python, Java, C++, C, C#, JS, Go, Swift, Rust, Ruby, Kotlin, TS, and Dart. The Simplified and Traditional Chinese editions are updated in sync; an English version is in progress.
anyscale/llm-continuous-batching-benchmarks
ray-project/ray-llm
RayLLM - LLMs on Ray
bytedance/effective_transformer
Running BERT without Padding
mit-han-lab/llm-awq
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
tlc-pack/cutlass_fpA_intB_gemm
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
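A minimal offline-inference sketch with vLLM's LLM/SamplingParams API; the model name is an illustrative placeholder:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder: any HF causal LM
params = SamplingParams(temperature=0.8, max_tokens=64)

# Requests are continuously batched and the KV cache is paged internally.
for output in llm.generate(["The capital of France is"], params):
    print(output.outputs[0].text)
```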
guidance-ai/guidance
A guidance language for controlling large language models.
hkust-nlp/ceval
Official github repo for C-Eval, a Chinese evaluation suite for foundation models [NeurIPS 2023]
LazyVim/LazyVim
Neovim config for the lazy
run-llama/llama_index
LlamaIndex is a data framework for your LLM applications
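A minimal retrieval-augmented sketch of that data framework; recent versions import from llama_index.core (older ones from llama_index), and it assumes documents in ./data plus an OPENAI_API_KEY for the default LLM and embeddings:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load local files, embed them, and build an in-memory vector index.
documents = SimpleDirectoryReader("data").load_data()  # assumes ./data exists
index = VectorStoreIndex.from_documents(documents)

# Queries retrieve relevant chunks and hand them to the LLM as context.
query_engine = index.as_query_engine()
print(query_engine.query("What do these documents cover?"))
```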
godweiyang/NN-CUDA-Example
Several simple examples of calling custom CUDA operators from popular neural network toolkits.
tpoisonooo/llama.onnx
LLaMa/RWKV onnx models, quantization and testcase
ymcui/Chinese-LLaMA-Alpaca
Chinese LLaMA & Alpaca large language models with local CPU/GPU training and deployment
oobabooga/text-generation-webui
A Gradio web UI for Large Language Models.
triton-lang/triton
Development repository for the Triton language and compiler
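A minimal Triton kernel sketch, the usual elementwise-add starter: each program instance handles one BLOCK-sized slice of the vectors, with a mask guarding the tail:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n  # guard the final partial block
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)
assert torch.allclose(out, x + y)
```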
huggingface/text-generation-inference
Large Language Model Text Generation Inference
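A hedged client-side sketch: once a TGI server is running, it can be queried through huggingface_hub's InferenceClient; the localhost URL is an assumption about where the server is deployed:

```python
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")  # assumption: local TGI server
print(client.text_generation("What is continuous batching?", max_new_tokens=64))
```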