xiaoguoer's Stars
ml-explore/mlx
MLX: An array framework for Apple silicon
abetlen/llama-cpp-python
Python bindings for llama.cpp
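As a quick orientation, a minimal completion call might look like this (the GGUF model path is a placeholder; substitute your own file):

```python
# Minimal sketch: load a local GGUF model and run one completion.
from llama_cpp import Llama

llm = Llama(model_path="./model.gguf", n_ctx=2048)  # load model, set context window
out = llm("Q: Name the planets in the solar system. A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```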
SJTU-IPADS/PowerInfer
High-speed Large Language Model Serving on PCs with Consumer-grade GPUs
microsoft/DeepSpeedExamples
Example models using DeepSpeed
yangjianxin1/Firefly
Firefly: a training toolkit for large language models, supporting training of Qwen2.5, Qwen2, Yi1.5, Phi-3, Llama3, Gemma, MiniCPM, Yi, Deepseek, Orion, Xverse, Mixtral-8x7B, Zephyr, Mistral, Baichuan2, Llama2, Llama, Qwen, Baichuan, ChatGLM2, InternLM, Ziya2, Vicuna, Bloom, and other large models
TimDettmers/bitsandbytes
Accessible large language models via k-bit quantization for PyTorch.
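In practice bitsandbytes is most often used through the 🤗 Transformers integration; a minimal 4-bit loading sketch (the model id is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize linear layers to 4 bits
    bnb_4bit_quant_type="nf4",             # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.float16,  # run matmuls in fp16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config, device_map="auto"
)
```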
pytorch-labs/gpt-fast
Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.
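The core idea is a plain PyTorch decode loop made fast with torch.compile; a minimal greedy-decoding sketch (not gpt-fast's actual code; `model` is a hypothetical causal LM returning logits):

```python
import torch

@torch.no_grad()
def greedy_generate(model, tokens, max_new_tokens=32):
    # greedy decoding: append the argmax token one step at a time
    for _ in range(max_new_tokens):
        logits = model(tokens)                               # (batch, seq, vocab)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True) # most likely next token
        tokens = torch.cat([tokens, next_id], dim=-1)
    return tokens

# gpt-fast's speedups come largely from compiling the model into fused kernels:
# fast_model = torch.compile(model, mode="reduce-overhead")
```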
AutoGPTQ/AutoGPTQ
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
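A minimal quantize-and-save sketch following the README's flow (model id and calibration text are illustrative):

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)  # 4-bit, 128-column groups

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
examples = [tokenizer("AutoGPTQ quantizes models with the GPTQ algorithm.",
                      return_tensors="pt")]
model.quantize(examples)            # run GPTQ calibration on the example batch
model.save_quantized("opt-125m-4bit")
```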
AI4Finance-Foundation/ElegantRL
Massively Parallel Deep Reinforcement Learning. 🔥
turboderp/exllamav2
A fast inference library for running LLMs locally on modern consumer-class GPUs
turboderp/exllama
A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
huggingface/optimum
🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy-to-use hardware optimization tools
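A minimal sketch of Optimum's ONNX Runtime path (the model id is illustrative):

```python
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_id = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)  # export to ONNX on load
inputs = tokenizer("Hello,", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```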
FasterDecoding/Medusa
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
IST-DASLab/gptq
Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".
microsoft/DeepSpeed-MII
MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
casper-hansen/AutoAWQ
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.
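A minimal quantization sketch following the documented flow (model path and config values are illustrative):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "facebook/opt-125m"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.quantize(tokenizer, quant_config=quant_config)  # AWQ calibration + 4-bit packing
model.save_quantized("opt-125m-awq")
```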
intel/intel-extension-for-pytorch
A Python package that extends the official PyTorch for easy performance gains on Intel platforms
Tencent/TurboTransformers
A fast and user-friendly runtime for transformer inference (BERT, ALBERT, GPT-2, decoders, etc.) on CPU and GPU.
horseee/Awesome-Efficient-LLM
A curated list of work on efficient large language models
kuleshov-group/llmtools
Finetuning Large Language Models on One Consumer GPU in 2 Bits
feifeibear/LLMSpeculativeSampling
Fast inference from large language models via speculative decoding
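For intuition, a framework-agnostic sketch of the greedy variant (the real algorithm verifies drafts with rejection sampling over token probabilities, and a single batched target forward pass replaces the per-token calls below; `draft_next` and `target_next` are hypothetical callables mapping a token list to the next token id):

```python
def speculative_decode(target_next, draft_next, tokens, k=4, steps=32):
    """Draft k tokens with the cheap model, then verify them with the target."""
    for _ in range(steps):
        # 1) the draft model proposes k tokens autoregressively
        proposal, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) the target model verifies; keep the longest agreeing prefix
        accepted = 0
        for i in range(k):
            if target_next(tokens + proposal[:i]) == proposal[i]:
                accepted += 1
            else:
                break
        tokens = tokens + proposal[:accepted]
        # 3) the target contributes one token: the correction on a mismatch,
        #    or a bonus token when every drafted token was accepted
        tokens.append(target_next(tokens))
    return tokens
```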
bytedance/effective_transformer
Running BERT without Padding
intel/xFasterTransformer
neuralmagic/sparsezoo
Neural network model repository for highly sparse and sparse-quantized models with matching sparsification recipes
cli99/llm-analysis
Latency and Memory Analysis of Transformer Models for Training and Inference
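The flavor of estimate it automates, as a back-of-envelope sketch (all numbers are illustrative assumptions, not the library's API):

```python
params = 7e9            # 7B-parameter model
bytes_per_param = 2     # fp16 weights
mem_weights_gb = params * bytes_per_param / 1e9
print(f"weight memory: {mem_weights_gb:.1f} GB")           # ~14 GB

# decode is memory-bandwidth bound: each generated token reads all weights once
bandwidth_gbs = 900     # e.g. A100 HBM bandwidth, GB/s
tokens_per_s = bandwidth_gbs / mem_weights_gb
print(f"roofline decode speed: {tokens_per_s:.0f} tok/s")  # ~64 tok/s per sequence
```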
intel/neural-speed
An innovative library for efficient LLM inference via low-bit quantization
huggingface/optimum-benchmark
🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers, and Sentence-Transformers with full support for Optimum's hardware optimizations & quantization schemes.
inferflow/inferflow
Inferflow is an efficient and highly configurable inference engine for large language models (LLMs).
wejoncy/QLLM
A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ, with easy export to ONNX/ONNX Runtime.
FreedomIntelligence/FastLLM
A fast LLM training codebase with dynamic strategy selection [DeepSpeed + Megatron + FlashAttention + CUDA fusion kernels + compiler]